Puzzling performance/output behavior with rank-2 polymorphism in Haskell - performance

The below code (annotated inline with locations) gives a minimal example of the puzzling behavior I'm experiencing.
Essentially, why does (2) result in terrible space/time performance while (1) does not?
The below code is compiled and run as follows on ghc version 8.4.3:
ghc -prof -fprof-auto -rtsopts test.hs; ./test +RTS -p
{-# LANGUAGE Rank2Types #-}
import Debug.Trace
-- Not sure how to get rid of the record
data State = State {
-- (0) If vstate :: Float, the extra "hello"s go away
vstate :: forall a . (Fractional a) => a
}
step :: State -> State
step s =
-- (1) one "hello" per step
-- let vs = trace "hello" (vstate s) in
-- s { vstate = vs `seq` vstate s }
-- (2) increasing "hello"s per step
s { vstate = (trace "hello" (vstate s)) `seq` vstate s }
main :: IO ()
main = do
let initState = State { vstate = 0 }
-- (3) step 3 times
-- let res = step $ step $ step initState
-- print $ vstate res
-- (4) step 20 times to profile time/space performance
let res = iterate step initState
print $ vstate $ last $ take 20 res
print "done"
a. With (1) and (3) commented in, compiled without -O2, the code only outputs "hello" three times, as I expect it to.
b. With (2) and (3) commented in, compiled without -O2, the code outputs "hello" eight times. It seems to output one additional "hello" per step. I don't understand why this is happening.
c. With (1) and (4) commented in, compiled without -O2, the code runs extremely fast.
d. With (2) and (4) commented in, compiled without -O2, the code runs very slowly, and the performance report (included below) shows that makes many more calls to vstate and uses much more memory than variant c. I also don't understand why this is happening.
e. With (2) and (4) commented in, compiled with -O2, the code behaves the same as variant c. So clearly ghc is able to optimize away whatever pathological behavior is happening in variant d.
Here is the profiling report for variant c (fast):
Mon Aug 13 15:48 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 0.00 secs (0 ticks # 1000 us, 1 processor)
total alloc = 107,560 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
CAF GHC.IO.Handle.FD <entire-module> 0.0 32.3
CAF GHC.IO.Encoding <entire-module> 0.0 3.1
main Main partial.hs:(24,1)-(35,16) 0.0 13.4
main.res Main partial.hs:32:9-36 0.0 1.6
step Main partial.hs:(15,1)-(18,36) 0.0 1.1
step.vs Main partial.hs:17:9-37 0.0 46.1
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.6 0.0 100.0
CAF Main <entire-module> 227 0 0.0 0.1 0.0 52.2
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 2.7 0.0 52.1
vstate Main partial.hs:11:5-10 230 20 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 239 0 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(18,36) 235 0 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 1.6 0.0 49.4
step Main partial.hs:(15,1)-(18,36) 232 19 0.0 1.1 0.0 47.8
step.vs Main partial.hs:17:9-37 236 19 0.0 46.1 0.0 46.7
vstate Main partial.hs:11:5-10 237 190 0.0 0.0 0.0 0.6
main.initState Main partial.hs:25:9-40 238 0 0.0 0.6 0.0 0.6
CAF Debug.Trace <entire-module> 217 0 0.0 0.2 0.0 0.2
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.6 0.0 0.6
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 3.1 0.0 3.1
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.2 0.0 0.2
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 32.3 0.0 32.3
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.1 0.0 0.1
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 10.7 0.0 10.7
Here is the profiling report for variant d (slow; without -O2):
Mon Aug 13 15:25 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 1.48 secs (1480 ticks # 1000 us, 1 processor)
total alloc = 1,384,174,472 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
step Main partial.hs:(15,1)-(21,60) 95.7 98.8
main.initState Main partial.hs:25:9-40 3.0 1.2
vstate Main partial.hs:11:5-10 1.4 0.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.0 100.0 100.0
CAF Main <entire-module> 227 0 0.0 0.0 100.0 100.0
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 0.0 100.0 100.0
vstate Main partial.hs:11:5-10 230 1048575 1.4 0.0 100.0 100.0
main.initState Main partial.hs:25:9-40 236 0 3.0 1.2 3.0 1.2
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 95.7 98.8
step Main partial.hs:(15,1)-(21,60) 235 0 95.7 98.8 95.7 98.8
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(21,60) 232 19 0.0 0.0 0.0 0.0
CAF Debug.Trace <entire-module> 217 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.0 0.0 0.0
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 0.0 0.0 0.0
Here are some notes/guesses/questions on why this is happening:
What's the relationship of the degrading performance to the increasing "hello" count? The pathological version seems to output one more "hello" with each additional step. Why?
I'm aware that polymorphism in Haskell is slow, as detailed in this StackOverflow question. That might be part of the problem, since the pathological behavior goes away when vstate is monomorphized as vstate :: Float. But I don't see why the lack of a let-binding, as in location (2), would cause such bad time/space performance.
This is the minimal version of a performance bug in a larger codebase, which we fixed by "monomorphizing" floating-type numbers using realToFrac (the commit is here in case anyone is curious). I know that compiling with -O2 fixes the behavior in the minimal example, but I tried it in the larger codebase and it does not fix the performance issues. (The reason we need rank-2 polymorphism in the larger codebase is to use the ad library for autodiff.) Is there a more principled fix than using realToFrac, such as an inline specialization that can be applied?

forall a . (Fractional a) => a is a function type.
It has two arguments, a type (a :: *) and an instance with type Fractional a. Whenever you see =>, it's a function operationally, and compiles to a function in GHC's core representation, and sometimes stays a function in machine code as well. The main difference between -> and => is that arguments for the latter cannot be explicitly given by programmers, and they are always filled in implicitly by instance resolution.
Let's see the fast step first:
step :: State -> State
step (State f) =
let vs = trace "hello" f
in State (vs `seq` f)
Here, vs has an undetermined Fractional type which defaults to Double. If you turn on -Wtype-defaults warning, GHC will point this out to you. Since vs :: Double, it's just a numeric value, which is captured by the returned closure. Right, vs `seq` f is a function, since it has a functional type forall a . (Fractional a) => a, and it's desugared to an actual lambda expression by GHC. This lambda abstracts over the two arguments, captures vs as free variable, and passes along both arguments to f.
So, each step creates a new function closure which captures a vs :: Double. If we call step three times, we get three closures with three Doubles inside, each closure referring to the previous one. Then, when we write vstate (step $ step $ step initState), we again default to Double, and GHC calls this closure with the Fractional Double instance. All the vs-es call previous closures with Fractional Double, but every vs is evaluated only once, because they are regular lazy Double values which are not recomputed.
However, if we enable NoMonomorphismRestriction, vs is generalized to forall a. Fractional a => a, so it becomes a function as well, and its calls are not memoized anymore. Hence, in this case the fast version behaves the same as the slow version.
Now, the slow step:
step :: State -> State
step (State f) = State ((trace "hello" f) `seq` f)
This has exponential number of calls in the number of steps, because step f calls f twice, and without optimizations no computation is shared, because both calls occur under a lambda. In (trace "hello" f) `seq` f, the first call to f defaults to Fractional Double, and the second call just passes along the implicit Fractional a instance as before.
If we switch on optimization, GHC observes that the first f call does not depend on the function parameters, and floats out trace "hello" f to a let-binding, resulting in pretty much the same code as in the fast version.

Related

Julia 2d Matrix set value at index (i, j)

In Julia I created a Matrix A and want to set its value at (1, 1) to the value of 5. How do I do that? I tried this:
A = zeros(5, 5)
A[1][1] = 5
It throws the error
MethodError: no method matching setindex!(::Float64, ::Int64, ::Int64)
Stacktrace:
[1] top-level scope
# In[38]:4
[2] eval
# .\boot.jl:373 [inlined]
[3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
# Base .\loading.jl:1196
DNF's comment already gives you the answer, but I'll elaborate a little.
The basic issue with what you're doing is that A[1] is a valid indexing operation returning the first element of A, and therefore A[1][1] is trying to index into that first element, which is just a number. To see:
julia> A[1]
0.0
julia> A[1][1]
0.0
Essentially what you are doing is equivalent to this:
julia> x = 1
1
julia> x[1] = 2
ERROR: MethodError: no method matching setindex!(::Int64, ::Int64, ::Int64)
Stacktrace:
[1] top-level scope
# REPL[12]:1
numbers in Julia are immutable, i.e. you cannot "replace" a number with another number in place. Arrays on the other hand are mutable:
julia> y = [1]
1-element Vector{Int64}:
1
julia> y[1] = 2
2
So you could also have done:
julia> A[1] = 5
5
julia> A
5×5 Matrix{Float64}:
5.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
Why does A[1] work? A single index will give you the element you arrive at when you count down row-wise, continuing at the next column when you get to the bottom:
julia> z = rand(2, 2)
2×2 Matrix{Float64}:
0.727719 0.983778
0.792305 0.0408728
julia> z[3]
0.9837776491559934

Given transition matrix for Markov chain of 5 states, find first passage time and recurrence time

Transition matrix for a Markov chain:
0.5 0.3 0.0 0.0 0.2
0.0 0.5 0.0 0.0 0.5
0.0 0.4 0.4 0.2 0.0
0.3 0.0 0.2 0.0 0.5
0.5 0.2 0.0 0.0 0.3
This is a transition matrix with states {1,2,3,4,5}. States {1,2,5} are recurrent and states {3,4} are transient. How can I (without using the fundamental matrix trick):
Compute the expected number of steps needed to first return to state 1, conditioned on starting in state 1
Compute the expected number of steps needed to first reach any of the states {1,2,5}, conditioned on starting in state 3.
If you don't want to use the fundamental matrix, you can do two things:
Create a function that simulates the Markov chain until the stopping condition is met and that returns the number of steps. Take the average over a large number of runs to get the expectation.
Introduce dummy absorbing states in your transition matrix and repeatedly calculate p = Pp where p is a vector with 1 in the index of starting state and 0 elsewhere. With some accounting you can get the expected values that you want.

Julia AffineTransforms sign of rotation angle

I am using AffineTransforms to rotate a volume. I am confused now by the sign of the rotation angle. For a right-hand system, when looking down an axis, say Z axis, rotating the XY plane counter-clockwise should be positive angles. I define a rotation matrix r = [0.0 -1. 0.0; 1.0 0.0 0.0; 0.0 0.0 1.0], which is to rotate along the Z axis 90 degree counter-clockwise. Indeed, r * [1 0 0]' gives [0 1 0]', which rotates X axis to Y axis.
Now I define a volume v.
3×3×3 Array{Float64,3}:
[:, :, 1] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
[:, :, 2] =
0.0 0.0 0.0
1.0 0.0 0.0
0.0 0.0 0.0
[:, :, 3] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
then I define tfm = AffineTransform(r, vec([0 0 0]))) which is the same as tfm = tformrotate(vec([0 0 1]), π/2).
then transform(v, tfm). The rotation center is the input array center. I got
3×3×3 Array{Float64,3}:
[:, :, 1] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
[:, :, 2] =
0.0 1.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
[:, :, 3] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
This is surprising to me because the output is the 90 degree rotation along Z axis but clockwise. It seems to me that this is actually a -90 degree rotation. Could somebody point out what I did wrong? Thanks.
Admittedly, this confused me too. Had to read the help for transform and TransformedArray again.
First, the print order of arrays is a bit confusing, with the first index shown in columns, but it is the X-axis, as the dimensions of v are x,y,z in this order.
In the original v, we have v[2,1,2] == 1.0. But, by default, transform uses the center of the array as origin, so 2,1,2 is relative to center (0,-1,0) i.e. a unit vector in the negative y-axis direction.
The array returned by transform has values which are evaluated at x,y,z by giving the value of the original v at tfm((x,y,z)) (see ?TransformedArray).
Specifically, we have transform(v,tfm)[1,2,2] is v[tfm((-1,0,0))] which is v[(0,-1,0)] (because rotating (-1,0,0) counterclockwise is (0,-1,0)) which is v[2,1,2] in the uncentered v indices. Finally, v[2,1,2] == 1.0 as was in the output in the question.
Coordinate transformation are always tricky, and it is easy to confuse transformations and their inverse.
Hope this helps.

Debugging performance bottlenecks for longest common subsequence algorithm

I am writing a longest common subsequence algorithm in Haskell using vector library and state monad (to encapsulate the very imperative and mutable nature of Miller O(NP) algorithm). I have already written it in C for some project I needed it for, and am now writing it in Haskell as a way to explore how to write those imperative grid-walk algorithms with good performance that match C. The version I wrote with unboxed vectors is about 4-times slower than C version for same inputs (and compiled with right optimization flags - I used both system clock time and Criterion methods to validate the relative time measurements between Haskell and C versions, and same data types, both large and tiny inputs). I have been trying to figure out where the performance issues might be, and will appreciate feedback - it is possible there is some well-known performance issue I might have run into here, especially in vector library which I am heavily using here.
In my code, I have one function called gridWalk that is called most often, and also, does most of the work. The performance slowdown is likely to be there, but I can't figure out what it could be. Complete Haskell code is here. Snippets of the code below:
import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when)
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int
type MVI1 s = MVector (PrimState (ST s)) Int
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
where
n = U.length a
m = U.length b
go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
| otherwise = len
-- function to find previous y on diagonal k for furthest point
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
let k0 = k+offset-1
k1 = k+offset+1
y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
y1 <- MU.unsafeRead fp k1
if y0 > y1 then return (k0,y0)
else return (k1,y1)
{-#INLINE findYP #-}
gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
let !offset = 1+U.length a
(!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset
let xp = yp-k
len = {-#SCC cmp #-} cmp a b xp yp
x = xp+len
y = yp+len
{-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y
return ()
{-#INLINE gridWalk #-}
-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}
I added some cost center annotations, and ran profiling with a certain LCS input for testing. Here is what I get:
total time = 2.39 secs (2394 ticks # 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
gridWalk Main 67.5 52.7
findSnakes Main 23.2 27.8
cmp Main 4.2 0.0
findYP Main 3.5 19.4
updateFP Main 1.6 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 64 0 0.0 0.0 100.0 100.0
main Main 129 0 0.0 0.0 0.0 0.0
CAF Main 127 0 0.0 0.0 100.0 100.0
findSnakes Main 141 0 0.0 0.0 0.0 0.0
main Main 128 1 0.0 0.0 100.0 100.0
findSnakes Main 138 0 0.0 0.0 0.0 0.0
gridWalk Main 139 0 0.0 0.0 0.0 0.0
cmp Main 140 0 0.0 0.0 0.0 0.0
while Main 132 4001 0.1 0.0 100.0 100.0
findSnakes Main 133 12000 23.2 27.8 99.9 99.9
gridWalk Main 134 16004000 67.5 52.7 76.7 72.2
cmp Main 137 16004000 4.2 0.0 4.2 0.0
updateFP Main 136 16004000 1.6 0.0 1.6 0.0
findYP Main 135 16004000 3.5 19.4 3.5 19.4
newVI1 Main 130 1 0.0 0.0 0.0 0.0
newVI1.\ Main 131 8004 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 112 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 104 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 102 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 95 0 0.0 0.0 0.0 0.0
If I am interpreting the profiling output correctly (and assuming there are not too much distortions due to profiling), gridWalk takes most of the time, but the main functions cmp and findYP that do the heavy lifting in gridWalk, seem to take very little time in profiling report. So, perhaps the bottleneck is in forM_ wrapper that findSnakes function uses to call gridWalk? The heap profile too seems very clean:
Reading the core, nothing really jumps out. I thought some values in inner loops might be boxed, but I don't spot them in the core. I hope the performance issue is due to something simple I have missed.
Update
Following suggestion of #DanielFischer, I replaced forM_ of Data.Vector.Unboxed with that of Control.Monad in findSnakes function which improved the performance from 4x to 2.5x of C version. Haskell and C versions are now posted here if you want to try them out.
I am still digging through the core to see where the bottlenecks are. gridWalk is most frequently called function, and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code. I suspect that in assembly, this is not the case for whileM_ loop, but since I am not very knowledgeable about translating core, and locating name-mangled GHC functions in assembly, I guess it is just a matter of patiently plugging away at the problem until I figure it out. Meanwhile if there are any pointers on performance fixes, they will be appreciated.
Another possibility that I can think of is overhead of heap checks during function calls. As seen in profiling report, gridWalk is called 16004000 times. Assuming 6 cycles for heap check (I am guessing it is less, but still let us assume that), it is ~0.02 seconds on a 3.33GHz box for 96024000 cycles.
Also, some performance numbers:
Haskell code (GHC 7.6.1 x86_64): It was ~0.25s before forM_ fix.
time ./T
1
real 0m0.150s
user 0m0.145s
sys 0m0.003s
C code (gcc 4.7.2 x86_64):
time ./test
1
real 0m0.065s
user 0m0.063s
sys 0m0.000s
Update 2:
Updated code is here. Using STUArray doesn't change the numbers either. The performance is about 1.5x on Mac OS X (x86_64,ghc7.6.1), pretty similar to what #DanielFischer reported on Linux.
Haskell code:
$ time ./Diff
1
real 0m0.087s
user 0m0.084s
sys 0m0.003s
C code:
$ time ./test
1
real 0m0.056s
user 0m0.053s
sys 0m0.002s
Glancing at the cmm, the call is tail-recursive, and is turned into a loop by llvm. But each new iteration seems to allocate new values which invokes heap check too, and so, might explain the difference in performance. I have to think about how to write the tail recursion in such a way such that no values are allocated across iterations, avoiding heap-check and allocation overhead.
You take a huge hit at
U.forM_ (U.fromList [0..ct-1])
in findSnakes. I'm convinced that isn't supposed to happen (ticket?), but that allocates a new Vector to traverse every time findSnakes is called. If you use
Control.Monad.forM_ [0 .. ct-1]
instead, the running time roughly halves, and the allocation drops by a factor of about 500 here. (GHC optimises C.M.forM_ [0 :: Int .. limit] well, the list is eliminated, and what remains is basically a loop.) You can do slightly better by writing the loop yourself.
Some things that cause gratuitous allocation/code size bloat without hurting performance much are
the unused Bool argument of lcsh
the cmp argument to findSnakes and gridWalk; if these are never called with a different comparison than the top-level cmp, that argument leads to unnecessary code duplication.
the general type of while; specialising it to the used type ST s Bool -> ST s () -> ST s () reduces allocation (much), and also running time (slightly, but noticeably, here).
A general word about profiling: Compiling a programme for profiling inhibits many optimisations. In particular for libraries like vector, bytestring or text that make heavy use of fusion, profiling often produces misleading results.
For example, your original code produces here
total time = 3.42 secs (3415 ticks # 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.7 52.7 2176 2432608000
findSnakes Main 20.0 27.8 682 1281440080
cmp Main 9.2 0.0 313 16
findYP Main 4.2 19.4 144 896224000
updateFP Main 2.7 0.0 91 0
Just adding a bang on the binding of len in gridWalk changes nothing at all in the non-profiling version, but for the profiling version
total time = 2.98 secs (2985 ticks # 1000 us, 1 processor)
total alloc = 3,204,404,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.0 32.0 1881 1024256000
findSnakes Main 22.2 40.0 663 1281440080
cmp Main 7.2 0.0 214 16
findYP Main 4.7 28.0 140 896224000
updateFP Main 2.7 0.0 82 0
it makes a lot of difference. For the version including the changes mentioned above (and the bang on len in gridWalk), the profiling version says
total alloc = 1,923,412,776 bytes (excludes profiling overheads)
but the non-profiling version
1,814,424 bytes allocated in the heap
10,808 bytes copied during GC
49,064 bytes maximum residency (2 sample(s))
25,912 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.12s ( 0.12s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.12s ( 0.12s elapsed)
says it allocated 1000-fold less than the profiling version.
For vector and friends code, more reliable for identifying bottlenecks than profiling (unfortunately also much much more time-consuming and difficult) is studying the generated core (or assembly, if you are proficient in reading that).
Concerning the update, the C runs a little slower on my box (gcc-4.7.2, -O3)
$ time ./miltest1
real 0m0.074s
user 0m0.073s
sys 0m0.001s
but the Haskell about the same
$ time ./hsmiller
1
real 0m0.151s
user 0m0.149s
sys 0m0.001s
That is a little faster when compiling via the LLVM backend:
$ time ./hsmiller1
real 0m0.131s
user 0m0.129s
sys 0m0.001s
And when we replace the forM_ with a manual loop,
findSnakes a b fp !k !ct op = go 0
where
go x
| x < ct = gridWalk a b fp (op k x) >> go (x+1)
| otherwise = return ()
it gets a bit faster,
$ time ./hsmiller
1
real 0m0.124s
user 0m0.121s
sys 0m0.002s
resp. via LLVM:
$ time ./hsmiller
1
real 0m0.108s
user 0m0.107s
sys 0m0.000s
By and large, the generated core looks fine, one small annoyance was
Main.$wa
:: forall s.
GHC.Prim.Int#
-> GHC.Types.Int
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s, Main.MVI1 s #)
and a slightly roundabout implementation. That is fixed by making newVI1 strict in its second argument,
newVI1 n !x = do
Since that isn't called often, the effect on performance is of course negligible.
The meat is the core for lcsh, and that doesn't look too bad. The only boxed things in that are the Ints read from /written to the STRef, and that is inevitable. What's not so pleasant is that the core contains a lot of code duplication, but in my experience, that rarely is a real performance problem, and not all duplicated code survives the code generation.
and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code.
You get an inner loop when you add an INLINE pragma to whileM_, but that loop is not nice, and, in this case it is much slower than having the whileM_ out-of-line (I'm not sure whether it's solely due to code size, but it could be).

Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
xs <- sequence (replicate n dist)
return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?
A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought down the GC times significantly to almost or actually negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing, in the grand scheme, AI.UTIl.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list, and mapping (2^) over another list is inefficient, better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
go 1.0 (M.fromList fixed) (bnVars bn)
where
go w assignment [] = return (assignment, w)
go w assignment (v:vs) = if v `elem` vars
then
let w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let p = bnProb bn assignment (v,True)
x <- probabilityIO p
go w (M.insert v x assignment) vs
vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, thus they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force updates to take effect immediately, also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
then
let !w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let !p = bnProb bn assignment (v,True)
x <- probabilityIO p
let !assignment' = M.insert v x assignment
go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~%13%), but the total memory usage and maximum residence haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return $! x `seq` w `seq` M.adjust (+w) x m
In the last line, first, w is already evaluated in weightedSample now, so we don't need to seq it here, the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the updated function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present, insertWith':
func !m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.
I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.
I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.

Resources