What is the correct way to perform constant-space nested loops in Haskell? - performance

There are two obvious, "idiomatic" ways to perform nested loops in Haskell: using the list monad or using forM_ to replace traditional fors. I've set a benchmark to determine if those are compiled to tight loops:
import Control.Monad.Loop
import Control.Monad.Primitive
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Unboxed as V
times = 100000
side = 100
-- Using `forM_` to replace traditional fors
test_a mvec =
forM_ [0..times-1] $ \ n -> do
forM_ [0..side-1] $ \ y -> do
forM_ [0..side-1] $ \ x -> do
MV.write mvec (y*side+x) 1
-- Using the list monad to replace traditional forms
test_b mvec = sequence_ $ do
n <- [0..times-1]
y <- [0..side-1]
x <- [0..side-1]
return $ MV.write mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
-- test_a mvec
-- test_b mvec
vec' <- V.unsafeFreeze mvec :: IO (V.Vector Int)
print $ V.sum vec'
This test creates a 100x100 vector, writes 1 to each index using nested loop and repeats that 100k times. Compiling those with just ghc -O2 test.hs -o test (ghc version 7.8.4), the results are: 3.853s for the forM_ version and 10.460s for the list monad. In order to provide a reference, I also programmed this test in JavaScript:
var side = 100;
var times = 100000;
var vec = [];
for (var i=0; i<side*side; ++i)
vec.push(0);
for (var n=0; n<times; ++n)
for (var y=0; y<side; ++y)
for (var x=0; x<side; ++x)
vec[x+y*side] = 1;
var s = 0;
for (var i=0; i<side*side; ++i)
s += vec[i];
console.log(s);
This equivalent JavaScript program takes 1s to complete, beating Haskell's unboxed vectors, which is unusual, suggesting that Haskell is not running the loop in constant space, but doing allocations instead. I've then found a library that claims to provide type-guaranteed tight loops Control.Monad.Loop:
-- Using `for` from Control.Monad.Loop
test_c mvec = exec_ $ do
n <- for 0 (< times) (+ 1)
x <- for 0 (< side) (+ 1)
y <- for 0 (< side) (+ 1)
liftIO (MV.write mvec (y*side+x) 1)
Which runs in 1s. That library isn't very used and far from idiomatic, though, so, what is the idiomatic way to get fast constant-space bidimensional computations? (Note this isn't a case for REPA as I want to perform arbitrary IO actions on the grid.)

Writing tight mutating code with GHC can be tricky sometimes. I'm going to write about a couple of different things, probably in a manner that is more rambling and tl;dr than I would prefer.
For starters, we should use GHC 7.10 in any case, since otherwise the forM_ and list monad solutions never fuse.
Also, I replaced MV.write with MV.unsafeWrite, partly because it's faster, but more importantly it reduces some of the clutter in the resultant Core. From now on runtime statistics refer to code with unsafeWrite.
The dreaded let floating
Even with GHC 7.10, we should first notice all those [0..times-1] and [0..side-1] expressions, because they will ruin performance every time if we don't take necessary steps. The issue is that they are constant ranges, and -ffull-laziness (which is enabled by default on -O) floats them out to top level. This prevents list fusion, and iterating over an Int# range is cheaper than iterating over a list of boxed Int-s anyway, so it's a really bad optimization.
Let's see some runtimes in seconds for the unchanged (aside from using unsafeWrite) code. ghc -O2 -fllvm is used, and I use +RTS -s for timing.
test_a: 1.6
test_b: 6.2
test_c: 0.6
For GHC Core viewing I used ghc -O2 -ddump-simpl -dsuppress-all -dno-suppress-type-signatures.
In the case of test_a, the [0..99] ranges are lifted out:
main4 :: [Int]
main4 = eftInt 0 99 -- means "enumFromTo" for Int.
although the outermost [0..9999] loop is fused into a tail-recursive helper:
letrec {
a3_s7xL :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7xL =
\ (x_X5zl :: Int#) (s1_X4QY :: State# RealWorld) ->
case a2_s7xF 0 s1_X4QY of _ { (# ipv2_a4NA, ipv3_a4NB #) ->
case x_X5zl of wild_X1S {
__DEFAULT -> a3_s7xL (+# wild_X1S 1) ipv2_a4NA;
99999 -> (# ipv2_a4NA, () #)
}
}; }
In the case of test_b, again only the [0..99] are lifted. However, test_b is much slower, because it has to build and sequence actual [IO ()] lists. At least GHC is sensible enough to only build a single [IO ()] for the two inner loops, and then perform sequencing it 10000 times.
let {
lvl7_s4M5 :: [IO ()]
lvl7_s4M5 = -- omitted
letrec {
a2_s7Av :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Av =
\ (x_a5xi :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Au
:: [IO ()] -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Au =
\ (ds_a4Nu :: [IO ()]) (eta1_X1c :: State# RealWorld) ->
case ds_a4Nu of _ {
[] ->
case x_a5xi of wild1_X1y {
__DEFAULT -> a2_s7Av (+# wild1_X1y 1) eta1_X1c;
99999 -> (# eta1_X1c, () #)
};
: y_a4Nz ys_a4NA ->
case (y_a4Nz `cast` ...) eta1_X1c
of _ { (# ipv2_a4Nf, ipv3_a4Ng #) ->
a3_s7Au ys_a4NA ipv2_a4Nf
}
}; } in
a3_s7Au lvl7_s4M5 eta_B1; } in
-- omitted
How can we remedy this? We could nuke the problem with {-# OPTIONS_GHC -fno-full-laziness #-}. This indeed helps a lot in our case:
test_a: 0.5
test_b: 0.48
test_c: 0.5
Alternatively, we could fiddle around with INLINE pragmas. Apparently inlining functions after the let floating is done preserves good performance. I found that GHC inlines our test functions even without a pragma, but an explicit pragma causes it to inline only after let floating. For example, this results in good performance without -fno-full-laziness:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE test_a #-}
But inlining too early results in poor performance:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE [~2] test_a #-} -- "inline before the first phase please"
The problem with this INLINE solution is that it's rather fragile in the face of GHC's floating onslaught. For example, manual inlining does not preserve performance. The following code is slow because similarly to INLINE [~2] it gives GHC a chance to float out:
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
So what should we do?
First, I think using -fno-full-laziness is a perfectly viable and even preferable option for those who'd like to write high performance code and have a good idea what they are doing. For example, it's used in unordered-containers. With it we have more precise control over sharing, and we can always just float out or inline manually.
For more regular code, I believe there's nothing wrong with using Control.Monad.Loop or any other package that provides the functionality. Many Haskell users are not scrupulous about depending on small "fringe" libraries. We can also just reimplement for, in a desired generality. For instance, the following performs just as well as the other solutions:
for :: Monad m => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()
for init while step body = go init where
go !i | while i = body i >> go (step i)
go i = return ()
{-# INLINE for #-}
Looping in really constant space
I was at first very puzzled by the +RTS -s data on heap allocation. test_a allocated non-trivially with -fno-full-laziness, and also test_c without full laziness, and these allocations scaled linearly with the number of times iterations, but test_b with full laziness allocated only for the vector:
-- with -fno-full-laziness, no INLINE pragmas
test_a: 242,521,008 bytes
test_b: 121,008 bytes
test_c: 121,008 bytes -- but 240,120,984 with full laziness!
Also, INLINE pragmas for test_c did not help at all in this case.
I spent some time trying to find signs of heap allocation in the Core for the relevant programs, without success, until the realization struck me: GHC stack frames are on the heap, including the frames of the main thread, and the functions that were doing heap allocation were essentially running the thrice-nested loops in at most three stack frames. The heap allocation registered by +RTS -s is just the constant popping and pushing of stack frames.
This is pretty much apparent from the Core for the following code:
{-# OPTIONS_GHC -fno-full-laziness #-}
-- ...
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_a mvec
Which I'm including here in its glory. Feel free to skip.
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5HK :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vr { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vr) of _ {
False ->
case newByteArray# 80000 (s_a5HK `cast` ...)
of _ { (# ipv_a5fv, ipv1_a5fw #) ->
letrec {
$s$wa_s8jS
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8jS =
\ (sc_s8jO :: Int#)
(sc1_s8jP :: Int#)
(sc2_s8jR :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jP 10000) of _ {
False -> (# sc2_s8jR, I# sc_s8jO #);
True ->
case writeIntArray# ipv1_a5fw sc_s8jO 0 (sc2_s8jR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
$s$wa_s8jS (+# sc_s8jO 1) (+# sc1_s8jP 1) (s'#_a5Gn `cast` ...)
}
}; } in
case $s$wa_s8jS 0 0 (ipv_a5fv `cast` ...)
-- end of vector creation -------------------
of _ { (# ipv6_a4Hv, ipv7_a4Hw #) ->
letrec {
a2_s7MJ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7MJ =
\ (x_a5Ho :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7ME :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7ME =
\ (x1_X5Id :: Int#) (eta1_XR :: State# RealWorld) ->
case ipv7_a4Hw of _ { I# dt4_a5x6 ->
case writeIntArray#
(ipv1_a5fw `cast` ...) (*# x1_X5Id 100) 1 (eta1_XR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
letrec {
a4_s7Mz :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7Mz =
\ (x2_X5J8 :: Int#) (eta2_X1U :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5fw `cast` ...)
(+# (*# x1_X5Id 100) x2_X5J8)
1
(eta2_X1U `cast` ...)
of s'#1_X5Hf { __DEFAULT ->
case x2_X5J8 of wild_X2o {
__DEFAULT -> a4_s7Mz (+# wild_X2o 1) (s'#1_X5Hf `cast` ...);
99 -> (# s'#1_X5Hf `cast` ..., () #)
}
}; } in
case a4_s7Mz 1 (s'#_a5Gn `cast` ...)
of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x1_X5Id of wild_X1e {
__DEFAULT -> a3_s7ME (+# wild_X1e 1) ipv2_a4QH;
99 -> (# ipv2_a4QH, () #)
}
}
}
}; } in
case a3_s7ME 0 eta_B1 of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x_a5Ho of wild_X1a {
__DEFAULT -> a2_s7MJ (+# wild_X1a 1) ipv2_a4QH;
99999 -> (# ipv2_a4QH, () #)
}
}; } in
a2_s7MJ 0 (ipv6_a4Hv `cast` ...)
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wm, ww6_a5wn #) ->
: ww5_a5wm ww6_a5wn
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
We can also nicely demonstrate the allocation of frames the following way. Let's change test_a:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-50] $ \ x -> -- change here
MV.unsafeWrite mvec (y*side+x) 1
Now the heap allocation stays exactly the same, because the innermost loop is tail-recursive and uses a single frame. With the following change, the heap allocation halves (to 124,921,008 bytes), because we push and pop half as many frames:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-50] $ \ y -> -- change here
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
test_b and test_c (with no full laziness) instead compile to code that uses a nested case construct inside a single stack frame, and walks over the indices to see which one should be incremented. See the Core for the following main:
{-# LANGUAGE BangPatterns #-} -- later I'll talk about this
{-# OPTIONS_GHC -fno-full-laziness #-}
main = do
let vec = V.generate (side*side) (const 0)
!mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_c mvec
Voila:
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5Iw :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vT { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vT) of _ {
False ->
case newByteArray# 80000 (s_a5Iw `cast` ...)
of _ { (# ipv_a5g3, ipv1_a5g4 #) ->
letrec {
$s$wa_s8ji
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8ji =
\ (sc_s8je :: Int#)
(sc1_s8jf :: Int#)
(sc2_s8jh :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jf 10000) of _ {
False -> (# sc2_s8jh, I# sc_s8je #);
True ->
case writeIntArray# ipv1_a5g4 sc_s8je 0 (sc2_s8jh `cast` ...)
of s'#_a5GP { __DEFAULT ->
$s$wa_s8ji (+# sc_s8je 1) (+# sc1_s8jf 1) (s'#_a5GP `cast` ...)
}
}; } in
case $s$wa_s8ji 0 0 (ipv_a5g3 `cast` ...)
of _ { (# ipv6_a4MX, ipv7_a4MY #) ->
case ipv7_a4MY of _ { I# dt4_a5xy ->
-- end of vector creation
letrec {
a2_s7Q6 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Q6 =
\ (x_a5HT :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Q5 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Q5 =
\ (x1_X5J9 :: Int#) (eta1_XP :: State# RealWorld) ->
letrec {
a4_s7MZ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7MZ =
\ (x2_X5Jl :: Int#) (s1_X4Xb :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5g4 `cast` ...)
(+# (*# x1_X5J9 100) x2_X5Jl)
1
(s1_X4Xb `cast` ...)
of s'#_a5GP { __DEFAULT ->
-- the interesting part! ------------------
case x2_X5Jl of wild_X1y {
__DEFAULT -> a4_s7MZ (+# wild_X1y 1) (s'#_a5GP `cast` ...);
99 ->
case x1_X5J9 of wild1_X1o {
__DEFAULT -> a3_s7Q5 (+# wild1_X1o 1) (s'#_a5GP `cast` ...);
99 ->
case x_a5HT of wild2_X1c {
__DEFAULT -> a2_s7Q6 (+# wild2_X1c 1) (s'#_a5GP `cast` ...);
99999 -> (# s'#_a5GP `cast` ..., () #)
}
}
}
}; } in
a4_s7MZ 0 eta1_XP; } in
a3_s7Q5 0 eta_B1; } in
a2_s7Q6 0 (ipv6_a4MX `cast` ...)
}
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wO, ww6_a5wP #) ->
: ww5_a5wO ww6_a5wP
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
I have to admit that I basically don't know why some code avoids stack frame creation and some doesn't. I suspect that inlining from "the inside" out helps, and a quick inspection informed me that Control.Monad.Loop uses a CPS encoding, which might be relevant here, although the Monad.Loop solution is sensitive to let floating, and I couldn't determine on short notice from the Core why test_c with let floating fails to run in a single stack frame.
Now, the performance benefit of running in a single stack frame is small. We've seen that test_b is only slightly faster than test_a. I include this detour in the answer because I found it edifying.
The state hack and strict bindings
The so-called state hack makes GHC aggressive in inlining into IO and ST actions. I think I should mention it here, because besides let floating this is the other thing that can thoroughly ruin performance.
The state hack is enabled with optimizations -O, and can possibly slow down programs asymptotically. A simple example from Reid Barton:
import Control.Monad
import Debug.Trace
expensive :: String -> String
expensive x = trace "$$$" x
main :: IO ()
main = do
str <- fmap expensive getLine
replicateM_ 3 $ print str
With GHC-7.10.2, this prints "$$$" once without optimizations but three times with -O2. And it seems that with GHC-7.10, we can't get rid of this behavior with -fno-state-hack (which is the subject of the linked ticket from Reid Barton).
Strict monadic bindings reliably get rid of this problem:
main :: IO ()
main = do
!str <- fmap expensive getLine
replicateM_ 3 $ print str
I think it's good habit to do strict bindings in IO and ST. And I have some experience (not definitive though; I'm far from being a GHC expert) that strict bindings are especially needed if we use -fno-full-laziness. Apparently full laziness can help get rid of some of the work duplication introduced by the inlining caused by the state hack; with test_b and no full laziness, omitting the strict binding on !mvec <- V.unsafeThaw vec caused a slight slowdown and extremely ugly Core output.

In my experience forM_ [0..n-1] can perform well, but unfortunately it's not reliable. Just adding an INLINE pragma to test_a and using -O2 makes it run much faster (4s to 1s for me), but manually inlining it (copy paste) slows it down again.
A more reliable function is is for from statistics which is implemented as
-- | Simple for loop. Counts from /start/ to /end/-1.
for :: Monad m => Int -> Int -> (Int -> m ()) -> m ()
for n0 !n f = loop n0
where
loop i | i == n = return ()
| otherwise = f i >> loop (i+1)
{-# INLINE for #-}
Using it looks similar to forM_ with lists:
test_d :: MV.IOVector Int -> IO ()
test_d mv =
for 0 times $ \_ ->
for 0 side $ \i ->
for 0 side $ \j ->
MV.unsafeWrite mv (i*side + j) 1
but performs reliably well (0.85s for me) without any risk of allocating a list.

Related

Implicit pattern matching in Haskell

Should one expect performance differences between these two emptys, or is it merely a matter of stylistic preference?
foo list = case list of
[] -> True
(_ : _) -> False
bar list = case list of
(_ : _) -> False
_ -> True
In general you should not expect performance to change predictably between trivial fiddling around with patterns like what you're asking about, and can often expect the generated code to be identical.
But the way to actually check is to look at core and or benchmark with criterion. In this case the generated code is the same, and indeed GHC seems to actually combine them:
I compiled the snippet above with
ghc -Wall -O2 -ddump-to-file -ddump-simpl -dsuppress-module-prefixes -dsuppress-uniques -fforce-recomp YourCode.hs
And we see this core:
foo :: forall t. [t] -> Bool
[GblId,
Arity=1,
Caf=NoCafRefs,
Str=DmdType <S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (# t) (list [Occ=Once!] :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: _ [Occ=Dead] _ [Occ=Dead] -> False
}}]
foo =
\ (# t) (list :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: ds ds1 -> False
}
-- RHS size: {terms: 1, types: 0, coercions: 0}
bar :: forall t. [t] -> Bool
[GblId,
Arity=1,
Caf=NoCafRefs,
Str=DmdType <S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (# t) (list [Occ=Once!] :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: _ [Occ=Dead] _ [Occ=Dead] -> False
}}]
bar = foo
I think the Tmpl stuff is the original implementation exposed for inlining in other modules, but I'm not certain.

Why does Haskell's 'even' function slow my program down? [duplicate]

This question already has an answer here:
GHC 7.10 generates slower code than older versions
(1 answer)
Closed 6 years ago.
I have following code. It costs 1s to run with argument 1000000, but it costs 5s to run if replace myEven with standard even function. I checked the code, the standard even function does exactly the same as * myEven *.
import Data.Word
import Data.List
import System.Environment
collatzNext :: Word32 -> Word32
collatzNext a = (if myEven a then a else 3*a+1) `div` 2
myEven :: (Integral a) => a -> Bool
myEven a = (a `rem` 2) == 0
collatzLen :: Word32 -> Int
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
main = do
[a0] <- getArgs
let max_a0 = (read a0)::Word32
print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]
If you add {-# NOINLINE myEven #-}, you'll get the same slowdown. The issue is that myEven is defined locally, so it's source is available to compiler, and it is inlined. All allocations and function call itself are eliminated:
Main.$wgo1 [InlPrag=[0], Occ=LoopBreaker]
:: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wgo1 =
\ (ww_s6n0 :: GHC.Prim.Word#) (ww1_s6n4 :: GHC.Prim.Int#) ->
case ww_s6n0 of wild_X2j {
__DEFAULT ->
case GHC.Prim.remWord# wild_X2j (__word 2) of _ [Occ=Dead] {
__DEFAULT ->
Main.$wgo1
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X2j))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s6n4 1);
__word 0 ->
Main.$wgo1
(GHC.Prim.quotWord# wild_X2j (__word 2)) (GHC.Prim.+# ww1_s6n4 1)
};
__word 1 -> ww1_s6n4
}
But even is defined in other module and it is not marked as INLINE or INLINEABLE. As a result it is not inlined, and each call to even allocates boxed Word32:
Main.$wgo1 [InlPrag=[0], Occ=LoopBreaker]
:: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
Main.$wgo1 =
\ (ww_s6mz :: GHC.Prim.Word#) (ww1_s6mD :: GHC.Prim.Int#) ->
case ww_s6mz of wild_X1W {
__DEFAULT ->
case even
# Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1W)
of _ [Occ=Dead] {
False ->
Main.$wgo1
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1W))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s6mD 1);
True ->
Main.$wgo1
(GHC.Prim.quotWord# wild_X1W (__word 2)) (GHC.Prim.+# ww1_s6mD 1)
};
__word 1 -> ww1_s6mD
}
Note that even is specialized for Int and Integer, but not for Word32, so the issue doesn't occurs if you use Int.

Unboxing a function

I have a function that I am trying to optimize. This is part of a bigger code where I suspect this function is preventing GHC from unboxing Int arguments at higher level function that calls it. So, I wrote a simple test with two things in mind - understand the core, and try different things to see what makes GHC unbox it, so that I can apply the lessons to bigger code. Here is the function cmp with a test function wrapper:
{-# LANGUAGE BangPatterns #-}
module Cmp
( cmp,
test )
where
import Data.Vector.Unboxed as U hiding (mapM_)
import Data.Word
cmp :: (U.Unbox a, Eq a) => U.Vector a -> U.Vector a -> Int -> Int -> Int
cmp a b !i !j = go a b 0 i j
where
go v1 v2 !len !i !j| (i<n) && (j<m) && ((unsafeIndex v1 i) == (unsafeIndex v2 j)) = go v1 v2 (len+1) (i+1) (j+1)
| otherwise = len
where
n = U.length a
m = U.length b
{-# INLINABLE cmp #-}
test :: (U.Unbox a, Eq a) => U.Vector a -> U.Vector a -> U.Vector Int -> Int
test a b i = U.sum $ U.map (\x -> cmp a b x x) i
Ideally, test should call unboxed version of cmp with following signature (of course, correct me if I am wrong):
U.Vector a -> U.Vector a -> Int# -> Int# -> Int#
Looking at the core generated in ghc 7.6.1 (command line option:ghc -fforce-recomp -ddump-simpl -dsuppress-uniques -dsuppress-idinfo -dsuppress-module-prefixes -O2 -fllvm), I see this for inner loop for test - snippets from core below, with my comments added:
-- cmp function doesn't have any helper functions with unboxed Int
--
cmp
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Int -> Int -> Int
...
-- This is the function that is called by test - it does keep the result
-- unboxed, but calls boxed cmp, and unboxes the result of cmp (I# y)
--
$wa
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Vector Int -> Int#
$wa =
\ (# a)
(w :: Unbox a)
(w1 :: Eq a)
(w2 :: Vector a)
(w3 :: Vector a)
(w4 :: Vector Int) ->
case w4
`cast` (<TFCo:R:VectorInt> ; <NTCo:R:VectorInt>
:: Vector Int ~# Vector Int)
of _ { Vector ipv ipv1 ipv2 ->
letrec {
$s$wfoldlM'_loop :: Int# -> Int# -> Int#
$s$wfoldlM'_loop =
\ (sc :: Int#) (sc1 :: Int#) ->
case >=# sc1 ipv1 of _ {
False ->
case indexIntArray# ipv2 (+# ipv sc1) of wild { __DEFAULT ->
let {
x :: Int
x = I# wild } in
--
-- Calls cmp and unboxes the Int result as I# y
--
case cmp # a w w1 w2 w3 x x of _ { I# y ->
$s$wfoldlM'_loop (+# sc y) (+# sc1 1)
}
};
True -> sc
}; } in
$s$wfoldlM'_loop 0 0
}
-- helper function called by test - it calls $wa which calls boxed cmp
--
test1
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Vector Int -> Id Int
test1 =
\ (# a)
(w :: Unbox a)
(w1 :: Eq a)
(w2 :: Vector a)
(w3 :: Vector a)
(w4 :: Vector Int) ->
case $wa # a w w1 w2 w3 w4 of ww { __DEFAULT ->
(I# ww) `cast` (Sym <(NTCo:Id <Int>)> :: Int ~# Id Int)
}
I will appreciate pointers on how to force unboxed version of cmp to be called from test. I tried strictifying different arguments, but that was like throwing the kitchen sink at it, which of course didn't work. I hope to use the lessons learnt here to solve the boxing/unboxing performance issue in the more complicated code.
Also, one more question - I have seen cast being used in the core, but haven't found any core references on Haskell/GHC wiki that explain what it is. It seems a type casting operation. I would appreciate explanation of what it is, and how to interpret it in the test1 function above.
Now I don't have ghc, so my advices would be verbal:
Why do you avoid {-# INLINE #-} pragma? High performance in Haskell is significantly based on function inlining. Add INLINE pragma to the go function.
Remove first two excessive parameters of go function. Read more about interoperation of inlining, specializing (unboxing) of parameters here: http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#inline-pragma
Move m and n definitions one level up, along with go.

Fusion optimization with intermediate values

Will GHC transform an expression with intermediate values as efficiently as one without?
e.g.
main = print $ f ["aa", "bb", "cc"]
f x =
let a = map (map toUpper) x
b = filter (\z -> 'C' /= head z) a
c = foldl1 (++) b
in c
seems to have very different core output (with -ddump-simple) than with
f x = foldl1 (++) $ filter (\z -> 'C' /= head z) $ map (map toUpper) x
Could an expression with intermediate values take (significantly) longer to evaluate?
Linear use of intermediate let bindings is equivalent to putting (.) between the values.
GHC will fuse through such pipelines. You can see from the results of -ddump-simpl-stats
With let Bindings:
15 RuleFired
1 ++
1 Class op /=
1 Class op show
1 Class op showList
1 filter
1 fold/build
1 foldr/app
1 map
1 neChar#->case
3 unpack
3 unpack-list
Using a pipeline:
15 RuleFired
1 ++
1 Class op /=
1 Class op show
1 Class op showList
1 filter
1 fold/build
1 foldr/app
1 map
1 neChar#->case
3 unpack
3 unpack-list
And the same fused worker:
With let Bindings:
Main.main_go =
\ (ds_aAz :: [[GHC.Types.Char]]) ->
case ds_aAz of _ {
[] -> GHC.Types.[] # [GHC.Types.Char];
: y_aAE ys_aAF ->
case GHC.Base.map
# GHC.Types.Char # GHC.Types.Char GHC.Unicode.toUpper y_aAE
of wild1_azI {
[] ->
GHC.List.badHead
`cast` (UnsafeCo (forall a_azK. a_azK) [[GHC.Types.Char]]
:: (forall a_azK. a_azK) ~ [[GHC.Types.Char]]);
: x_azM ds1_azN ->
case x_azM of _ { GHC.Types.C# c2_aAa ->
case c2_aAa of _ {
__DEFAULT ->
GHC.Types.: # [GHC.Types.Char] wild1_azI (Main.main_go ys_aAF);
'C' -> Main.main_go ys_aAF
}
Pipeline:
Main.main_go =
\ (ds_aAA :: [[GHC.Types.Char]]) ->
case ds_aAA of _ {
[] -> GHC.Types.[] # [GHC.Types.Char];
: y_aAF ys_aAG ->
case GHC.Base.map
# GHC.Types.Char # GHC.Types.Char GHC.Unicode.toUpper y_aAF
of wild1_azB {
[] ->
GHC.List.badHead
`cast` (UnsafeCo (forall a_azD. a_azD) [[GHC.Types.Char]]
:: (forall a_azD. a_azD) ~ [[GHC.Types.Char]]);
: x_azF ds1_azG ->
case x_azF of _ { GHC.Types.C# c2_aA3 ->
case c2_aA3 of _ {
__DEFAULT ->
GHC.Types.: # [GHC.Types.Char] wild1_azB (Main.main_go ys_aAG);
'C' -> Main.main_go ys_aAG
}
}
Did you forget to compile with -O2 ?

How much overhead do function calls have in Haskell?

I am new to Haskell and I am puzzled by the cost of a function call, which seems to be completely unreasonable to me, and makes me think I am doing something fundamentally wrong.
Consider the following Haskell code:
module Main where
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
Compiling this with the command
ghc -O3 -XBangPatterns -o tsths tst.hs
and running it, I get:
real 0m15.904s
user 0m15.853s
sys 0m0.016s
If instead of calling the function logistic I calculate the expression inline:
module Main where
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (4.0*x*(1.0-x)) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
the execution time becomes:
real 0m0.838s
user 0m0.828s
sys 0m0.004s
which is exactly the same as the equivalent C program, which is
#include <stdio.h>
int main() {
int i, num=100000000;
double x=0.7861;
for (i=0; i<num; ++i)
x *= 4.0*(1.0-x);
printf("%lg\n", x);
}
Am I doing something terribly wrong?
Many thanks.
It's a bug in GHC-7.4.1. Looking at the generated core (only the core for the function lg is important, from GHC-7.4.2, we get
Main.lg3 :: GHC.Types.Double
[GblId,
Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
ConLike=False, Cheap=False, Expandable=False,
Guidance=IF_ARGS [] 30 0}]
Main.lg3 = GHC.Float.$w$cfromRational Main.lg4 Main.lg2
Main.lg1 :: GHC.Types.Double
[GblId,
Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
ConLike=False, Cheap=False, Expandable=False,
Guidance=IF_ARGS [] 30 0}]
Main.lg1 = GHC.Float.$w$cfromRational Main.lg2 Main.lg2
Main.$wlg :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId,
Arity=2,
Str=DmdType LL,
Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True,
ConLike=True, Cheap=True, Expandable=True,
Guidance=IF_ARGS [0 30] 158 0}]
Main.$wlg =
\ (ww_s1Oy :: GHC.Prim.Double#) (ww1_s1OC :: GHC.Prim.Int#) ->
case ww1_s1OC of ds_Xvs {
__DEFAULT ->
case Main.lg3 of _ { GHC.Types.D# x_awJ ->
case Main.lg1 of _ { GHC.Types.D# x1_awV ->
letrec {
$wlg1_X1PF [Occ=LoopBreaker]
:: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[LclId, Arity=2, Str=DmdType LL]
$wlg1_X1PF =
\ (ww2_X1Pv :: GHC.Prim.Double#) (ww3_X1PA :: GHC.Prim.Int#) ->
case ww3_X1PA of ds1_Xwr {
__DEFAULT ->
$wlg1_X1PF
(GHC.Prim.*##
(GHC.Prim.*## x_awJ ww2_X1Pv) (GHC.Prim.-## x1_awV ww2_X1Pv))
(GHC.Prim.-# ds1_Xwr 1);
0 -> ww2_X1Pv
}; } in
$wlg1_X1PF
(GHC.Prim.*##
(GHC.Prim.*## x_awJ ww_s1Oy) (GHC.Prim.-## x1_awV ww_s1Oy))
(GHC.Prim.-# ds_Xvs 1)
}
};
0 -> ww_s1Oy
}
two top-level Doubles and a decent loop.
GHC-7.4.1 was a bit too inlining-happy, that produced
Rec {
Main.$wlg [Occ=LoopBreaker]
:: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId, Arity=2, Str=DmdType LL]
Main.$wlg =
\ (ww_s1NS :: GHC.Prim.Double#) (ww1_s1NW :: GHC.Prim.Int#) ->
case ww1_s1NW of ds_Xvb {
__DEFAULT ->
case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic4 Main.logistic2
of ww2_a1Mt { __DEFAULT ->
case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic2 Main.logistic2
of ww3_X1Nq { __DEFAULT ->
Main.$wlg
(GHC.Prim.*##
(GHC.Prim.*## ww2_a1Mt ww_s1NS) (GHC.Prim.-## ww3_X1Nq ww_s1NS))
(GHC.Prim.-# ds_Xvb 1)
}
};
0 -> ww_s1NS
}
end Rec }
and gave you two calls to the fromRational worker in each iteration.
Now, fromRational is a fairly complicated function. It is still rather slow, despite having gotten a much faster implementation in the 7.2 series, so these calls hurt big time.
With a type signature, there are no Rational top-level constants produced, only Double constants, and these are then used, which of course doesn't include a gratuitous slowdown.
As suggested by Dan Burton it is actually overhead of polymorphic function, because GHC infers type logistic :: Fractional a => a -> a. If you specify type explicitly you generally enable both better checking and better optimizations. I believe it is good practice to specify type of function explicitly.
If you want to have function with polymorphic type but have full speed of monomorphic call in case of specific usage you can use SPECIALIZE pragma, but I believe this is GHC specific.
{-# LANGUAGE BangPatterns #-}
module Main where
logistic :: Fractional a => a -> a
{-# SPECIALISE logistic :: Double -> Double #-}
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
Also note that you can specify LANGUAGE pragma at the beginning of file to enable bang patterns and don't need to enable them on command line.
Times on my machine were 21 seconds for original, 0.67 sec for explicit type, 0.7 sec for specialize (which is basically the same).
I believe overhead of specialized call is very small because it is just bunch of instructions which gets inlined anyway but polymorphic function results in call. Though it is weird that GHC cant inline despite polymorphism.
Add a type signature to logistic and you will see the speedup. Allow me to use CPP to demonstrate the difference.
bash> cat tst.hs
module Main where
#if defined(SIG)
logistic :: Double -> Double
#endif
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
bash> ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.4.1
If compiled without SIG defined (the type signature is excluded):
bash> ghc -O3 -XBangPatterns -XCPP -o tsths tst.hs
[1 of 1] Compiling Main ( tst.hs, tst.o )
Linking tsths ...
bash> time ./tsths
0.34209286442469333
real 0m13.187s
user 0m13.177s
sys 0m0.008s
Now lets compile with SIG defined so the signature is included:
bash> rm tsths *.o *.hi
bash> ghc -O3 -XBangPatterns -XCPP -DSIG -o tsths tst.hs
[1 of 1] Compiling Main ( tst.hs, tst.o )
Linking tsths ...
bash> time ./tsths
0.34209286442469333
real 0m0.464s
user 0m0.440s
sys 0m0.020s
Not sure why GHC doesn't optimize it without the signature; the monomorphism restriction should restrict it to Double -> Double anyways.

Resources