Why does Haskell's 'even' function slow my program down? [duplicate]

Why does Haskell's 'even' function slow my program down? [duplicate] - performance

This question already has an answer here:
GHC 7.10 generates slower code than older versions
(1 answer)
Closed 6 years ago.
I have following code. It costs 1s to run with argument 1000000, but it costs 5s to run if replace myEven with standard even function. I checked the code, the standard even function does exactly the same as * myEven *.
import Data.Word
import Data.List
import System.Environment
collatzNext :: Word32 -> Word32
collatzNext a = (if myEven a then a else 3*a+1) `div` 2
myEven :: (Integral a) => a -> Bool
myEven a = (a `rem` 2) == 0
collatzLen :: Word32 -> Int
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
main = do
[a0] <- getArgs
let max_a0 = (read a0)::Word32
print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

If you add {-# NOINLINE myEven #-}, you'll get the same slowdown. The issue is that myEven is defined locally, so it's source is available to compiler, and it is inlined. All allocations and function call itself are eliminated:
Main.$wgo1 [InlPrag=[0], Occ=LoopBreaker]
:: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wgo1 =
\ (ww_s6n0 :: GHC.Prim.Word#) (ww1_s6n4 :: GHC.Prim.Int#) ->
case ww_s6n0 of wild_X2j {
__DEFAULT ->
case GHC.Prim.remWord# wild_X2j (__word 2) of _ [Occ=Dead] {
__DEFAULT ->
Main.$wgo1
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X2j))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s6n4 1);
__word 0 ->
Main.$wgo1
(GHC.Prim.quotWord# wild_X2j (__word 2)) (GHC.Prim.+# ww1_s6n4 1)
};
__word 1 -> ww1_s6n4
}
But even is defined in other module and it is not marked as INLINE or INLINEABLE. As a result it is not inlined, and each call to even allocates boxed Word32:
Main.$wgo1 [InlPrag=[0], Occ=LoopBreaker]
:: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
Main.$wgo1 =
\ (ww_s6mz :: GHC.Prim.Word#) (ww1_s6mD :: GHC.Prim.Int#) ->
case ww_s6mz of wild_X1W {
__DEFAULT ->
case even
# Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1W)
of _ [Occ=Dead] {
False ->
Main.$wgo1
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1W))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s6mD 1);
True ->
Main.$wgo1
(GHC.Prim.quotWord# wild_X1W (__word 2)) (GHC.Prim.+# ww1_s6mD 1)
};
__word 1 -> ww1_s6mD
}
Note that even is specialized for Int and Integer, but not for Word32, so the issue doesn't occurs if you use Int.

Related

ICFPC 2006 task on Haskell is too slow

I am learning Haskell, and as a task for myself I was trying to implement a Universal Machine from ICFP Contest 2006. I came up with a code, which, at a first glance, seems to be working. However, when I try to execute any application for that Universal Machine provided on the contest's website (e.g. sandmark.umz), my implementation is too slow to actually run anything. Self-check did not finish in a couple of hours, and I had to kill the process. So, I am clearly doing something wrong, I just don't know what.
I have tried to use Haskell's profiler, but I couldn't make any sense out of those numbers as well. Garbage collection doesn't seem to be taking a lot of time (3 seconds out of 173 seconds of a sample). However, total allocated memory during those 173 seconds was almost 6 GB, while the maximum heap size was 13 MB.
Could you help me understand, what is wrong with my code? I know that the amount of code is quite large, but I am not sure how to come up with a minimum reproducible example in my case, when I don't really know what is relevant, and what is not. Thank you.
module Main where
import System.Environment (getArgs)
import System.Exit (exitSuccess)
import System.IO (hPutStrLn, stderr)
import System.IO.Error (catchIOError)
import Control.Monad (when)
import Control.Monad.Loops (iterateM_)
import Data.Array.IO (IOUArray, newArray, newListArray, readArray, writeArray, mapArray)
import Data.Bits
import Data.Binary.Get (getWord32be, runGet, isEmpty, Get)
import Data.Char (chr, ord)
import Data.Word (Word32)
import Data.Maybe (fromJust)
import qualified Data.IntMap.Strict as M
import qualified Data.ByteString.Lazy as B
import qualified Data.IntSet as IntSet
data UMState = UMState {
getFinger :: Word32,
getRegisters :: IOUArray Int Word32,
getArrays :: M.IntMap (IOUArray Word32 Word32),
getFreeIds :: [Int],
getMaxId :: Int,
getCopiedPlatters :: IntSet.IntSet
}
getOperation :: Word32 -> Int
getOperation x = fromIntegral $ (x `shiftR` 28) .&. 15
getRegisterIds :: Word32 -> (Int, Int, Int)
getRegisterIds x = (fromIntegral $ (x `shiftR` 6) .&. 7, fromIntegral $ (x `shiftR` 3) .&. 7, fromIntegral $ x .&. 7)
getOrthography :: Word32 -> (Int, Word32)
getOrthography x = (fromIntegral $ (x `shiftR` 25) .&. 7, x .&. 33554431)
setFinger :: UMState -> Word32 -> UMState
setFinger (UMState {
getFinger = _,
getRegisters = regs,
getArrays = arr,
getFreeIds = fids,
getMaxId = mid,
getCopiedPlatters = cp
}) f' = UMState {
getFinger = f',
getRegisters = regs,
getArrays = arr,
getFreeIds = fids,
getMaxId = mid,
getCopiedPlatters = cp
}
removePlatter :: UMState -> Int -> UMState
removePlatter (UMState {
getFinger = f,
getRegisters = regs,
getArrays = arr,
getFreeIds = fids,
getMaxId = mid,
getCopiedPlatters = cp
}) pid = UMState {
getFinger = f,
getRegisters = regs,
getArrays = M.delete pid arr,
getFreeIds = (pid:fids),
getMaxId = mid,
getCopiedPlatters = cp
}
insertPlatter :: UMState -> Int -> IOUArray Word32 Word32 -> UMState
insertPlatter (UMState {
getFinger = f,
getRegisters = regs,
getArrays = arr,
getFreeIds = fids#(hfid:tfids),
getMaxId = mid,
getCopiedPlatters = cp
}) pid platter = UMState {
getFinger = f,
getRegisters = regs,
getArrays = M.insert pid platter arr,
getFreeIds = if pid == hfid then tfids else fids,
getMaxId = max mid pid,
getCopiedPlatters = cp
}
insertPlatter (UMState {
getFinger = f,
getRegisters = regs,
getArrays = arr,
getFreeIds = [],
getMaxId = mid,
getCopiedPlatters = cp
}) pid platter = UMState {
getFinger = f,
getRegisters = regs,
getArrays = M.insert pid platter arr,
getFreeIds = [],
getMaxId = max mid pid,
getCopiedPlatters = cp
}
setCopiedPlatters :: UMState -> IntSet.IntSet -> UMState
setCopiedPlatters (UMState {
getFinger = f,
getRegisters = regs,
getArrays = arr,
getFreeIds = fids,
getMaxId = mid,
getCopiedPlatters = _
}) copied' = UMState {
getFinger = f,
getRegisters = regs,
getArrays = arr,
getFreeIds = fids,
getMaxId = mid,
getCopiedPlatters = copied'
}
main = do
args <- getArgs
fileName <- parseArgs $ filter (\arg -> arg /= "--") args
platters <- B.readFile fileName
array0 <- listToArray (runGet readPlatters platters)
regs <- (newArray (0, 7) 0 :: IO (IOUArray Int Word32))
let initState = (UMState {
getFinger = 0,
getRegisters = regs,
getArrays = M.insert 0 array0 M.empty,
getFreeIds = [],
getMaxId = 0,
getCopiedPlatters = IntSet.empty
})
in iterateM_ spinCycle initState
parseArgs :: [String] -> IO (String)
parseArgs [arg] = return arg
parseArgs args = fail $ "Exactly one argument expected. Found: " ++ (show args)
readPlatters :: Get [Word32]
readPlatters = do
empty <- isEmpty
if empty
then return []
else do
platter <- getWord32be
theRest <- readPlatters
return (platter:theRest)
listToArray :: [Word32] -> IO (IOUArray Word32 Word32)
listToArray lst = newListArray (fromIntegral 0, fromIntegral (length lst) - 1) lst
spinCycle :: UMState -> IO (UMState)
spinCycle state = do
platter <- readArray (fromJust (M.lookup 0 (getArrays state))) (getFinger state)
let state' = setFinger state $ getFinger state + 1
(aId, bId, cId) = getRegisterIds platter
regs = getRegisters state'
arrays = getArrays state' in (
case (getOperation platter) of
0 -> do
runConditionalMove aId bId cId regs
return state'
1 -> do
runArrayIndex aId bId cId regs arrays
return state'
2 -> runArrayAmendment aId bId cId state'
3 -> do
runAddition aId bId cId regs
return state'
4 -> do
runMultiplication aId bId cId regs
return state'
5 -> do
runDivision aId bId cId regs
return state'
6 -> do
runNand aId bId cId regs
return state'
7 -> runHalt
8 -> runAllocation bId cId state'
9 -> runAbandonment cId state'
10 -> do
runOutput cId regs
return state'
11 -> do
runInput cId regs
return state'
12 -> runLoadProgram bId cId state'
13 -> do
let (reg, val) = getOrthography platter
in (runOrthography reg val regs)
return state'
)
-- #0. Conditional Move.
runConditionalMove :: Int -> Int -> Int -> IOUArray Int Word32 -> IO ()
runConditionalMove a b c regs = do
hPutStrLn stderr ("conditionalMove " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
cRead <- readArray regs c
when (cRead /= 0) $ do
bRead <- readArray regs b
writeArray regs a bRead
-- #1. Array Index.
runArrayIndex :: Int -> Int -> Int -> IOUArray Int Word32 -> M.IntMap (IOUArray Word32 Word32) -> IO ()
runArrayIndex a b c regs arrays = do
hPutStrLn stderr ("arrayIndex " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
bRead <- readArray regs b
cRead <- readArray regs c
val <- readArray (fromJust (M.lookup (fromIntegral bRead) arrays)) cRead
writeArray regs a val
-- #2. Array Amendment.
runArrayAmendment :: Int -> Int -> Int -> UMState -> IO (UMState)
runArrayAmendment a b c state = do
hPutStrLn stderr ("arrayAmendment " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
aRead <- readArray (getRegisters state) a
bRead <- readArray (getRegisters state) b
cRead <- readArray (getRegisters state) c
stateToWrite <- if IntSet.member (fromIntegral aRead) (getCopiedPlatters state) then (do
pCopy <- mapArray id (fromJust (M.lookup (fromIntegral aRead) (getArrays state)))
let state' = insertPlatter state (fromIntegral aRead) pCopy
state'' = setCopiedPlatters state' $ IntSet.delete (fromIntegral aRead) (getCopiedPlatters state')
in return state''
) else return state
writeArray (fromJust (M.lookup (fromIntegral aRead) (getArrays stateToWrite))) bRead cRead
return stateToWrite
-- #3. Addition.
runAddition :: Int -> Int -> Int -> IOUArray Int Word32 -> IO ()
runAddition a b c regs = do
hPutStrLn stderr ("addition " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
bRead <- readArray regs b
cRead <- readArray regs c
writeArray regs a (bRead + cRead)
-- #4. Multiplication.
runMultiplication :: Int -> Int -> Int -> IOUArray Int Word32 -> IO ()
runMultiplication a b c regs = do
hPutStrLn stderr ("multiplication " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
bRead <- readArray regs b
cRead <- readArray regs c
writeArray regs a (bRead * cRead)
-- #5. Division.
runDivision :: Int -> Int -> Int -> IOUArray Int Word32 -> IO ()
runDivision a b c regs = do
hPutStrLn stderr ("division " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
bRead <- readArray regs b
cRead <- readArray regs c
writeArray regs a (bRead `div` cRead)
-- #6. Not-And.
runNand :: Int -> Int -> Int -> IOUArray Int Word32 -> IO ()
runNand a b c regs = do
hPutStrLn stderr ("nand " ++ (show a) ++ " " ++ (show b) ++ " " ++ (show c))
bRead <- readArray regs b
cRead <- readArray regs c
writeArray regs a (complement $ bRead .&. cRead)
-- #7. Halt.
runHalt = exitSuccess :: IO (UMState)
-- #8. Allocation.
runAllocation :: Int -> Int -> UMState -> IO (UMState)
runAllocation b c state = do
hPutStrLn stderr ("allocation " ++ (show b) ++ " " ++ (show c))
cRead <- readArray (getRegisters state) c
pArray <- (newArray (0, cRead) 0 :: IO (IOUArray Word32 Word32))
(state', newId) <-
case (getFreeIds state) of
(freeId:_) -> return (insertPlatter state freeId pArray, freeId)
[] -> let maxId' = getMaxId state + 1 in return (insertPlatter state maxId' pArray, maxId')
writeArray (getRegisters state') b (fromIntegral newId)
return state'
-- #9. Abandonment.
runAbandonment :: Int -> UMState -> IO (UMState)
runAbandonment c state = do
hPutStrLn stderr ("abandonment " ++ (show c))
cRead <- readArray (getRegisters state) c
return (removePlatter state $ fromIntegral cRead)
-- #10. Output.
runOutput :: Int -> IOUArray Int Word32 -> IO ()
runOutput c regs = do
cRead <- readArray regs c
when (cRead < 256) $ putChar . chr . fromIntegral $ cRead
-- #11. Input.
runInput :: Int -> IOUArray Int Word32 -> IO ()
runInput c regs = do
cRead <- getChar `catchIOError` (\_ -> return $ chr 255)
writeArray regs c (fromIntegral $ ord cRead)
-- #12. Load Program.
runLoadProgram :: Int -> Int -> UMState -> IO (UMState)
runLoadProgram b c state = do
hPutStrLn stderr ("loadProgram " ++ (show b) ++ " " ++ (show c))
bRead <- readArray (getRegisters state) b
cRead <- readArray (getRegisters state) c
let bReadInt = fromIntegral bRead
pCopy = fromJust (M.lookup bReadInt (getArrays state))
copied = IntSet.insert 0 (getCopiedPlatters state)
copied' = IntSet.insert bReadInt copied
state' = insertPlatter state 0 pCopy
state'' = setFinger state' cRead
state''' = setCopiedPlatters state'' copied'
in return state'''
-- #13. Orthography.
runOrthography :: Int -> Word32 -> IOUArray Int Word32 -> IO ()
runOrthography reg val regs = writeArray regs reg val

Total allocation of 3 gigabytes for a Haskell programming running for 176 seconds is miniscule. Most Haskell programs allocate 3-6 gigabytes per second for their entire runtime. In your case, much of the program is running in tight, allocation-free loops (generally a good thing when you're trying to write a fast program), which may explain the small amount of allocation. The small proportion of time spent garbage collecting is also a good sign.
I tested your program on sandmark.umz andcodex.umz, built with -O2 and no profiling.
I believe the main problem is that the hPutStrLn logging lines are generating tons of output, so your universal machine is spending all its time writing logs.
Comment out all the hPutStrLn lines, and SANDmark prints a line every few seconds. I have no idea how fast it's supposed to be, but it's certainly running.
For Codex, it completes the self-check succeeded in a few seconds and accepts a 32-character key. If you enter the wrong key, it prints "wrong key". If you enter the right key, it prints "decrypting..." At this point, it seems to freeze up, so I suspect your implementation is too slow, but not nearly as slow as you were reporting.
Note that you may find it helpful to turn off buffering on stdin and stdout at the start of main:
main = do
hSetBuffering stdin NoBuffering
hSetBuffering stdout NoBuffering
...
so that the getChar and putChar-based I/O operate immediately. This isn't strictly necessary, but might help avoid apparent lockups that are actually just buffering issues.

Implicit pattern matching in Haskell

Should one expect performance differences between these two emptys, or is it merely a matter of stylistic preference?
foo list = case list of
[] -> True
(_ : _) -> False
bar list = case list of
(_ : _) -> False
_ -> True

In general you should not expect performance to change predictably between trivial fiddling around with patterns like what you're asking about, and can often expect the generated code to be identical.
But the way to actually check is to look at core and or benchmark with criterion. In this case the generated code is the same, and indeed GHC seems to actually combine them:
I compiled the snippet above with
ghc -Wall -O2 -ddump-to-file -ddump-simpl -dsuppress-module-prefixes -dsuppress-uniques -fforce-recomp YourCode.hs
And we see this core:
foo :: forall t. [t] -> Bool
[GblId,
Arity=1,
Caf=NoCafRefs,
Str=DmdType <S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (# t) (list [Occ=Once!] :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: _ [Occ=Dead] _ [Occ=Dead] -> False
}}]
foo =
\ (# t) (list :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: ds ds1 -> False
}
-- RHS size: {terms: 1, types: 0, coercions: 0}
bar :: forall t. [t] -> Bool
[GblId,
Arity=1,
Caf=NoCafRefs,
Str=DmdType <S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (# t) (list [Occ=Once!] :: [t]) ->
case list of _ [Occ=Dead] {
[] -> True;
: _ [Occ=Dead] _ [Occ=Dead] -> False
}}]
bar = foo
I think the Tmpl stuff is the original implementation exposed for inlining in other modules, but I'm not certain.

What is the correct way to perform constant-space nested loops in Haskell?

There are two obvious, "idiomatic" ways to perform nested loops in Haskell: using the list monad or using forM_ to replace traditional fors. I've set a benchmark to determine if those are compiled to tight loops:
import Control.Monad.Loop
import Control.Monad.Primitive
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Unboxed as V
times = 100000
side = 100
-- Using `forM_` to replace traditional fors
test_a mvec =
forM_ [0..times-1] $ \ n -> do
forM_ [0..side-1] $ \ y -> do
forM_ [0..side-1] $ \ x -> do
MV.write mvec (y*side+x) 1
-- Using the list monad to replace traditional forms
test_b mvec = sequence_ $ do
n <- [0..times-1]
y <- [0..side-1]
x <- [0..side-1]
return $ MV.write mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
-- test_a mvec
-- test_b mvec
vec' <- V.unsafeFreeze mvec :: IO (V.Vector Int)
print $ V.sum vec'
This test creates a 100x100 vector, writes 1 to each index using nested loop and repeats that 100k times. Compiling those with just ghc -O2 test.hs -o test (ghc version 7.8.4), the results are: 3.853s for the forM_ version and 10.460s for the list monad. In order to provide a reference, I also programmed this test in JavaScript:
var side = 100;
var times = 100000;
var vec = [];
for (var i=0; i<side*side; ++i)
vec.push(0);
for (var n=0; n<times; ++n)
for (var y=0; y<side; ++y)
for (var x=0; x<side; ++x)
vec[x+y*side] = 1;
var s = 0;
for (var i=0; i<side*side; ++i)
s += vec[i];
console.log(s);
This equivalent JavaScript program takes 1s to complete, beating Haskell's unboxed vectors, which is unusual, suggesting that Haskell is not running the loop in constant space, but doing allocations instead. I've then found a library that claims to provide type-guaranteed tight loops Control.Monad.Loop:
-- Using `for` from Control.Monad.Loop
test_c mvec = exec_ $ do
n <- for 0 (< times) (+ 1)
x <- for 0 (< side) (+ 1)
y <- for 0 (< side) (+ 1)
liftIO (MV.write mvec (y*side+x) 1)
Which runs in 1s. That library isn't very used and far from idiomatic, though, so, what is the idiomatic way to get fast constant-space bidimensional computations? (Note this isn't a case for REPA as I want to perform arbitrary IO actions on the grid.)

Writing tight mutating code with GHC can be tricky sometimes. I'm going to write about a couple of different things, probably in a manner that is more rambling and tl;dr than I would prefer.
For starters, we should use GHC 7.10 in any case, since otherwise the forM_ and list monad solutions never fuse.
Also, I replaced MV.write with MV.unsafeWrite, partly because it's faster, but more importantly it reduces some of the clutter in the resultant Core. From now on runtime statistics refer to code with unsafeWrite.
The dreaded let floating
Even with GHC 7.10, we should first notice all those [0..times-1] and [0..side-1] expressions, because they will ruin performance every time if we don't take necessary steps. The issue is that they are constant ranges, and -ffull-laziness (which is enabled by default on -O) floats them out to top level. This prevents list fusion, and iterating over an Int# range is cheaper than iterating over a list of boxed Int-s anyway, so it's a really bad optimization.
Let's see some runtimes in seconds for the unchanged (aside from using unsafeWrite) code. ghc -O2 -fllvm is used, and I use +RTS -s for timing.
test_a: 1.6
test_b: 6.2
test_c: 0.6
For GHC Core viewing I used ghc -O2 -ddump-simpl -dsuppress-all -dno-suppress-type-signatures.
In the case of test_a, the [0..99] ranges are lifted out:
main4 :: [Int]
main4 = eftInt 0 99 -- means "enumFromTo" for Int.
although the outermost [0..9999] loop is fused into a tail-recursive helper:
letrec {
a3_s7xL :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7xL =
\ (x_X5zl :: Int#) (s1_X4QY :: State# RealWorld) ->
case a2_s7xF 0 s1_X4QY of _ { (# ipv2_a4NA, ipv3_a4NB #) ->
case x_X5zl of wild_X1S {
__DEFAULT -> a3_s7xL (+# wild_X1S 1) ipv2_a4NA;
99999 -> (# ipv2_a4NA, () #)
}
}; }
In the case of test_b, again only the [0..99] are lifted. However, test_b is much slower, because it has to build and sequence actual [IO ()] lists. At least GHC is sensible enough to only build a single [IO ()] for the two inner loops, and then perform sequencing it 10000 times.
let {
lvl7_s4M5 :: [IO ()]
lvl7_s4M5 = -- omitted
letrec {
a2_s7Av :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Av =
\ (x_a5xi :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Au
:: [IO ()] -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Au =
\ (ds_a4Nu :: [IO ()]) (eta1_X1c :: State# RealWorld) ->
case ds_a4Nu of _ {
[] ->
case x_a5xi of wild1_X1y {
__DEFAULT -> a2_s7Av (+# wild1_X1y 1) eta1_X1c;
99999 -> (# eta1_X1c, () #)
};
: y_a4Nz ys_a4NA ->
case (y_a4Nz `cast` ...) eta1_X1c
of _ { (# ipv2_a4Nf, ipv3_a4Ng #) ->
a3_s7Au ys_a4NA ipv2_a4Nf
}
}; } in
a3_s7Au lvl7_s4M5 eta_B1; } in
-- omitted
How can we remedy this? We could nuke the problem with {-# OPTIONS_GHC -fno-full-laziness #-}. This indeed helps a lot in our case:
test_a: 0.5
test_b: 0.48
test_c: 0.5
Alternatively, we could fiddle around with INLINE pragmas. Apparently inlining functions after the let floating is done preserves good performance. I found that GHC inlines our test functions even without a pragma, but an explicit pragma causes it to inline only after let floating. For example, this results in good performance without -fno-full-laziness:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE test_a #-}
But inlining too early results in poor performance:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
{-# INLINE [~2] test_a #-} -- "inline before the first phase please"
The problem with this INLINE solution is that it's rather fragile in the face of GHC's floating onslaught. For example, manual inlining does not preserve performance. The following code is slow because similarly to INLINE [~2] it gives GHC a chance to float out:
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
So what should we do?
First, I think using -fno-full-laziness is a perfectly viable and even preferable option for those who'd like to write high performance code and have a good idea what they are doing. For example, it's used in unordered-containers. With it we have more precise control over sharing, and we can always just float out or inline manually.
For more regular code, I believe there's nothing wrong with using Control.Monad.Loop or any other package that provides the functionality. Many Haskell users are not scrupulous about depending on small "fringe" libraries. We can also just reimplement for, in a desired generality. For instance, the following performs just as well as the other solutions:
for :: Monad m => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()
for init while step body = go init where
go !i | while i = body i >> go (step i)
go i = return ()
{-# INLINE for #-}
Looping in really constant space
I was at first very puzzled by the +RTS -s data on heap allocation. test_a allocated non-trivially with -fno-full-laziness, and also test_c without full laziness, and these allocations scaled linearly with the number of times iterations, but test_b with full laziness allocated only for the vector:
-- with -fno-full-laziness, no INLINE pragmas
test_a: 242,521,008 bytes
test_b: 121,008 bytes
test_c: 121,008 bytes -- but 240,120,984 with full laziness!
Also, INLINE pragmas for test_c did not help at all in this case.
I spent some time trying to find signs of heap allocation in the Core for the relevant programs, without success, until the realization struck me: GHC stack frames are on the heap, including the frames of the main thread, and the functions that were doing heap allocation were essentially running the thrice-nested loops in at most three stack frames. The heap allocation registered by +RTS -s is just the constant popping and pushing of stack frames.
This is pretty much apparent from the Core for the following code:
{-# OPTIONS_GHC -fno-full-laziness #-}
-- ...
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
main = do
let vec = V.generate (side*side) (const 0)
mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_a mvec
Which I'm including here in its glory. Feel free to skip.
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5HK :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vr { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vr) of _ {
False ->
case newByteArray# 80000 (s_a5HK `cast` ...)
of _ { (# ipv_a5fv, ipv1_a5fw #) ->
letrec {
$s$wa_s8jS
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8jS =
\ (sc_s8jO :: Int#)
(sc1_s8jP :: Int#)
(sc2_s8jR :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jP 10000) of _ {
False -> (# sc2_s8jR, I# sc_s8jO #);
True ->
case writeIntArray# ipv1_a5fw sc_s8jO 0 (sc2_s8jR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
$s$wa_s8jS (+# sc_s8jO 1) (+# sc1_s8jP 1) (s'#_a5Gn `cast` ...)
}
}; } in
case $s$wa_s8jS 0 0 (ipv_a5fv `cast` ...)
-- end of vector creation -------------------
of _ { (# ipv6_a4Hv, ipv7_a4Hw #) ->
letrec {
a2_s7MJ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7MJ =
\ (x_a5Ho :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7ME :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7ME =
\ (x1_X5Id :: Int#) (eta1_XR :: State# RealWorld) ->
case ipv7_a4Hw of _ { I# dt4_a5x6 ->
case writeIntArray#
(ipv1_a5fw `cast` ...) (*# x1_X5Id 100) 1 (eta1_XR `cast` ...)
of s'#_a5Gn { __DEFAULT ->
letrec {
a4_s7Mz :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7Mz =
\ (x2_X5J8 :: Int#) (eta2_X1U :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5fw `cast` ...)
(+# (*# x1_X5Id 100) x2_X5J8)
1
(eta2_X1U `cast` ...)
of s'#1_X5Hf { __DEFAULT ->
case x2_X5J8 of wild_X2o {
__DEFAULT -> a4_s7Mz (+# wild_X2o 1) (s'#1_X5Hf `cast` ...);
99 -> (# s'#1_X5Hf `cast` ..., () #)
}
}; } in
case a4_s7Mz 1 (s'#_a5Gn `cast` ...)
of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x1_X5Id of wild_X1e {
__DEFAULT -> a3_s7ME (+# wild_X1e 1) ipv2_a4QH;
99 -> (# ipv2_a4QH, () #)
}
}
}
}; } in
case a3_s7ME 0 eta_B1 of _ { (# ipv2_a4QH, ipv3_a4QI #) ->
case x_a5Ho of wild_X1a {
__DEFAULT -> a2_s7MJ (+# wild_X1a 1) ipv2_a4QH;
99999 -> (# ipv2_a4QH, () #)
}
}; } in
a2_s7MJ 0 (ipv6_a4Hv `cast` ...)
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wm, ww6_a5wn #) ->
: ww5_a5wm ww6_a5wn
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
We can also nicely demonstrate the allocation of frames the following way. Let's change test_a:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-1] $ \ y ->
forM_ [0..side-50] $ \ x -> -- change here
MV.unsafeWrite mvec (y*side+x) 1
Now the heap allocation stays exactly the same, because the innermost loop is tail-recursive and uses a single frame. With the following change, the heap allocation halves (to 124,921,008 bytes), because we push and pop half as many frames:
test_a mvec =
forM_ [0..times-1] $ \ n ->
forM_ [0..side-50] $ \ y -> -- change here
forM_ [0..side-1] $ \ x ->
MV.unsafeWrite mvec (y*side+x) 1
test_b and test_c (with no full laziness) instead compile to code that uses a nested case construct inside a single stack frame, and walks over the indices to see which one should be incremented. See the Core for the following main:
{-# LANGUAGE BangPatterns #-} -- later I'll talk about this
{-# OPTIONS_GHC -fno-full-laziness #-}
main = do
let vec = V.generate (side*side) (const 0)
!mvec <- V.unsafeThaw vec :: IO (MV.MVector (PrimState IO) Int)
test_c mvec
Voila:
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 =
\ (s_a5Iw :: State# RealWorld) ->
case divInt# 9223372036854775807 8 of ww4_a5vT { __DEFAULT ->
-- start of vector creation ----------------------
case tagToEnum# (># 10000 ww4_a5vT) of _ {
False ->
case newByteArray# 80000 (s_a5Iw `cast` ...)
of _ { (# ipv_a5g3, ipv1_a5g4 #) ->
letrec {
$s$wa_s8ji
:: Int#
-> Int#
-> State# (PrimState IO)
-> (# State# (PrimState IO), Int #)
$s$wa_s8ji =
\ (sc_s8je :: Int#)
(sc1_s8jf :: Int#)
(sc2_s8jh :: State# (PrimState IO)) ->
case tagToEnum# (<# sc1_s8jf 10000) of _ {
False -> (# sc2_s8jh, I# sc_s8je #);
True ->
case writeIntArray# ipv1_a5g4 sc_s8je 0 (sc2_s8jh `cast` ...)
of s'#_a5GP { __DEFAULT ->
$s$wa_s8ji (+# sc_s8je 1) (+# sc1_s8jf 1) (s'#_a5GP `cast` ...)
}
}; } in
case $s$wa_s8ji 0 0 (ipv_a5g3 `cast` ...)
of _ { (# ipv6_a4MX, ipv7_a4MY #) ->
case ipv7_a4MY of _ { I# dt4_a5xy ->
-- end of vector creation
letrec {
a2_s7Q6 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a2_s7Q6 =
\ (x_a5HT :: Int#) (eta_B1 :: State# RealWorld) ->
letrec {
a3_s7Q5 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a3_s7Q5 =
\ (x1_X5J9 :: Int#) (eta1_XP :: State# RealWorld) ->
letrec {
a4_s7MZ :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
a4_s7MZ =
\ (x2_X5Jl :: Int#) (s1_X4Xb :: State# RealWorld) ->
case writeIntArray#
(ipv1_a5g4 `cast` ...)
(+# (*# x1_X5J9 100) x2_X5Jl)
1
(s1_X4Xb `cast` ...)
of s'#_a5GP { __DEFAULT ->
-- the interesting part! ------------------
case x2_X5Jl of wild_X1y {
__DEFAULT -> a4_s7MZ (+# wild_X1y 1) (s'#_a5GP `cast` ...);
99 ->
case x1_X5J9 of wild1_X1o {
__DEFAULT -> a3_s7Q5 (+# wild1_X1o 1) (s'#_a5GP `cast` ...);
99 ->
case x_a5HT of wild2_X1c {
__DEFAULT -> a2_s7Q6 (+# wild2_X1c 1) (s'#_a5GP `cast` ...);
99999 -> (# s'#_a5GP `cast` ..., () #)
}
}
}
}; } in
a4_s7MZ 0 eta1_XP; } in
a3_s7Q5 0 eta_B1; } in
a2_s7Q6 0 (ipv6_a4MX `cast` ...)
}
}
};
True ->
case error
(unpackAppendCString#
"Primitive.basicUnsafeNew: length to large: "#
(case $wshowSignedInt 0 10000 ([])
of _ { (# ww5_a5wO, ww6_a5wP #) ->
: ww5_a5wO ww6_a5wP
}))
of wild_00 {
}
}
}
main :: IO ()
main = main1 `cast` ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` ...)
main :: IO ()
main = main2 `cast` ...
I have to admit that I basically don't know why some code avoids stack frame creation and some doesn't. I suspect that inlining from "the inside" out helps, and a quick inspection informed me that Control.Monad.Loop uses a CPS encoding, which might be relevant here, although the Monad.Loop solution is sensitive to let floating, and I couldn't determine on short notice from the Core why test_c with let floating fails to run in a single stack frame.
Now, the performance benefit of running in a single stack frame is small. We've seen that test_b is only slightly faster than test_a. I include this detour in the answer because I found it edifying.
The state hack and strict bindings
The so-called state hack makes GHC aggressive in inlining into IO and ST actions. I think I should mention it here, because besides let floating this is the other thing that can thoroughly ruin performance.
The state hack is enabled with optimizations -O, and can possibly slow down programs asymptotically. A simple example from Reid Barton:
import Control.Monad
import Debug.Trace
expensive :: String -> String
expensive x = trace "$$$" x
main :: IO ()
main = do
str <- fmap expensive getLine
replicateM_ 3 $ print str
With GHC-7.10.2, this prints "$$$" once without optimizations but three times with -O2. And it seems that with GHC-7.10, we can't get rid of this behavior with -fno-state-hack (which is the subject of the linked ticket from Reid Barton).
Strict monadic bindings reliably get rid of this problem:
main :: IO ()
main = do
!str <- fmap expensive getLine
replicateM_ 3 $ print str
I think it's good habit to do strict bindings in IO and ST. And I have some experience (not definitive though; I'm far from being a GHC expert) that strict bindings are especially needed if we use -fno-full-laziness. Apparently full laziness can help get rid of some of the work duplication introduced by the inlining caused by the state hack; with test_b and no full laziness, omitting the strict binding on !mvec <- V.unsafeThaw vec caused a slight slowdown and extremely ugly Core output.

In my experience forM_ [0..n-1] can perform well, but unfortunately it's not reliable. Just adding an INLINE pragma to test_a and using -O2 makes it run much faster (4s to 1s for me), but manually inlining it (copy paste) slows it down again.
A more reliable function is is for from statistics which is implemented as
-- | Simple for loop. Counts from /start/ to /end/-1.
for :: Monad m => Int -> Int -> (Int -> m ()) -> m ()
for n0 !n f = loop n0
where
loop i | i == n = return ()
| otherwise = f i >> loop (i+1)
{-# INLINE for #-}
Using it looks similar to forM_ with lists:
test_d :: MV.IOVector Int -> IO ()
test_d mv =
for 0 times $ \_ ->
for 0 side $ \i ->
for 0 side $ \j ->
MV.unsafeWrite mv (i*side + j) 1
but performs reliably well (0.85s for me) without any risk of allocating a list.

Why is the "better" digit-listing function slower?

I was playing around with Project Euler #34, and I wrote these functions:
import Data.Time.Clock.POSIX
import Data.Char
digits :: (Integral a) => a -> [Int]
digits x
| x < 10 = [fromIntegral x]
| otherwise = let (q, r) = x `quotRem` 10 in (fromIntegral r) : (digits q)
digitsByShow :: (Integral a, Show a) => a -> [Int]
digitsByShow = map (\x -> ord x - ord '0') . show
I thought that for sure digits has to be the faster one, as we don't convert to a String. I could not have been more wrong. I ran the two versions via pe034:
pe034 digitFunc = sum $ filter sumFactDigit [3..2540160]
where
sumFactDigit :: Int -> Bool
sumFactDigit n = n == (sum $ map sFact $ digitFunc n)
sFact :: Int -> Int
sFact n
| n == 0 = 1
| n == 1 = 1
| n == 2 = 2
| n == 3 = 6
| n == 4 = 24
| n == 5 = 120
| n == 6 = 720
| n == 7 = 5040
| n == 8 = 40320
| n == 9 = 362880
main = do
begin <- getPOSIXTime
print $ pe034 digitsByShow -- or digits
end <- getPOSIXTime
print $ end - begin
After compiling with ghc -O, digits consistently takes .5 seconds, while digitsByShow consistently takes .3 seconds. Why is this so? Why is the function which stays within Integer arithmetic slower, whereas the function which goes into string comparison is faster?
I ask this because I come from programming in Java and similar languages, where the % 10 trick of generating digits is way faster than the "convert to String" method. I haven't been able to wrap my head around the fact that converting to a string could be faster.

This is the best I can come up with.
digitsV2 :: (Integral a) => a -> [Int]
digitsV2 n = go n []
where
go x xs
| x < 10 = fromIntegral x : xs
| otherwise = case quotRem x 10 of
(q,r) -> go q (fromIntegral r : xs)
when compiled with -O2 and tested with Criterion
digits runs in 470.4 ms
digitsByShow runs in 421.8 ms
digitsV2 runs in 258.0 ms
results may vary
edit:
I am not sure why building the list like this helps so much.
But you can improve your codes speed by strictly evaluating quotRem x 10
You can do this with BangPatterns
| otherwise = let !(q, r) = x `quotRem` 10 in (fromIntegral r) : (digits q)
or with case
| otherwise = case quotRem x 10 of
(q,r) -> fromIntegral r : digits q
Doing this drops digits down to 323.5 ms
edit: time without using Criterion
digits = 464.3 ms
digitsStrict = 328.2 ms
digitsByShow = 259.2 ms
digitV2 = 252.5 ms
note: The criterion package measures software performance.

Let's investigate why #No_signal's solution is faster.
I made three runs of ghc:
ghc -O2 -ddump-simpl digits.hs >digits.txt
ghc -O2 -ddump-simpl digitsV2.hs >digitsV2.txt
ghc -O2 -ddump-simpl show.hs >show.txt
digits.hs
digits :: (Integral a) => a -> [Int]
digits x
| x < 10 = [fromIntegral x]
| otherwise = let (q, r) = x `quotRem` 10 in (fromIntegral r) : (digits q)
main = return $ digits 1
digitsV2.hs
digitsV2 :: (Integral a) => a -> [Int]
digitsV2 n = go n []
where
go x xs
| x < 10 = fromIntegral x : xs
| otherwise = let (q, r) = x `quotRem` 10 in go q (fromIntegral r : xs)
main = return $ digits 1
show.hs
import Data.Char
digitsByShow :: (Integral a, Show a) => a -> [Int]
digitsByShow = map (\x -> ord x - ord '0') . show
main = return $ digitsByShow 1
If you'd like to view the complete txt files, I placed them on ideone (rather than paste a 10000 char dump here):
digits.txt
digitsV2.txt
show.txt
If we carefully look through digits.txt, it appears that this is the relevant section:
lvl_r1qU = __integer 10
Rec {
Main.$w$sdigits [InlPrag=[0], Occ=LoopBreaker]
:: Integer -> (# Int, [Int] #)
[GblId, Arity=1, Str=DmdType <S,U>]
Main.$w$sdigits =
\ (w_s1pI :: Integer) ->
case integer-gmp-1.0.0.0:GHC.Integer.Type.ltInteger#
w_s1pI lvl_r1qU
of wild_a17q { __DEFAULT ->
case GHC.Prim.tagToEnum# # Bool wild_a17q of _ [Occ=Dead] {
False ->
let {
ds_s16Q [Dmd=<L,U(U,U)>] :: (Integer, Integer)
[LclId, Str=DmdType]
ds_s16Q =
case integer-gmp-1.0.0.0:GHC.Integer.Type.quotRemInteger
w_s1pI lvl_r1qU
of _ [Occ=Dead] { (# ipv_a17D, ipv1_a17E #) ->
(ipv_a17D, ipv1_a17E)
} } in
(# case ds_s16Q of _ [Occ=Dead] { (q_a11V, r_X12h) ->
case integer-gmp-1.0.0.0:GHC.Integer.Type.integerToInt r_X12h
of wild3_a17c { __DEFAULT ->
GHC.Types.I# wild3_a17c
}
},
case ds_s16Q of _ [Occ=Dead] { (q_X12h, r_X129) ->
case Main.$w$sdigits q_X12h
of _ [Occ=Dead] { (# ww1_s1pO, ww2_s1pP #) ->
GHC.Types.: # Int ww1_s1pO ww2_s1pP
}
} #);
True ->
(# GHC.Num.$fNumInt_$cfromInteger w_s1pI, GHC.Types.[] # Int #)
}
}
end Rec }
digitsV2.txt:
lvl_r1xl = __integer 10
Rec {
Main.$wgo [InlPrag=[0], Occ=LoopBreaker]
:: Integer -> [Int] -> (# Int, [Int] #)
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
Main.$wgo =
\ (w_s1wh :: Integer) (w1_s1wi :: [Int]) ->
case integer-gmp-1.0.0.0:GHC.Integer.Type.ltInteger#
w_s1wh lvl_r1xl
of wild_a1dp { __DEFAULT ->
case GHC.Prim.tagToEnum# # Bool wild_a1dp of _ [Occ=Dead] {
False ->
case integer-gmp-1.0.0.0:GHC.Integer.Type.quotRemInteger
w_s1wh lvl_r1xl
of _ [Occ=Dead] { (# ipv_a1dB, ipv1_a1dC #) ->
Main.$wgo
ipv_a1dB
(GHC.Types.:
# Int
(case integer-gmp-1.0.0.0:GHC.Integer.Type.integerToInt ipv1_a1dC
of wild2_a1ea { __DEFAULT ->
GHC.Types.I# wild2_a1ea
})
w1_s1wi)
};
True -> (# GHC.Num.$fNumInt_$cfromInteger w_s1wh, w1_s1wi #)
}
}
end Rec }
I actually couldn't find the relevant section for show.txt. I'll work on that later.
Right off the bat, digitsV2.hs produces shorter code. That's probably a good sign for it.
digits.hs seems to be following this psuedocode:
def digits(w_s1pI):
if w_s1pI < 10: return [fromInteger(w_s1pI)]
else:
ds_s16Q = quotRem(w_s1pI, 10)
q_X12h = ds_s16Q[0]
r_X12h = ds_s16Q[1]
wild3_a17c = integerToInt(r_X12h)
ww1_s1pO = r_X12h
ww2_s1pP = digits(q_X12h)
ww2_s1pP.pushFront(ww1_s1pO)
return ww2_s1pP
digitsV2.hs seems to be following this psuedocode:
def digitsV2(w_s1wh, w1_s1wi=[]): # actually disguised as go(), as #No_signal wrote
if w_s1wh < 10:
w1_s1wi.pushFront(fromInteger(w_s1wh))
return w1_s1wi
else:
ipv_a1dB, ipv1_a1dC = quotRem(w_s1wh, 10)
w1_s1wi.pushFront(integerToIn(ipv1a1dC))
return digitsV2(ipv1_a1dC, w1_s1wi)
It might not be that these functions mutate lists like my psuedocode suggests, but this immediately suggests something: it looks as if digitsV2 is fully tail-recursive, whereas digits is actually not (may have to use some Haskell trampoline or something). It appears as if Haskell needs to store all the remainders in digits before pushing them all to the front of the list, whereas it can just push them and forget about them in digitsV2. This is purely speculation, but it is well-founded speculation.

Unboxing a function

I have a function that I am trying to optimize. This is part of a bigger code where I suspect this function is preventing GHC from unboxing Int arguments at higher level function that calls it. So, I wrote a simple test with two things in mind - understand the core, and try different things to see what makes GHC unbox it, so that I can apply the lessons to bigger code. Here is the function cmp with a test function wrapper:
{-# LANGUAGE BangPatterns #-}
module Cmp
( cmp,
test )
where
import Data.Vector.Unboxed as U hiding (mapM_)
import Data.Word
cmp :: (U.Unbox a, Eq a) => U.Vector a -> U.Vector a -> Int -> Int -> Int
cmp a b !i !j = go a b 0 i j
where
go v1 v2 !len !i !j| (i<n) && (j<m) && ((unsafeIndex v1 i) == (unsafeIndex v2 j)) = go v1 v2 (len+1) (i+1) (j+1)
| otherwise = len
where
n = U.length a
m = U.length b
{-# INLINABLE cmp #-}
test :: (U.Unbox a, Eq a) => U.Vector a -> U.Vector a -> U.Vector Int -> Int
test a b i = U.sum $ U.map (\x -> cmp a b x x) i
Ideally, test should call unboxed version of cmp with following signature (of course, correct me if I am wrong):
U.Vector a -> U.Vector a -> Int# -> Int# -> Int#
Looking at the core generated in ghc 7.6.1 (command line option:ghc -fforce-recomp -ddump-simpl -dsuppress-uniques -dsuppress-idinfo -dsuppress-module-prefixes -O2 -fllvm), I see this for inner loop for test - snippets from core below, with my comments added:
-- cmp function doesn't have any helper functions with unboxed Int
--
cmp
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Int -> Int -> Int
...
-- This is the function that is called by test - it does keep the result
-- unboxed, but calls boxed cmp, and unboxes the result of cmp (I# y)
--
$wa
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Vector Int -> Int#
$wa =
\ (# a)
(w :: Unbox a)
(w1 :: Eq a)
(w2 :: Vector a)
(w3 :: Vector a)
(w4 :: Vector Int) ->
case w4
`cast` (<TFCo:R:VectorInt> ; <NTCo:R:VectorInt>
:: Vector Int ~# Vector Int)
of _ { Vector ipv ipv1 ipv2 ->
letrec {
$s$wfoldlM'_loop :: Int# -> Int# -> Int#
$s$wfoldlM'_loop =
\ (sc :: Int#) (sc1 :: Int#) ->
case >=# sc1 ipv1 of _ {
False ->
case indexIntArray# ipv2 (+# ipv sc1) of wild { __DEFAULT ->
let {
x :: Int
x = I# wild } in
--
-- Calls cmp and unboxes the Int result as I# y
--
case cmp # a w w1 w2 w3 x x of _ { I# y ->
$s$wfoldlM'_loop (+# sc y) (+# sc1 1)
}
};
True -> sc
}; } in
$s$wfoldlM'_loop 0 0
}
-- helper function called by test - it calls $wa which calls boxed cmp
--
test1
:: forall a.
(Unbox a, Eq a) =>
Vector a -> Vector a -> Vector Int -> Id Int
test1 =
\ (# a)
(w :: Unbox a)
(w1 :: Eq a)
(w2 :: Vector a)
(w3 :: Vector a)
(w4 :: Vector Int) ->
case $wa # a w w1 w2 w3 w4 of ww { __DEFAULT ->
(I# ww) `cast` (Sym <(NTCo:Id <Int>)> :: Int ~# Id Int)
}
I will appreciate pointers on how to force unboxed version of cmp to be called from test. I tried strictifying different arguments, but that was like throwing the kitchen sink at it, which of course didn't work. I hope to use the lessons learnt here to solve the boxing/unboxing performance issue in the more complicated code.
Also, one more question - I have seen cast being used in the core, but haven't found any core references on Haskell/GHC wiki that explain what it is. It seems a type casting operation. I would appreciate explanation of what it is, and how to interpret it in the test1 function above.

Now I don't have ghc, so my advices would be verbal:
Why do you avoid {-# INLINE #-} pragma? High performance in Haskell is significantly based on function inlining. Add INLINE pragma to the go function.
Remove first two excessive parameters of go function. Read more about interoperation of inlining, specializing (unboxing) of parameters here: http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#inline-pragma
Move m and n definitions one level up, along with go.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio