How do I optimize a loop which can be fully strict - performance

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.
(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).
In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind ought to be optimized to a simple loop. As you can see I've attempted to make various bits of the program strict.
So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?
(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)
-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4
{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
| n < 10 = i
| otherwise = foldDigits f i d
where (d, m) = n `quotRem` 10
!i = f acc m
reverseNumber :: Int -> Int
reverseNumber !n
= foldDigits accumulate 0 n
where accumulate !v !d = v * 10 + d
allDigitsOdd :: Int -> Bool
allDigitsOdd n
= foldDigits andOdd True n
where andOdd !a d = a && isOdd d
isOdd !x = x `rem` 2 /= 0
isReversible :: Int -> Bool
isReversible n
= notDivisibleByTen n && allDigitsOdd (n + rn)
where rn = reverseNumber n
notDivisibleByTen !x = x `rem` 10 /= 0
countRange acc start end
| start > end = acc
| otherwise = countRange (acc + v) (start + 1) end
where v = if isReversible start then 1 else 0
main
= print $ sum $ parMap rseq cr ranges
where max = 1000000000
qmax = max `div` 4
ranges = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
cr (s, e) = countRange 0 s e

As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is
Rec {
$wfoldDigits_r2cK
:: forall a_aha.
(a_aha -> GHC.Types.Int -> a_aha)
-> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
\ (# a_aha)
(w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
(w1_s285 :: a_aha)
(ww_s288 :: GHC.Prim.Int#) ->
case w1_s285 of acc_Xhi { __DEFAULT ->
let {
ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
[LclId, Str=DmdType]
ds_sNo =
case GHC.Prim.quotRemInt# ww_s288 10
of _ { (# ipv_aJA, ipv1_aJB #) ->
(GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
} } in
case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
of i_ahg { __DEFAULT ->
case GHC.Prim.<# ww_s288 10 of _ {
GHC.Types.False ->
case ds_sNo of _ { (d_Xsi, m_Xs5) ->
case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
$wfoldDigits_r2cK # a_aha w_s284 i_ahg ww1_X28L
}
};
GHC.Types.True -> i_ahg
}
}
}
end Rec }
which, as you can see, re-boxes the result of the quotRem call. The problem is that no property of f is available here, and as a recursive function, foldDigits cannot be inlined.
With a manual worker-wrapper transform making the function argument static,
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
where
go !acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (f acc r) q
foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.
Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
\ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
case ww1_s28J of ds_XJh {
__DEFAULT ->
case GHC.Prim.quotRemInt# ds_XJh 10
of _ { (# ipv_aJK, ipv1_aJL #) ->
$wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
};
0 -> ww_s28F
}
end Rec }
and the effect on computation time is tangible, for the original, I got
$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
196,407,088 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1827331 colls, 1827331 par 23.77s 11.86s 0.0000s 0.0041s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 54.94% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 620.52s (313.51s elapsed)
GC time 23.77s ( 11.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 644.29s (325.37s elapsed)
Alloc rate 2,923,834,808 bytes per MUT second
(I used -N2 since my i5 only has two physical cores), vs.
$ ./eul145 +RTS -s -N2
608720
16,000,063,624 bytes allocated in the heap
403,384 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15852 colls, 15852 par 0.34s 0.17s 0.0000s 0.0037s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 43.86% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 314.85s (160.08s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 315.20s (160.25s elapsed)
Alloc rate 50,817,657 bytes per MUT second
Productivity 99.9% of total user, 196.5% of total elapsed
with the modification. The running time roughly halved, and the allocations reduced 100-fold.

Related

ByteString concatMap performance

I have a 37MB bin file I am trying to convert to a ppm sequence. It works fine, and I'm trying to use this as an exercise to learn some profiling and more about lazy bytestrings in Haskell. My program seems to bomb at the concatMap, which is used to replicate each byte three times so I have R, G, and B. The code is fairly straight forward - every 2048 bytes I write a new header:
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import System.Environment
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as B
main :: IO ()
main = do [from, to] <- getArgs
withFile from ReadMode $ \inH ->
withFile to WriteMode $ \outH ->
loop (B.hGet inH 2048) (process outH) B.null
loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)
process :: Handle -> B.ByteString -> IO ()
process h bs | B.null bs = return ()
| otherwise = B.hPut h header >> B.hPut h bs'
where header = "P6\n32 64\n255\n" :: B.ByteString
bs' = B.concatMap (B.replicate 3) bs
This pulls it off in a little over 5s. It's not terrible, and my only comparison is my very naive C implementation that does it a little under 4s - so that or ideally under has been my goal.
Here is the RTS from the above code:
33,435,345,688 bytes allocated in the heap
14,963,640 bytes copied during GC
54,640 bytes maximum residency (77 sample(s))
21,136 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 64604 colls, 0 par 0.20s 0.25s 0.0000s 0.0001s
Gen 1 77 colls, 0 par 0.00s 0.01s 0.0001s 0.0006s
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.09s ( 5.27s elapsed)
GC time 0.21s ( 0.26s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.29s ( 5.52s elapsed)
%GC time 3.9% (4.6% elapsed)
Alloc rate 6,574,783,667 bytes per MUT second
Productivity 96.1% of total user, 92.1% of total elapsed
Pretty gnarly results. When I remove the concatMap and just copy everything over with the headers every 2048 bytes, it's practically instant:
70,983,992 bytes allocated in the heap
48,912 bytes copied during GC
54,640 bytes maximum residency (2 sample(s))
19,744 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 204 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.01s ( 0.07s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.02s ( 0.07s elapsed)
%GC time 9.6% (2.9% elapsed)
Alloc rate 5,026,838,892 bytes per MUT second
Productivity 89.8% of total user, 22.3% of total elapsed
So I guess my question is two fold:
How can I improve the overall performance?
And had the bottleneck not been so obvious, what are some ways I could have tracked it down?
Thank you.
Edit
Here's the final code and RTS if anyone is interested! I was also able to find additional bottlenecks by making use of ghc's profiler with -prof -auto-all -caf-all after reading up on the Profiling and optimization chapter of Real World Haskell.
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import System.Environment
import Control.Monad
import Data.Monoid
import qualified Data.ByteString.Builder as BU
import qualified Data.ByteString.Lazy.Char8 as BL
main :: IO ()
main = do [from, to] <- getArgs
withFile from ReadMode $ \inH ->
withFile to WriteMode $ \outH ->
loop (BL.hGet inH 2048) (process outH) BL.null
loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)
upConcatMap :: Monoid c => (Char -> c) -> BL.ByteString -> c
upConcatMap f bs = mconcat . map f $ BL.unpack bs
process :: Handle -> BL.ByteString -> IO ()
process h bs | BL.null bs = return ()
| otherwise = BU.hPutBuilder h frame
where header = "P6\n32 64\n255\n"
bs' = BU.toLazyByteString $ upConcatMap trip bs
frame = BU.lazyByteString $ mappend header bs'
trip c = let b = BU.char8 c in mconcat [b, b, b]
6,383,263,640 bytes allocated in the heap
18,596,984 bytes copied during GC
54,640 bytes maximum residency (2 sample(s))
31,056 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11165 colls, 0 par 0.06s 0.06s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.69s ( 0.83s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.75s ( 0.89s elapsed)
%GC time 7.4% (7.2% elapsed)
Alloc rate 9,194,103,284 bytes per MUT second
Productivity 92.6% of total user, 78.0% of total elapsed
What about Builder?
This version is ~5x faster for me:
process :: Handle -> B.ByteString -> IO ()
process h bs
| B.null bs = return ()
| otherwise = B.hPut h header >> B.hPutBuilder h bs'
where header = "P6\n32 64\n255\n" :: B.ByteString
bs' = mconcat $ map triple $ B.unpack bs
triple c = let b = B.char8 c in mconcat [b, b, b]
It allocates much less garbage.
ADD: for the reference, runtime statistics:
4,642,746,104 bytes allocated in the heap
390,110,640 bytes copied during GC
63,592 bytes maximum residency (2 sample(s))
21,648 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8992 colls, 0 par 0.54s 0.63s 0.0001s 0.0017s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.98s ( 1.13s elapsed)
GC time 0.54s ( 0.63s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.52s ( 1.76s elapsed)
%GC time 35.4% (36.0% elapsed)
Alloc rate 4,718,237,910 bytes per MUT second
Productivity 64.6% of total user, 55.9% of total elapsed
Use a Builder to concatenate your ByteString from smaller ones and it will go faster. It's in the ByteString documentation.
Looking at the source, concatMap goes via a list:
concatMap :: (Word8 -> ByteString) -> ByteString -> ByteString
concatMap f = concat . foldr ((:) . f) []
And concat has to do a decent amount of work. Looks like the Builder advice is good.

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out of memory error. I do not understand why this happens (the same strategy in java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
public static double calc(double p) {
double[] u = new double[10001];
double[] s = new double[10001];
u[0] = 0.0;
u[1] = 1.0;
s[0] = 0.0;
s[1] = 0.0;
for (int n=2;n<10001;n++) {
double q = 1.0;
double denom = 1.0;
for (int k = 1; k <= n; k++ ) {
denom = denom * (1.0 - p);
}
denom = 1.0 - denom;
s[n] = (double) n;
u[n] = (double) n;
for (int k = 1; k <= n; k++ ) {
q = (1.0 - p) * q;
if (k<n) {
double qs = (1.0-q)/denom;
double bs = (1.0-qs)*s[n-k] + qs*(s[k]+ u[n-k]) + 1.0;
if (bs < s[n]) {
s[n] = bs;
}
}
double bu = (1.0-q)*s[k] + 1.0 + u[n-k];
if (bu < u[n]) {
u[n] = bu;
}
}
}
return u[10000];
}
public static void main(String[] args) {
double s = 0.0;
int i = 2;
//for (int i = 1; i<51; i++) {
s = s + calc(i*0.01);
//}
System.out.println("result = " + s);
}
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment
main = do
(arg:_) <- getArgs
let n = read arg
print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time the proportion of the time spent in garbage collection and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)
main = do
(arg:_) <- getArgs
let n = read arg
forM_ [1..n] $ \i -> mua i `seq` return ()
print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs) but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order so you would know exactly which computations need to be performed. It would interesting to see your Java code to see which computations it is performing.

Why does this Haskell program perform so poorly?

I'm a Haskell newbie and I'm lost as to how this program performs so poorly. I tried forcing strict variables in various places but it doesn't seem to make a difference.
Here is my code (the purpose of this program is produce the frequencies of the input bytes found from standard input):
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO, killThread)
import Control.Concurrent.MVar
import qualified Data.IntMap as IntMap
import Data.IntMap.Strict (IntMap)
import Control.Monad.Fix
import Control.Monad (when)
import qualified Data.Char as Char
import qualified System.IO as IO
import System.IO (hSetBinaryMode, hFlush)
import Data.List as List
import Text.PrettyPrint.Boxes as Boxes
import Text.Printf (printf)
import Data.Function
data BFreq = BFreq Integer (IntMap Integer)
main :: IO ()
main = do
putStrLn "analyze data from stdin"
hSetBinaryMode IO.stdin True
mv <- newEmptyMVar
tid <- forkIO $ statusUpdater mv
bf <- run mv
killThread tid
displayResults bf
resultTable :: [[String]] -> Box
resultTable rows =
Boxes.hsep 4 Boxes.left boxed_cols
where
cols = transpose rows
boxed_cols = map (Boxes.vcat Boxes.left . map text) cols
displayResults :: BFreq -> IO ()
displayResults (BFreq n counts) = do
putStrLn $ "read " ++ (show n) ++ " bytes"
when (n > 0) (displayFreqs n counts)
displayFreqs :: Integer -> IntMap Integer -> IO ()
displayFreqs n counts =
do
putStrLn "frequencies:"
Boxes.printBox $ resultTable rows
where
cmp x y = compare (snd y) (snd x)
sorted_counts = List.sortBy cmp $ IntMap.assocs counts
intdiv :: Integer -> Integer -> Float
intdiv a b = (fromIntegral a) / (fromIntegral b)
percent y = printf "%.2f" (100*intdiv y n)
show_byte x = (show $ Char.chr x) ++ " (" ++ (show x) ++ "):"
show_count y = (percent y) ++ "% (" ++ (show y) ++ ")"
rows = map (\(x,y) -> [show_byte x, show_count y]) sorted_counts
run :: MVar Integer -> IO BFreq
run mv =
fn mv 0 IntMap.empty
where
fn mv !n !mp =
do
tryPutMVar mv n
eof <- IO.isEOF
if eof
then return $ BFreq n mp
else do
b <- getChar
fn mv (1+n) (new_map b)
where
k x = Char.ord x
old_val x = IntMap.findWithDefault 0 (k x) mp
new_map x = IntMap.insert (k x) ((old_val x)+1) mp
statusUpdater :: MVar Integer -> IO ()
statusUpdater mv =
do
takeMVar mv >>= print_progress
statusUpdater mv
where
print_progress n =
do
putStr $ "\rbytes: "
when (gbs > 0) $ putStr $ (show gbs) ++ " GBs "
when (mbs > 0) $ putStr $ (show mbs) ++ " MBs "
when (kbs > 0) $ putStr $ (show kbs) ++ " KBs "
when (gbs < 1 && mbs < 1 && kbs < 1) $ putStr $ (show bs) ++ " Bs "
hFlush IO.stdout
where
(gbs, gbr) = quotRem n 0x40000000
(mbs, mbr) = quotRem gbr 0x100000
(kbs, bs) = quotRem mbr 0x400
here is what happens when I run it (note: I am compiling with -O2):
$> cabal build -v
creating dist/build
creating dist/build/autogen
Building bfreq-0.1.0.0...
Preprocessing executable 'bfreq' for bfreq-0.1.0.0...
Building executable bfreq...
creating dist/build/bfreq
creating dist/build/bfreq/bfreq-tmp
/usr/bin/ghc --make -o dist/build/bfreq/bfreq -hide-all-packages -fbuilding-cabal-package -package-conf dist/package.conf.inplace -i -idist/build/bfreq/bfreq-tmp -i. -idist/build/autogen -Idist/build/autogen -Idist/build/bfreq/bfreq-tmp -optP-include -optPdist/build/autogen/cabal_macros.h -odir dist/build/bfreq/bfreq-tmp -hidir dist/build/bfreq/bfreq-tmp -stubdir dist/build/bfreq/bfreq-tmp -package-id base-4.5.0.0-40b99d05fae6a4eea95ea69e6e0c9702 -package-id boxes-0.1.3-e03668bca38fe3e879f9d695618ddef3 -package-id containers-0.5.3.1-80819105034e34d03d22b1c20d6fd868 -O -O2 -rtsopts -XHaskell98 ./bfreq.hs
[1 of 1] Compiling Main ( bfreq.hs, dist/build/bfreq/bfreq-tmp/Main.o )
Linking dist/build/bfreq/bfreq ...
$> cat /dev/urandom | head -c 9999999 > test_data
$> cat ./test_data | ./dist/build/bfreq/bfreq +RTS -sstderr
analyze data from stdin
bytes: 9 MBs 521 KBs read 9999999 bytes
frequencies:
'\137' (137): 0.40% (39642)
'H' (72): 0.40% (39608)
<...>
'L' (76): 0.39% (38617)
'\246' (246): 0.39% (38609)
'I' (73): 0.38% (38462)
'q' (113): 0.38% (38437)
9,857,106,520 bytes allocated in the heap
14,492,245,840 bytes copied during GC
3,406,696,360 bytes maximum residency (13 sample(s))
14,691,672 bytes maximum slop
6629 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 18348 colls, 0 par 10.90s 10.90s 0.0006s 0.0180s
Gen 1 13 colls, 0 par 15.20s 19.65s 1.5119s 12.6403s
INIT time 0.00s ( 0.00s elapsed)
MUT time 14.45s ( 14.79s elapsed)
GC time 26.10s ( 30.56s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 40.55s ( 45.35s elapsed)
%GC time 64.4% (67.4% elapsed)
Alloc rate 682,148,818 bytes per MUT second
Productivity 35.6% of total user, 31.9% of total elapsed
So unless I'm misinterpreting the above debug output, my program is using 6 GB?
The test data is less than 10 MB, so what's going on?
Any general advice on how to approach a problem like this in Haskell would be nice too. In otherwords, should I avoid Haskell for this kind of I/O-centric stuff? Should I be using the pipes library for this kind of thing?
EDIT:
Thanks for the help, correctly importing the strict version of IntMap fixes the memory issues.
I haven't been able to get the profiling (-fprof-auto) to work because it seems none of my packages are compiled for profiling. I solved the lack of profiling base libraries by installing the ghc profiling package for my OS (ubuntu: ghc-prof), but according to this I'll need to manually re-install all my haskell libraries for profiling. I don't have the time to do this at the moment, so I'm just putting this link here for the benefit of anyone who has a similar problem.
If you compile with -fprof-auto as per the GHC guide chapter on profiling, you'll see large allocation happening in run.fn.new_map and run.fn.
The code in question:
new_map x = IntMap.insert (k x) ((old_val x)+1) mp
Suspicion: ((old_val x)+1) is creating a chain of unevaluated thunks. Proposed change:
new_map x = let ov = old_val x + 1 in
ov `seq` IntMap.insert (k x) ov mp
Voila! Allocations, GC, and memory use are all way down.
EDIT: You probably intended to import qualified Data.IntMap.Strict as IntMap, making this change unnecessary.

Why Haskell function execution time measurment differs from ghc timings?

I want to measure time which Haskell spent to execute some function and use TimeIt package(also i tried these recommendations). But showed time differs from actual time application spent (I've ran application with +RTS -sstderr option):
CPU time: 4.85s
...
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.98s ( 61.69s elapsed)
GC time 0.22s ( 0.19s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.20s ( 61.89s elapsed)
Application source:
import qualified Data.ByteString.Lazy.Char8 as LBS
import System.Environment
import Data.Char
import Data.Int
import System.TimeIt
readChunks :: Int64 -> LBS.ByteString -> Int64
readChunks size str
| LBS.null str = 0
| otherwise = let (chunk, rest) = LBS.splitAt size str
in do
let len = LBS.length chunk
len `seq` len + readChunks size rest
processFile :: String -> IO()
processFile name = do
putStrLn name
content <- LBS.readFile name
let
(recNumStr, rest) = LBS.span (not.isControl) content
recNum = LBS.readInt recNumStr
case recNum of
Nothing -> putStrLn "can't parse"
Just (value, rest) -> print (value)
let chunkSize = 100*1024*1024
timeIt $ print (readChunks chunkSize rest)
UPDATE: I've found that Chronograph package shows right execution time (information taken from this question).
Well you are doing significant work that isn't being timed - it seems reasonable that this work makes up the difference, namely:
putStrLn name
content <- LBS.readFile name
let
(recNumStr, rest) = LBS.span (not.isControl) content
recNum = LBS.readInt recNumStr
case recNum of
Nothing -> putStrLn "can't parse"
Just (value, rest) -> print (value)
If you time that as well then you'll probably find most the difference. Also note there are other operations before you even enter main (which is true even for C programs).

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where
import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )
prng :: Word -> Word
prng w = w' where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
foldV :: (Unbox a, Unbox b)
=> (a -> b -> a) -- componentwise function to fold
-> Vector a -- initial accumulator value
-> [Vector b] -- data vectors
-> Vector a -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
liftV f acc = fV where
fV v = go 0 where
n = min (U.length v) (UM.length acc)
go i | i < n = step >> go (i + 1)
| otherwise = return ()
where
step = {-# SCC "fV_step" #-} do
a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
{-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b
kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
acc = U.map (\z -> (z, 0.0)) v
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
!y = x - c
!s' = s + y
!c' = (s' - s) - y
{-# NOINLINE kahanStep #-}
zero :: U.Vector Double
zero = U.replicate 100 0.0
myLoop n = kahan $ map mkVect [1..n]
main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks # 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahanStep (double *s, double *c, double x) {
double y, t;
y = x - *c;
t = *s + y;
*c = (t - *s) - y;
*s = t;
}
void kahan(double s[], double c[]) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
kahanStep(&s[j], &c[j], w);
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As
inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed, so I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector.
Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
double fst, snd;
} DPair;
DPair kahanStep(DPair pr, double x);
kahanStep.c:
#include "kahan.h"
DPair kahanStep (DPair pr, double x) {
double y, t;
y = x - pr.snd;
t = pr.fst + y;
pr.snd = (t - pr.fst) - y;
pr.fst = t;
return pr;
}
main.c:
#include <stdint.h>
#include <stdio.h>
#include "kahan.h"
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
DPair pr;
pr.fst = s[j];
pr.snd = c[j];
pr = fun(pr,w);
s[j] = pr.fst;
c[j] = pr.snd;
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err,kahanStep);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
prng :: Word -> Word
prng w = w'
where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
!cj <- unsafeRead c j
!sj <- unsafeRead s j
let !y = fromIntegral w - cj
!t = sj + y
!w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backed of GHC, using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
main :: IO ()
main = print . elems $ runSTUArray calc
There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s It looks like all of the top level functions have this problem except prng and zero
This came up on the mailing lists and I discovered that there's a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version, which works around the bug, is as fast as the C code on my machine:
{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word
import GHC.Exts
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner !w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
| otherwise = return ()
outer (1 :: Int)
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
{- I originally used this function, which isn't quite correct.
We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
Other than working around the Word->Double bug, I've also removed extra lists to match the C version better.
I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one made to the higher-order argument of foldV, can be expensive when done frequently in a loop. In particular, it will inhibit unboxing of the function arguments, leading to increased allocation. The reason it inhibits argument unboxing is that we don't know that the function we're calling is strict in those arguments and thus we pass the arguments as e.g. (Double, Double), instead of as Double# -> Double#.
The compiler can figure out the strictness information if the loop (e.g. foldV) meets the loop body (e.g. kahanStep). For that reason I recommend that people INLINE higher-order functions. In this case, inlining foldV and removing the NOINLINE on kahanStep improves the runtime quite a bit for me.
This doesn't bring the performance on par with C in this case, as there are other things going on (as others have commented on), but it's a step in the right direction (and it's a step you can do without every having to look at profiling output).

Resources