Performance problem with Euler problem and recursion on Int64 types - Windows

I'm currently learning Haskell using the Project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar
programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation and I make sure that the updates of the accumulator variable keeping track of the arc length are strict. Yet it takes almost one and a half minutes to complete (compiled with the -O flag with GHC).
import Data.Int
arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' x y norm2 acc
    | x > y           = acc
    | norm2 < 0       = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
    | otherwise       = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)
main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
    public static void main(String args[]) {
        long n = 1 << 30;
        long x = 0;
        long y = n-1;
        long acc = 0;
        long norm2 = 0;
        long time = System.currentTimeMillis();
        while (x <= y) {
            if (norm2 < 0) {
                norm2 += 2*x + 1;
                x++;
            } else if (norm2 > 2*(n-1)) {
                norm2 += 2 - 2*(x+y);
                x--;
                y--;
            } else {
                norm2 += 2*x + 1;
                x++;
                acc++;
            }
        }
        time = System.currentTimeMillis() - time;
        System.err.println(acc);
        System.err.println(time);
    }
}
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried six different versions: with Int64 or Int, and with bangs before either norm2 or both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)

(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
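For concreteness, the fastest variant above, (Int !norm2 !acc), looks like this (the algorithm is unchanged; only the type and the bang patterns differ from the original):

{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' !x !y !norm2 !acc
    | x > y           = acc
    | norm2 < 0       = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
    | otherwise       = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main :: IO ()
main = print $ arcLength (2^29)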

Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
        pushl %eax
        movl 76(%esp),%eax
        pushl %eax
        call _hs_gtInt64
        addl $16,%esp
So this looks very much like every operation on the Int64 gets turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to confirm that: in a 32-bit build (like the one currently shipped with the platform), you have only emulation via the FFI interface.
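A quick way to check whether a given installation is one of these affected 32-bit builds is to print the native word size (a small sketch using bitSize from Data.Bits, the pre-7.8 API):

import Data.Bits (bitSize)

-- Prints 32 on the 32-bit platform builds, where every Int64
-- operation goes through the FFI emulation described above.
main :: IO ()
main = putStrLn $ "native Int width: " ++ show (bitSize (0 :: Int)) ++ " bits"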

Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
843298604
6630
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
843298604
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3 arc.hs
% time ./arc
843298604
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know which compiler version you are using.

There are a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go x y acc norm2
        | x <= y = case () of { _
            | norm2 < 0         -> go (x+1) y acc (norm2 + 2*x + 1)
            | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
            | otherwise         -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
            }
        | otherwise = acc

main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
main_$s$wgo
  :: Int#
  -> Int#
  -> Int#
  -> Int#
  -> Int#
main_$s$wgo =
  \ (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                  (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.03):
./A 5.95s user 0.01s system 99% cpu 5.980 total
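To reproduce these dumps, the plain GHC flags are enough (ghc-core from Hackage is essentially a friendlier pager over the same output); assuming the source file is A.hs:

ghc -O2 -ddump-simpl A.hs
ghc -O2 -ddump-asm A.hs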
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
        cmpq %rdi, %r8
        jg .L8
.L3:
        testq %r14, %r14
        movq %r14, %rdx
        js .L4
        cmpq $2147483646, %r14
        jle .L9
.L5:
        leaq (%rdi,%r8), %r10
        addq $2, %rdx
        leaq -1(%rdi), %rdi
        addq %r10, %r10
        movq %rdx, %r14
        leaq -1(%r8), %r8
        subq %r10, %r14
        jmp Main_mainzuzdszdwgo_info
.L9:
        leaq 1(%r14,%r8,2), %r14
        addq $1, %rsi
        leaq 1(%r8), %r8
        jmp Main_mainzuzdszdwgo_info
.L8:
        movq %rsi, %rbx
        jmp *0(%rbp)
.L4:
        leaq 1(%r14,%r8,2), %r14
        leaq 1(%r8), %r8
        jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comments above, is that it is something to do with the version of libgmp you have on 32-bit Windows generating poor code for 64-bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends; then, if you still have an issue with Int64, file a bug report on GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
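If the program has to stay portable across word sizes, one possible workaround is to pick the integer type at compile time (a sketch using GHC's MachDeps.h header, which defines the WORD_SIZE_IN_BITS macro):

{-# LANGUAGE CPP #-}
#include "MachDeps.h"

import Data.Int (Int64)

-- Use the native Int where it is already 64 bits wide; fall back
-- to the emulated Int64 only on 32-bit builds, where it is slow.
#if WORD_SIZE_IN_BITS >= 64
type I64 = Int
#else
type I64 = Int64
#endif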

The normal optimization flag for performance-concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2; it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
843298604
3880
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.

dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"

I've played with the code a little, and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' :: Int -> Int -> Int -> Int -> Int
  arcLength' !x !y !norm2 !acc
    | x > y           = acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
    | norm2 < 0       = arcLength' (succ x) y (norm2 + x*2 + 1) acc
    | otherwise       = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)
Compiled and run:
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
843298604
real 0m3.553s
user 0m3.539s
sys 0m0.006s

Related

Haskell program runs very slow

I wrote my first program calculating prime numbers. However, it runs really slowly, and I can't figure out why. I wrote similar code in Java, and for n = 10000 the Java program doesn't take any time, while the Haskell program takes like 2 minutes.
import Data.List

main = do
  print "HowManyPrimes? - OnlyInteger"
  inputNumber <- getLine
  let x = (read inputNumber :: Int)
  print (firstNPrimes x)

-- prime - algorithm
primeNumber :: Int -> Bool
primeNumber 2 = True
primeNumber x = primNumberRec x (div x 2)

primNumberRec :: Int -> Int -> Bool
primNumberRec x y
  | y == 0 = False
  | y == 1 = True
  | mod x y == 0 = False
  | otherwise = primNumberRec x (y-1)

-- prime numbers till n
primesTillN :: Int -> [Int]
primesTillN n = 2:[ x | x <- [3,5..n], primeNumber x ]

-- firstNPrimes
firstNPrimes :: Int -> [Int]
firstNPrimes 0 = []
firstNPrimes n = 2: take (n-1) [x | x <- [3,5..], primeNumber x]
Thanks in advance.
Similar Java code:
import java.util.Scanner;

public class PrimeNumbers {
    static Scanner scan = new Scanner(System.in);

    public boolean primeAlgorithm(int x) {
        if (x < 2)
            return false;
        return primeAlgorithm(x, (int) Math.sqrt(x));
    }

    public boolean primeAlgorithm(int x, int divider) {
        if (divider == 1)
            return true;
        if (x % divider == 0)
            return false;
        return primeAlgorithm(x, divider - 1);
    }

    public static void main(String[] args) {
        PrimeNumbers p = new PrimeNumbers();
        int howManyPrimes = scan.nextInt();
        int number = 3;
        while (howManyPrimes != 0) {
            if (p.primeAlgorithm(number)) {
                System.out.print(number + " ");
                howManyPrimes--;
            }
            number += 2;
        }
    }
}
When doing timing measurements, always compile; ghci is designed for a fast change-rebuild-run loop, not for speedy execution of the produced code. However, even after following this advice there is a huge timing difference between your two snippets.
The key difference between your Java and your Haskell is using sqrt instead of dividing by 2. Your originals, on my machine:
% javac Test.java && echo 10000 | /usr/bin/time java Test >/dev/null
0.21user 0.02system 0:00.13elapsed 186%CPU (0avgtext+0avgdata 38584maxresident)k
0inputs+0outputs (0major+5823minor)pagefaults 0swaps
% ghc -O2 test && echo 10000 | /usr/bin/time ./test >/dev/null
8.85user 0.00system 0:08.87elapsed 99%CPU (0avgtext+0avgdata 4668maxresident)k
0inputs+0outputs (0major+430minor)pagefaults 0swaps
So 0.2s for Java, 8.9s for Haskell. After switching to using the square root with the following change:
- primeNumber x = primNumberRec x (div x 2)
+ primeNumber x = primNumberRec x (ceiling (sqrt (fromIntegral x)))
I get the following timing for the Haskell:
% ghc -O2 test && echo 10000 | /usr/bin/time ./test >/dev/null
0.07user 0.00system 0:00.07elapsed 98%CPU (0avgtext+0avgdata 4560maxresident)k
0inputs+0outputs (0major+427minor)pagefaults 0swaps
Now 3x faster than the Java code. (And of course there are significantly better algorithms that will make it even faster still.)
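For example, one standard improvement is to trial-divide only by earlier primes, stopping at the square root (a sketch, not a drop-in replacement for the code above):

-- Trial division by earlier primes only: for each candidate x we
-- test far fewer divisors than every number up to sqrt x.
primes :: [Int]
primes = 2 : filter isPrime [3,5..]
  where
    isPrime x = all (\p -> x `mod` p /= 0)
              $ takeWhile (\p -> p*p <= x) primes

firstNPrimes :: Int -> [Int]
firstNPrimes n = take n primes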
Compile it!
Haskell code in GHCi is far from optimised; try to compile it into a binary with ghc -o prime prime.hs or even better use -O2 optimisation. I had a script once that took 5min in GHCi but mere seconds once compiled.

F# Performance Impact of Checked Calcs?

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance it's usually not a good idea to include Seq, as Seq adds lots of overhead (at least compared to int operations); you risk that most of the time is spent in Seq, not in the code you'd like to test.
I wrote a small test program for (+):
let clock =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds

let dbreak () = System.Diagnostics.Debugger.Break ()

let time a =
    let b = clock ()
    let r = a ()
    let n = clock ()
    let d = n - b
    d, r

module Unchecked =
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

module Checked =
    open Checked
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

[<EntryPoint>]
let main argv =
    let count = 1000000000
    let testCases =
        [|
            "Unchecked" , Unchecked.run
            "Checked"   , Checked.run
        |]
    for nm, a in testCases do
        printfn "Running %s ..." nm
        let ms, r = time (a count)
        printfn "... it took %d ms, result is %A" ms r
    0
The performance results are this:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. Since the cost of an int add should be less than the loop overhead, the overhead of Checked is higher than 2x, maybe closer to 4x.
Out of curiosity we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo, which jumps to code that raises OverflowException if the overflow flag is set.
Instruction timing charts show that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPUs can execute certain instructions in parallel.
The charts also show that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize in this case, and the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of the additions changed. With checked additions the order of the operations matters, but with unchecked additions the jitter (and the CPU) has greater flexibility in moving the operations around, possibly improving performance.
So, long story short: for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception. The cost of a .NET exception is probably around 100,000 times that of an integer addition.

Why does this Haskell program perform so poorly?

I'm a Haskell newbie and I'm lost as to why this program performs so poorly. I tried forcing strict variables in various places, but it doesn't seem to make a difference.
Here is my code (the purpose of this program is to produce the frequencies of the input bytes found on standard input):
{-# LANGUAGE BangPatterns #-}

import Control.Concurrent (forkIO, killThread)
import Control.Concurrent.MVar
import qualified Data.IntMap as IntMap
import Data.IntMap.Strict (IntMap)
import Control.Monad.Fix
import Control.Monad (when)
import qualified Data.Char as Char
import qualified System.IO as IO
import System.IO (hSetBinaryMode, hFlush)
import Data.List as List
import Text.PrettyPrint.Boxes as Boxes
import Text.Printf (printf)
import Data.Function

data BFreq = BFreq Integer (IntMap Integer)

main :: IO ()
main = do
  putStrLn "analyze data from stdin"
  hSetBinaryMode IO.stdin True
  mv <- newEmptyMVar
  tid <- forkIO $ statusUpdater mv
  bf <- run mv
  killThread tid
  displayResults bf

resultTable :: [[String]] -> Box
resultTable rows =
  Boxes.hsep 4 Boxes.left boxed_cols
  where
    cols = transpose rows
    boxed_cols = map (Boxes.vcat Boxes.left . map text) cols

displayResults :: BFreq -> IO ()
displayResults (BFreq n counts) = do
  putStrLn $ "read " ++ (show n) ++ " bytes"
  when (n > 0) (displayFreqs n counts)

displayFreqs :: Integer -> IntMap Integer -> IO ()
displayFreqs n counts =
  do
    putStrLn "frequencies:"
    Boxes.printBox $ resultTable rows
  where
    cmp x y = compare (snd y) (snd x)
    sorted_counts = List.sortBy cmp $ IntMap.assocs counts
    intdiv :: Integer -> Integer -> Float
    intdiv a b = (fromIntegral a) / (fromIntegral b)
    percent y = printf "%.2f" (100*intdiv y n)
    show_byte x = (show $ Char.chr x) ++ " (" ++ (show x) ++ "):"
    show_count y = (percent y) ++ "% (" ++ (show y) ++ ")"
    rows = map (\(x,y) -> [show_byte x, show_count y]) sorted_counts

run :: MVar Integer -> IO BFreq
run mv =
  fn mv 0 IntMap.empty
  where
    fn mv !n !mp =
      do
        tryPutMVar mv n
        eof <- IO.isEOF
        if eof
          then return $ BFreq n mp
          else do
            b <- getChar
            fn mv (1+n) (new_map b)
      where
        k x = Char.ord x
        old_val x = IntMap.findWithDefault 0 (k x) mp
        new_map x = IntMap.insert (k x) ((old_val x)+1) mp

statusUpdater :: MVar Integer -> IO ()
statusUpdater mv =
  do
    takeMVar mv >>= print_progress
    statusUpdater mv
  where
    print_progress n =
      do
        putStr $ "\rbytes: "
        when (gbs > 0) $ putStr $ (show gbs) ++ " GBs "
        when (mbs > 0) $ putStr $ (show mbs) ++ " MBs "
        when (kbs > 0) $ putStr $ (show kbs) ++ " KBs "
        when (gbs < 1 && mbs < 1 && kbs < 1) $ putStr $ (show bs) ++ " Bs "
        hFlush IO.stdout
      where
        (gbs, gbr) = quotRem n 0x40000000
        (mbs, mbr) = quotRem gbr 0x100000
        (kbs, bs) = quotRem mbr 0x400
Here is what happens when I run it (note: I am compiling with -O2):
$> cabal build -v
creating dist/build
creating dist/build/autogen
Building bfreq-0.1.0.0...
Preprocessing executable 'bfreq' for bfreq-0.1.0.0...
Building executable bfreq...
creating dist/build/bfreq
creating dist/build/bfreq/bfreq-tmp
/usr/bin/ghc --make -o dist/build/bfreq/bfreq -hide-all-packages -fbuilding-cabal-package -package-conf dist/package.conf.inplace -i -idist/build/bfreq/bfreq-tmp -i. -idist/build/autogen -Idist/build/autogen -Idist/build/bfreq/bfreq-tmp -optP-include -optPdist/build/autogen/cabal_macros.h -odir dist/build/bfreq/bfreq-tmp -hidir dist/build/bfreq/bfreq-tmp -stubdir dist/build/bfreq/bfreq-tmp -package-id base-4.5.0.0-40b99d05fae6a4eea95ea69e6e0c9702 -package-id boxes-0.1.3-e03668bca38fe3e879f9d695618ddef3 -package-id containers-0.5.3.1-80819105034e34d03d22b1c20d6fd868 -O -O2 -rtsopts -XHaskell98 ./bfreq.hs
[1 of 1] Compiling Main ( bfreq.hs, dist/build/bfreq/bfreq-tmp/Main.o )
Linking dist/build/bfreq/bfreq ...
$> cat /dev/urandom | head -c 9999999 > test_data
$> cat ./test_data | ./dist/build/bfreq/bfreq +RTS -sstderr
analyze data from stdin
bytes: 9 MBs 521 KBs read 9999999 bytes
frequencies:
'\137' (137): 0.40% (39642)
'H' (72): 0.40% (39608)
<...>
'L' (76): 0.39% (38617)
'\246' (246): 0.39% (38609)
'I' (73): 0.38% (38462)
'q' (113): 0.38% (38437)
9,857,106,520 bytes allocated in the heap
14,492,245,840 bytes copied during GC
3,406,696,360 bytes maximum residency (13 sample(s))
14,691,672 bytes maximum slop
6629 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 18348 colls, 0 par 10.90s 10.90s 0.0006s 0.0180s
Gen 1 13 colls, 0 par 15.20s 19.65s 1.5119s 12.6403s
INIT time 0.00s ( 0.00s elapsed)
MUT time 14.45s ( 14.79s elapsed)
GC time 26.10s ( 30.56s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 40.55s ( 45.35s elapsed)
%GC time 64.4% (67.4% elapsed)
Alloc rate 682,148,818 bytes per MUT second
Productivity 35.6% of total user, 31.9% of total elapsed
So unless I'm misinterpreting the above debug output, my program is using 6 GB?
The test data is less than 10 MB, so what's going on?
Any general advice on how to approach a problem like this in Haskell would be nice too. In other words, should I avoid Haskell for this kind of I/O-centric stuff? Should I be using the pipes library for this kind of thing?
EDIT:
Thanks for the help, correctly importing the strict version of IntMap fixes the memory issues.
I haven't been able to get the profiling (-fprof-auto) to work, because it seems none of my packages are compiled for profiling. I solved the lack of profiling base libraries by installing the GHC profiling package for my OS (Ubuntu: ghc-prof), but according to this I'll need to manually re-install all my Haskell libraries for profiling. I don't have the time to do this at the moment, so I'm just putting this link here for the benefit of anyone who has a similar problem.
If you compile with -fprof-auto as per the GHC guide chapter on profiling, you'll see large allocation happening in run.fn.new_map and run.fn.
The code in question:
new_map x = IntMap.insert (k x) ((old_val x)+1) mp
Suspicion: ((old_val x)+1) is creating a chain of unevaluated thunks. Proposed change:
new_map x = let ov = old_val x + 1
            in ov `seq` IntMap.insert (k x) ov mp
Voila! Allocations, GC, and memory use are all way down.
EDIT: You probably intended to import qualified Data.IntMap.Strict as IntMap, making this change unnecessary.
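For reference, a minimal sketch of that corrected import, combined with insertWith so the lookup and the insert happen in a single traversal (newMap here is a hypothetical standalone counterpart of the original new_map):

import qualified Data.IntMap.Strict as IntMap
import Data.IntMap.Strict (IntMap)
import qualified Data.Char as Char

-- The strict API's insertWith forces the combined value, so no
-- chain of ((old_val x)+1) thunks can build up, and the separate
-- findWithDefault lookup disappears.
newMap :: Char -> IntMap Integer -> IntMap Integer
newMap b mp = IntMap.insertWith (+) (Char.ord b) 1 mp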

How much overhead do function calls have in Haskell?

I am new to Haskell and I am puzzled by the cost of a function call, which seems to be completely unreasonable to me, and makes me think I am doing something fundamentally wrong.
Consider the following Haskell code:
module Main where
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
Compiling this with the command
ghc -O3 -XBangPatterns -o tsths tst.hs
and running it, I get:
real 0m15.904s
user 0m15.853s
sys 0m0.016s
If instead of calling the function logistic I calculate the expression inline:
module Main where
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (4.0*x*(1.0-x)) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
the execution time becomes:
real 0m0.838s
user 0m0.828s
sys 0m0.004s
which is exactly the same as that of the equivalent C program:
#include <stdio.h>

int main() {
    int i, num = 100000000;
    double x = 0.7861;
    for (i = 0; i < num; ++i)
        x *= 4.0*(1.0-x);
    printf("%lg\n", x);
}
Am I doing something terribly wrong?
Many thanks.
It's a bug in GHC-7.4.1. Looking at the generated core (only the core for the function lg is important), from GHC-7.4.2 we get
Main.lg3 :: GHC.Types.Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=IF_ARGS [] 30 0}]
Main.lg3 = GHC.Float.$w$cfromRational Main.lg4 Main.lg2

Main.lg1 :: GHC.Types.Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=IF_ARGS [] 30 0}]
Main.lg1 = GHC.Float.$w$cfromRational Main.lg2 Main.lg2

Main.$wlg :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId,
 Arity=2,
 Str=DmdType LL,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [0 30] 158 0}]
Main.$wlg =
  \ (ww_s1Oy :: GHC.Prim.Double#) (ww1_s1OC :: GHC.Prim.Int#) ->
    case ww1_s1OC of ds_Xvs {
      __DEFAULT ->
        case Main.lg3 of _ { GHC.Types.D# x_awJ ->
        case Main.lg1 of _ { GHC.Types.D# x1_awV ->
        letrec {
          $wlg1_X1PF [Occ=LoopBreaker]
            :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
          [LclId, Arity=2, Str=DmdType LL]
          $wlg1_X1PF =
            \ (ww2_X1Pv :: GHC.Prim.Double#) (ww3_X1PA :: GHC.Prim.Int#) ->
              case ww3_X1PA of ds1_Xwr {
                __DEFAULT ->
                  $wlg1_X1PF
                    (GHC.Prim.*##
                       (GHC.Prim.*## x_awJ ww2_X1Pv) (GHC.Prim.-## x1_awV ww2_X1Pv))
                    (GHC.Prim.-# ds1_Xwr 1);
                0 -> ww2_X1Pv
              }; } in
        $wlg1_X1PF
          (GHC.Prim.*##
             (GHC.Prim.*## x_awJ ww_s1Oy) (GHC.Prim.-## x1_awV ww_s1Oy))
          (GHC.Prim.-# ds_Xvs 1)
        }
        };
      0 -> ww_s1Oy
    }
two top-level Doubles and a decent loop.
GHC-7.4.1 was a bit too inlining-happy; that produced
Rec {
Main.$wlg [Occ=LoopBreaker]
  :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double#
[GblId, Arity=2, Str=DmdType LL]
Main.$wlg =
  \ (ww_s1NS :: GHC.Prim.Double#) (ww1_s1NW :: GHC.Prim.Int#) ->
    case ww1_s1NW of ds_Xvb {
      __DEFAULT ->
        case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic4 Main.logistic2
        of ww2_a1Mt { __DEFAULT ->
        case GHC.Float.$wfromRat'' (-1021) 53 Main.logistic2 Main.logistic2
        of ww3_X1Nq { __DEFAULT ->
        Main.$wlg
          (GHC.Prim.*##
             (GHC.Prim.*## ww2_a1Mt ww_s1NS) (GHC.Prim.-## ww3_X1Nq ww_s1NS))
          (GHC.Prim.-# ds_Xvb 1)
        }
        };
      0 -> ww_s1NS
    }
end Rec }
and gave you two calls to the fromRational worker in each iteration.
Now, fromRational is a fairly complicated function. It is still rather slow, despite having gotten a much faster implementation in the 7.2 series, so these calls hurt big time.
With a type signature, no Rational top-level constants are produced, only Double constants, and these are then used directly, which of course avoids the gratuitous slowdown.
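To see concretely why the signature matters: this is roughly what the signature-less version elaborates to (a hand-written sketch of the desugaring, not actual GHC output):

import Data.Ratio ((%))

-- Each fractional literal becomes a fromRational call at the
-- inferred Fractional type, so nothing can be precomputed to a
-- plain Double constant.
logistic :: Fractional a => a -> a
logistic x = fromRational (4 % 1) * x * (fromRational (1 % 1) - x)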
As suggested by Dan Burton, it is actually the overhead of a polymorphic function, because GHC infers the type logistic :: Fractional a => a -> a. If you specify the type explicitly you generally enable both better checking and better optimizations. I believe it is good practice to specify the types of functions explicitly.
If you want a function with a polymorphic type but the full speed of a monomorphic call for a specific usage, you can use the SPECIALIZE pragma, but I believe this is GHC specific.
{-# LANGUAGE BangPatterns #-}
module Main where
logistic :: Fractional a => a -> a
{-# SPECIALISE logistic :: Double -> Double #-}
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
Also note that you can specify a LANGUAGE pragma at the beginning of the file to enable bang patterns, so you don't need to enable them on the command line.
Times on my machine were 21 seconds for the original, 0.67 sec with the explicit type, and 0.7 sec with SPECIALISE (which is basically the same).
I believe the overhead of the specialised call is very small, because it is just a bunch of instructions that get inlined anyway, whereas the polymorphic function results in a real call. It is odd, though, that GHC can't inline despite the polymorphism.
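If you want to keep the polymorphic type but still let GHC specialise it at each monomorphic call site (including ones in other modules), the INLINABLE pragma is an alternative; a sketch, also GHC-specific:

logistic :: Fractional a => a -> a
{-# INLINABLE logistic #-}  -- keeps the unfolding available for
                            -- call-site specialisation
logistic x = 4.0*x*(1.0-x)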
Add a type signature to logistic and you will see the speedup. Allow me to use CPP to demonstrate the difference.
bash> cat tst.hs
module Main where
#if defined(SIG)
logistic :: Double -> Double
#endif
logistic x = 4.0*x*(1.0-x)
lg :: Double -> Int -> Double
lg !x 0 = x
lg !x !n = lg (logistic x) (n-1)
main = putStrLn $ show $ lg 0.7861 100000000
bash> ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.4.1
If compiled without SIG defined (the type signature is excluded):
bash> ghc -O3 -XBangPatterns -XCPP -o tsths tst.hs
[1 of 1] Compiling Main ( tst.hs, tst.o )
Linking tsths ...
bash> time ./tsths
0.34209286442469333
real 0m13.187s
user 0m13.177s
sys 0m0.008s
Now let's compile with SIG defined so the signature is included:
bash> rm tsths *.o *.hi
bash> ghc -O3 -XBangPatterns -XCPP -DSIG -o tsths tst.hs
[1 of 1] Compiling Main ( tst.hs, tst.o )
Linking tsths ...
bash> time ./tsths
0.34209286442469333
real 0m0.464s
user 0m0.440s
sys 0m0.020s
Not sure why GHC doesn't produce the fast version without the signature; note that the monomorphism restriction doesn't actually apply here, since logistic is a function binding with an argument, so the inferred type legitimately stays polymorphic.

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where

import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )

prng :: Word -> Word
prng w = w' where
  !w1 = w `xor` (w `shiftL` 13)
  !w2 = w1 `xor` (w1 `shiftR` 7)
  !w' = w2 `xor` (w2 `shiftL` 17)

mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

foldV :: (Unbox a, Unbox b)
      => (a -> b -> a) -- componentwise function to fold
      -> Vector a      -- initial accumulator value
      -> [Vector b]    -- data vectors
      -> Vector a      -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
  liftV f acc = fV where
    fV v = go 0 where
      n = min (U.length v) (UM.length acc)
      go i | i < n = step >> go (i + 1)
           | otherwise = return ()
        where
          step = {-# SCC "fV_step" #-} do
            a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
            b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
            {-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b

kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
  acc = U.map (\z -> (z, 0.0)) v

kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
  !y = x - c
  !s' = s + y
  !c' = (s' - s) - y
{-# NOINLINE kahanStep #-}

zero :: U.Vector Double
zero = U.replicate 100 0.0

myLoop n = kahan $ map mkVect [1..n]

main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks @ 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>

#define VDIM 100
#define VNUM 100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahanStep (double *s, double *c, double x) {
    double y, t;
    y = x - *c;
    t = *s + y;
    *c = (t - *s) - y;
    *s = t;
}

void kahan(double s[], double c[]) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            kahanStep(&s[j], &c[j], w);
            w = prng(w);
        }
    }
};

int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that I am interacting with a library that uses Data.Vector.Unboxed, so I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter-intuitive profiling stats, so I can decide whether or not to file a bug report against vector.
Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself; in Haskell you created a polymorphic higher-order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point; it will still be inlined by the compiler. Even if you put it into its own source file, compile it separately, and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
    double fst, snd;
} DPair;

DPair kahanStep(DPair pr, double x);

kahanStep.c:

#include "kahan.h"

DPair kahanStep (DPair pr, double x) {
    double y, t;
    y = x - pr.snd;
    t = pr.fst + y;
    pr.snd = (t - pr.fst) - y;
    pr.fst = t;
    return pr;
}

main.c:

#include <stdint.h>
#include <stdio.h>
#include "kahan.h"

#define VDIM 100
#define VNUM 100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            DPair pr;
            pr.fst = s[j];
            pr.snd = c[j];
            pr = fun(pr,w);
            s[j] = pr.fst;
            c[j] = pr.snd;
            w = prng(w);
        }
    }
};

int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err, kahanStep);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits

prng :: Word -> Word
prng w = w'
  where
    !w1 = w `xor` (w `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
          | j < VDIM = do
              !cj <- unsafeRead c j
              !sj <- unsafeRead s j
              let !y = fromIntegral w - cj
                  !t = sj + y
                  !w' = prng w
              unsafeWrite c j ((t-sj)-y)
              unsafeWrite s j t
              inner w' (j+1)
          | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backend of GHC; using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
As this came up on haskell-cafe again today, where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here; GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
          | j < VDIM = do
              cj <- unsafeRead c j
              sj <- unsafeRead s j
              let y = word2Double w - cj
                  t = sj + y
                  w' = prng w
              unsafeWrite c j ((t-sj)-y)
              unsafeWrite s j t
              inner w' (j+1)
          | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
    i | i < 0     -> int2Double i - correction
      | otherwise -> int2Double i

main :: IO ()
main = print . elems $ runSTUArray calc
There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s. It looks like all of the top-level functions have this problem except prng and zero.
This came up on the mailing lists and I discovered that there's a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version, which works around the bug, is as fast as the C code on my machine:
{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word
import GHC.Exts
import GHC.Float (int2Double)   -- for the conversion workaround below

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner !w j
          | j < VDIM = do
              cj <- unsafeRead c j
              sj <- unsafeRead s j
              let y = word2Double w - cj
                  t = sj + y
                  w' = prng w
              unsafeWrite c j ((t-sj)-y)
              unsafeWrite s j t
              inner w' (j+1)
          | otherwise = return ()
        outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
                | otherwise = return ()
    outer (1 :: Int)

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc

{- I originally used this function, which isn't quite correct.
   We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
    i | i < 0     -> int2Double i - correction
      | otherwise -> int2Double i
Other than working around the Word->Double bug, I've also removed extra lists to match the C version better.
I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one made to the higher-order argument of foldV, can be expensive when done frequently in a loop. In particular, it will inhibit unboxing of the function arguments, leading to increased allocation. The reason it inhibits argument unboxing is that we don't know that the function we're calling is strict in those arguments and thus we pass the arguments as e.g. (Double, Double), instead of as Double# -> Double#.
The compiler can figure out the strictness information if the loop (e.g. foldV) meets the loop body (e.g. kahanStep). For that reason I recommend that people INLINE higher-order functions. In this case, inlining foldV and removing the NOINLINE on kahanStep improves the runtime quite a bit for me.
This doesn't bring the performance on par with C in this case, as there are other things going on (as others have commented on), but it's a step in the right direction (and it's a step you can take without ever having to look at profiling output).
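As a toy illustration of that last point (sumWith here stands in for foldV; the names are made up for the example):

{-# LANGUAGE BangPatterns #-}

-- With INLINE, GHC sees the loop body (+) at the call site, works
-- out that the accumulator is used strictly, and can unbox it;
-- without it, every iteration makes an unknown call with boxed
-- arguments.
{-# INLINE sumWith #-}
sumWith :: (Double -> Double -> Double) -> [Double] -> Double
sumWith f = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (f acc x) xs

main :: IO ()
main = print $ sumWith (+) [1 .. 1000000]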
