Haskell: unnecessary reevaluations of constant expressions - performance

I am going to demonstrate the problem using the following example program
{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            in  b * (q - fromIntegral (floor q :: Int))

standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
    Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
  where k = (cos (pi/3)) - (sin (pi/3))
Here standardMap k is the parametrized function and k = (cos (pi/3)) - (sin (pi/3)) is a parameter. If I compile this program with ghc -O3 -fllvm, the execution time on my machine is approximately 42s. However, if I write k in the form 0.5 - (sin (pi/3)), the execution time drops to 21s, and with k = 0.5 - 0.5 * (sqrt 3) it takes only 12s.
The conclusion is that k is reevaluated on each call of standardMap k.
Why is this not optimized?
P.S. Compiler: GHC 7.6.3 on Arch Linux.
EDIT
For those who are concerned with the weird properties of standardMap, here is a simpler and more intuitive example that exhibits the same problem:
{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

rotate :: Double -> Point -> Point
rotate k (Point q p) =
    Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (rotate k) $ (Point 0.15 0.25)
  where --k = (cos (pi/3)) - (sin (pi/3))
        k = 0.5 - 0.5 * (sqrt 3)
EDIT
Before I asked the question I had tried making k strict, the same way Don suggests, but with ghc -O3 I didn't see a difference. The strictness solution works if the program is compiled with ghc -O2. I missed that because I hadn't tried all combinations of flags with all versions of the program.
So what is the difference between -O3 and -O2 that affects such cases?
Should I prefer -O2 in general?
EDIT
As observed by Mike Hartl and others, if rotate k is changed to rotate $ k, or standardMap k to standardMap $ k, the performance improves, though it is still not the best possible (Don's solution). Why?

As always, check the core.
With ghc -O2, k is inlined into the loop body, which is floated out as a top level function:
Main.main7 :: Main.Point -> Main.Point
Main.main7 =
  \ (ds_dAa :: Main.Point) ->
    case ds_dAa of _ { Main.Point q_alG p_alH ->
    case q_alG of _ { GHC.Types.D# x_s1bt ->
    case p_alH of _ { GHC.Types.D# y_s1bw ->
    case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
    of ww_s1bi { __DEFAULT ->
    case Main.$wfmod
           (GHC.Prim.+##
              y_s1bw
              (GHC.Prim.*##
                 (GHC.Prim.-##
                    (GHC.Prim.cosDouble# 1.0471975511965976)
                    (GHC.Prim.sinDouble# 1.0471975511965976))
                 (GHC.Prim.sinDouble# x_s1bt)))
           6.283185307179586
    of ww1_X1bZ { __DEFAULT ->
    Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
Indicating that the sin and cos calls aren't evaluated at compile time.
The result is that a bit more math is going to occur:
$ time ./A
3.1430515093368085
real 0m15.590s
If you make it strict, it is at least not recalculated each time:
main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
  where
    k :: Double
    !k = (cos (pi/3)) - (sin (pi/3))
Resulting in:
ipv_sEq =
  GHC.Prim.-##
    (GHC.Prim.cosDouble# 1.0471975511965976)
    (GHC.Prim.sinDouble# 1.0471975511965976) } in
And a running time of:
$ time ./A
6.283185307179588
real 0m7.859s
Which I think is good enough for now. I'd also add unpack pragmas to the Point type.
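For reference, that change is only a pragma per strict field (a sketch; with -O2 GHC will often unpack small strict fields by itself, so it is worth measuring):

data Point = Point {-# UNPACK #-} !Double {-# UNPACK #-} !Double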
If you want to reason about numeric performance under different code arrangements, you must inspect the Core.
Using your revised example: it suffers the same issue, since k is inlined into rotate. GHC thinks it is really cheap, when in this benchmark it is comparatively expensive.
Naively, with ghc-7.2.3 -O2:
$ time ./A
0.1470480616244365
real 0m22.897s
And k is evaluated each time rotate is called.
Make k strict: one way to force it to be evaluated once, rather than recomputed on each call.
$ time ./A
0.14704806100839019
real 0m2.360s
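For the rotate variant that is the same change as in the standardMap version above (a sketch of what I take these timings to correspond to):

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (rotate k) $ (Point 0.15 0.25)
  where
    k :: Double
    !k = cos (pi/3) - sin (pi/3)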
Using UNPACK pragmas on the Point constructor:
$ time ./A
0.14704806100839019
real 0m1.860s

I don't think it is repeated evaluation.
First, I switched to "do" notation and used a "let" on the definition of "k" which I figured should help. No - still slow.
Then I added a trace call - it was only being evaluated once. I even checked that the fast variant was in fact producing a Double.
Then I printed out both variations. There is a small difference in the starting values.
Tweaking the value of the "slow" variant makes it the same speed. I've no idea what your algorithm is for - would it be very sensitive to starting values?
import Debug.Trace (trace)
...
main = do
    -- is -0.3660254037844386
    let k0 = (0.5 - 0.5 * (sqrt 3)) :: Double
    -- was -0.3660254037844385
    let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001
    putStrLn (show k0)
    putStrLn (show k1)
    putStrLn
      . show
      . (\(Point a b) -> a + b)
      . head . drop 100000000
      . iterate' (standardMap k1) $ (Point 0.15 0.25)
EDIT: this is the version with numeric literals. It's displaying runtimes of 23sec vs 7sec for me. I compiled two separate versions of the code to make sure I wasn't doing something stupid like not recompiling.
main = do
    -- -0.3660254037844386
    -- -0.3660254037844385
    let k2 = -0.3660254037844385
    putStrLn
      . show
      . (\(Point a b) -> a + b)
      . head . drop 100000000
      . iterate' (standardMap k2) $ (Point 0.15 0.25)
EDIT2: I don't know how to get the opcodes from ghc, but comparing the hexdumps for the two .o files shows they differ by a single byte - presumably the literal. So it can't be the runtime.
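(For what it's worth, GHC can dump what it generates directly, which avoids hexdumping the object files; the file name here is just a placeholder:)

ghc -O3 -fforce-recomp -ddump-simpl Foo.hs
ghc -O3 -fforce-recomp -ddump-asm Foo.hs

The first prints the optimised Core, the second the generated assembly.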
EDIT3: Tried turning profiling on, and that has just puzzled me even more. Unless I'm missing something, the only difference is a small discrepancy in the number of calls to fmod (fmod.q to be precise).
The "5" profile is for the constant ending in "5", and likewise for "6".
Fri Sep 6 12:37 2013 Time and Allocation Profiling Report (Final)
constant-timings-5 +RTS -p -RTS
total time = 38.34 secs (38343 ticks @ 1000 us, 1 processor)
total alloc = 12,000,105,184 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 71.0 0.0
iterate' Main 21.2 93.3
fmod Main 6.3 6.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.0 0.0
main Main 102 0 1.0 0.0 1.0 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 99.0 100.0
main Main 103 0 0.0 0.0 99.0 100.0
iterate' Main 104 100000001 21.2 93.3 99.0 100.0
standardMap Main 105 100000000 71.0 0.0 77.9 6.7
fmod Main 106 200000001 6.3 6.7 6.9 6.7
fmod.q Main 109 49999750 0.6 0.0 0.6 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Fri Sep 6 12:38 2013 Time and Allocation Profiling Report (Final)
constant-timings-6 +RTS -p -RTS
total time = 22.17 secs (22167 ticks @ 1000 us, 1 processor)
total alloc = 11,999,947,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 48.4 0.0
iterate' Main 38.2 93.3
fmod Main 10.9 6.7
main Main 1.4 0.0
fmod.q Main 1.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.4 0.0
main Main 102 0 1.4 0.0 1.4 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 98.6 100.0
main Main 103 0 0.0 0.0 98.6 100.0
iterate' Main 104 100000001 38.2 93.3 98.6 100.0
standardMap Main 105 100000000 48.4 0.0 60.4 6.7
fmod Main 106 200000001 10.9 6.7 12.0 6.7
fmod.q Main 109 49989901 1.0 0.0 1.0 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
EDIT4: Link below is to the two opcode dumps (thanks to @Tom Ellis). Although I can't read them, they seem to have the same "shape". Presumably the long random-char strings are internal identifiers. I've just recompiled both with -O2 -fforce-recomp and the time differences are real.
https://gist.github.com/anonymous/6462797

Related

Julia sparse matrix

I have a vector y_vec. How can I convert the vector into a matrix of the form Y_matrix?
y_vec = [0; 1; 1; 2; 3; 4]

Y_matrix = [1 0 0 0 0
            0 1 0 0 0
            0 1 0 0 0
            0 0 1 0 0
            0 0 0 1 0
            0 0 0 0 1]
So far, I've tried using a for loop.
Y_mat = full(spzeros(length(y_vec), length(unique(y_vec))))
for (i,j) in enumerate(1:length(y_vec))
    Y_mat[i, y_vec[j]+1] = 1
end
But there seems to be a problem when y_vec is not contiguous, say y_vec = [0; 1; 1; 2; 3; 4; 8]; then the for loop fails. How can I get around this issue?
Is there a way to solve the above problem using a sparse matrix in Julia?
You can use the sparse matrix constructor sparse(I,J,V):
y_vec = [0; 1; 1; 2; 3; 4; 8]
I = collect(1:length(y_vec))
J = y_vec+1
V = ones(length(y_vec))
S = sparse(I,J,V)
full(S)
julia> full(S)
7x9 Array{Float64,2}:
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

For loop for computing two vectors in R

Suppose I have a genotype dataset, geno:
FID rs1 rs2 rs3
1 1 0 2
2 1 1 1
3 0 1 1
4 0 1 0
5 0 0 2
Another dataset is coed:
rs1 rs2 rs3
0.6 0.2 0.3
I run the following code:
geno$rs1 <- geno$rs1 * coed$rs1
geno$rs2 <- geno$rs2 * coed$rs2
geno$rs3 <- geno$rs3 * coed$rs3
sum3 <- rowSums(geno[,c(2:4)])
c <- cbind(geno,sum3)
I get the output I want:
FID rs1 rs2 rs3 sum3
1 0.6 0 0.6 1.2
2 0.6 0.2 0.3 1.1
3 0 0.2 0.3 0.5
4 0 0.2 0 0.2
5 0 0 0.6 0.6
But I have thousands of SNPs, so I tried to build the for loop below:
snp <- names(geno)[2:4]
geno.new <- numeric(0)
for (i in snp){
    geno.new[i] = geno1[i] * coed[i]
}
The result is not what I expected:
$rs1
[1] 0.6 0.6 0.0 0.0 0.0
$rs2
[1] 0.0 0.2 0.2 0.2 0.0
$rs3
[1] 0.6 0.3 0.3 0.0 0.6
Could anyone help me improve this?
Thanks
I did find the solution, see the code below:
## read datasets
geno <- read.table("Genotype.csv", header=T, sep=",")
dim(geno)
coed <- read.table("beta.csv", header=T, sep=",")
## define the snp names
snp <- names(geno)[2:4]
## build the for loop
for (i in snp){
    geno[i] <- geno[i] * coed[i]
}
## calculate the sums
sum <- rowSums(geno[,c(2:4)])
## combine the results
all <- cbind(geno, sum)

Read numbers from stdin into a Data.Vector.Unboxed.Vector Int64

Given is a text file (for piping) with many numbers separated by spaces, like so:
234 456 345 ...
What is the best way to read them all into a Data.Vector.Unboxed.Vector Int64? My current code looks like this:
import Control.Applicative
import Control.Arrow
import Data.Int
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do
    v <- readInts <$> B.getContents
    print $ V.maximum v

-- split up for profiling
readInts :: B.ByteString -> V.Vector Int64
readInts = a >>> b >>> c >>> d

a = B.split ' '
b = mapMaybe (B.readInt >>> liftA fst)
c = map fromIntegral
d = V.fromList
Here is the profiler output
Thu Sep 18 16:08 2014 Time and Allocation Profiling Report (Final)
FastReadInts +RTS -p -K800M -RTS
total time = 0.51 secs (505 ticks @ 1000 us, 1 processor)
total alloc = 1,295,988,256 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
d Main 74.3 5.2
b Main 9.9 35.6
a Main 6.3 40.0
main Main 4.8 0.0
c Main 3.2 19.3
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 60 0 0.4 0.0 100.0 100.0
main Main 121 0 4.8 0.0 98.2 100.0
readInts Main 123 0 0.0 0.0 93.5 100.0
a Main 131 0 6.1 40.0 6.1 40.0
b Main 129 0 9.9 35.6 9.9 35.6
c Main 127 0 3.2 19.3 3.2 19.3
d Main 125 0 74.3 5.2 74.3 5.2
CAF Main 119 0 0.0 0.0 0.2 0.0
a Main 130 1 0.2 0.0 0.2 0.0
b Main 128 1 0.0 0.0 0.0 0.0
c Main 126 1 0.0 0.0 0.0 0.0
d Main 124 1 0.0 0.0 0.0 0.0
readInts Main 122 1 0.0 0.0 0.0 0.0
main Main 120 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 103 0 0.6 0.0 0.6 0.0
CAF GHC.IO.Encoding 96 0 0.2 0.0 0.2 0.0
CAF GHC.IO.Handle.Internals 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 83 0 0.2 0.0 0.2 0.0
CAF GHC.IO.Encoding.Iconv 81 0 0.2 0.0 0.2 0.0
The program is compiled and run this way:
ghc -O2 -prof -auto-all -rtsopts FastReadInts.hs
./FastReadInts +RTS -p -K800M < many_numbers.txt
many_numbers.txt is about 14 MB in size.
How can this bottleneck, i.e. V.fromList, be removed?
It is hard to answer questions like this without some expected level of performance or a point of comparison. By simply omitting the profiling, your code runs in 100 ms over a 21 MB ASCII file of random 64-bit numbers, which seems reasonable to me.
$ time ./so < randoms.txt
9223350746261547498
real 0m0.109s
user 0m0.094s
sys 0m0.013s
And the generation of the test data:
import System.Random

main = do
    g <- newStdGen
    let rs = take (2^20) $ randomRs (0, 2^64) g :: [Integer]
    writeFile "randoms.txt" $ unwords (map show rs)
EDIT:
As requested:
import Data.Vector.Unboxed.Mutable as M
...
listToVector :: [Int64] -> V.Vector Int64
listToVector ls = unsafePerformIO $ do
    m <- M.unsafeNew (2^20)
    zipWithM_ (M.unsafeWrite m) [0..(2^20)-1] ls
    V.unsafeFreeze m
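If the intermediate list itself ever turns out to matter, another option (a sketch I have not benchmarked, reusing the qualified B and V imports from the question plus Data.Char (isSpace)) is to unfold the numbers straight into the vector:

readInts :: B.ByteString -> V.Vector Int64
readInts = V.unfoldr step
  where
    -- skip the separating whitespace, then parse one Int; stop at the first
    -- position where no further number can be read
    step s = case B.readInt (B.dropWhile isSpace s) of
               Just (n, rest) -> Just (fromIntegral n, rest)
               Nothing        -> Nothing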
Just wanted to note that pre-allocating a mutable vector does not impact performance much. In most cases the run time will be dominated by reading the file.
I have benchmarked both versions on 2^23 numbers, and it seems that the pre-allocated mutable array is even a bit slower.
benchmarking V.fromList
time 49.51 ms (47.65 ms .. 51.07 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 48.24 ms (47.82 ms .. 49.01 ms)
std dev 971.5 μs (329.1 μs .. 1.438 ms)
benchmarking listToVector
time 109.9 ms (106.2 ms .. 119.9 ms)
0.993 R² (0.975 R² .. 1.000 R²)
mean 109.3 ms (107.6 ms .. 113.8 ms)
std dev 4.041 ms (1.149 ms .. 6.129 ms)
And here is the code of the benchmark:
import Control.Applicative
import Control.Monad (zipWithM_)
import System.IO.Unsafe
import Data.Int
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M
import Criterion.Main

main :: IO ()
main = do
    let readInt x = let Just (i, _) = B.readInt x in fromIntegral i
    nums <- map readInt . B.words <$> B.readFile "randoms.txt"
    defaultMain
      [ bench "V.fromList"   $ whnf (V.maximum . V.fromList)   nums
      , bench "listToVector" $ whnf (V.maximum . listToVector) nums
      ]

listToVector :: [Int64] -> V.Vector Int64
listToVector ls = unsafePerformIO $ do
    m <- M.unsafeNew (2^23)
    zipWithM_ (M.unsafeWrite m) [0..(2^23)-1] ls
    V.unsafeFreeze m

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where

import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )

prng :: Word -> Word
prng w = w' where
    !w1 = w `xor` (w `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng

foldV :: (Unbox a, Unbox b)
      => (a -> b -> a) -- componentwise function to fold
      -> Vector a      -- initial accumulator value
      -> [Vector b]    -- data vectors
      -> Vector a      -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
    liftV f acc = fV where
        fV v = go 0 where
            n = min (U.length v) (UM.length acc)
            go i | i < n     = step >> go (i + 1)
                 | otherwise = return ()
              where
                step = {-# SCC "fV_step" #-} do
                    a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
                    b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
                    {-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b

kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
    acc = U.map (\z -> (z, 0.0)) v

kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
    !y  = x - c
    !s' = s + y
    !c' = (s' - s) - y
{-# NOINLINE kahanStep #-}

zero :: U.Vector Double
zero = U.replicate 100 0.0

myLoop n = kahan $ map mkVect [1..n]

main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks @ 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>

#define VDIM 100
#define VNUM 100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahanStep (double *s, double *c, double x) {
    double y, t;
    y  = x - *c;
    t  = *s + y;
    *c = (t - *s) - y;
    *s = t;
}

void kahan(double s[], double c[]) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            kahanStep(&s[j], &c[j], w);
            w = prng(w);
        }
    }
};

int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As
inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed, I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector.
Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
    double fst, snd;
} DPair;

DPair kahanStep(DPair pr, double x);

kahanStep.c:
#include "kahan.h"

DPair kahanStep (DPair pr, double x) {
    double y, t;
    y      = x - pr.snd;
    t      = pr.fst + y;
    pr.snd = (t - pr.fst) - y;
    pr.fst = t;
    return pr;
}
main.c:
#include <stdint.h>
#include <stdio.h>
#include "kahan.h"

#define VDIM 100
#define VNUM 100000

uint64_t prng (uint64_t w) {
    w ^= w << 13;
    w ^= w >> 7;
    w ^= w << 17;
    return w;
};

void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
    for (int i = 1; i <= VNUM; i++) {
        uint64_t w = i;
        for (int j = 0; j < VDIM; j++) {
            DPair pr;
            pr.fst = s[j];
            pr.snd = c[j];
            pr = fun(pr,w);
            s[j] = pr.fst;
            c[j] = pr.snd;
            w = prng(w);
        }
    }
};

int main (int argc, char* argv[]) {
    double acc[VDIM], err[VDIM];
    for (int i = 0; i < VDIM; i++) {
        acc[i] = err[i] = 0.0;
    };
    kahan(acc, err, kahanStep);
    printf("[ ");
    for (int i = 0; i < VDIM; i++) {
        printf("%g ", acc[i]);
    };
    printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
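Concretely, the explicit inlining referred to above is just pragmas on the corresponding definitions, nothing else changed (a sketch of the change against the original module):

{-# INLINE foldV #-}
-- and, instead of the original {-# NOINLINE kahanStep #-}:
{-# INLINE kahanStep #-}
-- plus an {-# INLINE step #-} on the local step binding inside foldV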
But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits

prng :: Word -> Word
prng w = w'
  where
    !w1 = w `xor` (w `shiftL` 13)
    !w2 = w1 `xor` (w1 `shiftR` 7)
    !w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
            | j < VDIM  = do
                !cj <- unsafeRead c j
                !sj <- unsafeRead s j
                let !y  = fromIntegral w - cj
                    !t  = sj + y
                    !w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backend of GHC; using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, which was immediately deconstructed and its components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
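To make that concrete, one way to avoid the intermediate tuple (shown only to illustrate the boxing point; it is not the change the code below makes) is a strict pair type with unpacked fields:

-- A strict pair whose fields GHC can keep unboxed, so no (Double, Double)
-- tuple is allocated per step. SP and kahanStepU are names made up here.
data SP = SP {-# UNPACK #-} !Double {-# UNPACK #-} !Double

kahanStepU :: SP -> Double -> SP
kahanStepU (SP s c) x = SP s' c'
  where
    y  = x - c
    s' = s + y
    c' = (s' - s) - y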
As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner w j
            | j < VDIM  = do
                cj <- unsafeRead c j
                sj <- unsafeRead s j
                let y  = word2Double w - cj
                    t  = sj + y
                    w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
    forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
    i | i < 0     -> int2Double i - correction
      | otherwise -> int2Double i

main :: IO ()
main = print . elems $ runSTUArray calc
There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s. It looks like all of the top-level functions have this problem except prng and zero.
This came up on the mailing lists and I discovered that there's a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version, which works around the bug, is as fast as the C code on my machine:
{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where

#define VDIM 100
#define VNUM 100000

import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word
import GHC.Exts
import GHC.Float (int2Double)

prng :: Word -> Word
prng w = w'
  where
    w1 = w `xor` (w `shiftL` 13)
    w2 = w1 `xor` (w1 `shiftR` 7)
    w' = w2 `xor` (w2 `shiftL` 17)

type Vec s = STUArray s Int Double

kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
    let inner !w j
            | j < VDIM  = do
                cj <- unsafeRead c j
                sj <- unsafeRead s j
                let y  = word2Double w - cj
                    t  = sj + y
                    w' = prng w
                unsafeWrite c j ((t-sj)-y)
                unsafeWrite s j t
                inner w' (j+1)
            | otherwise = return ()
        outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
                | otherwise = return ()
    outer (1 :: Int)

calc :: ST s (Vec s)
calc = do
    s <- newArray (0,VDIM-1) 0
    c <- newArray (0,VDIM-1) 0
    kahan s c
    return s

main :: IO ()
main = print . elems $ runSTUArray calc

{- I originally used this function, which isn't quite correct.
   We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}

correction :: Double
correction = 2 * int2Double minBound

word2Double :: Word -> Double
word2Double w = case fromIntegral w of
    i | i < 0     -> int2Double i - correction
      | otherwise -> int2Double i
Other than working around the Word->Double bug, I've also removed extra lists to match the C version better.
I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one made to the higher-order argument of foldV, can be expensive when done frequently in a loop. In particular, it will inhibit unboxing of the function arguments, leading to increased allocation. The reason it inhibits argument unboxing is that we don't know that the function we're calling is strict in those arguments, and thus we pass the arguments boxed, e.g. as (Double, Double), instead of as unboxed Double# values.
The compiler can figure out the strictness information if the loop (e.g. foldV) meets the loop body (e.g. kahanStep). For that reason I recommend that people INLINE higher-order functions. In this case, inlining foldV and removing the NOINLINE on kahanStep improves the runtime quite a bit for me.
This doesn't bring the performance on par with C in this case, as there are other things going on (as others have commented on), but it's a step in the right direction (and it's a step you can take without ever having to look at profiling output).

Is performance of partial or curried functions well defined in Haskell?

In the following code:
ismaxl :: (Ord a) => [a] -> a -> Bool
ismaxl l x = x == maxel
    where maxel = maximum l

main = do
    let mylist = [1, 2, 3, 5]
    let ismax = ismaxl mylist
    --Is each call O(1)? Does each call remember maxel?
    let c1 = ismax 1
    let c2 = ismax 2
    let c3 = ismax 3
    let c5 = ismax 5
    putStrLn (show [c1, c2, c3, c5])
Does the partial application ismax compute maxel? Specifically, can someone point to a rule about the complexity of partially applied functions in Haskell? MUST the compiler call maximum only once in the above example? Put another way, does a partial application keep references from prior calls for internal where clauses?
I have some CPU-bound code that is not performing acceptably, and I'm looking for possible errors in my reasoning about the complexity.
As a demonstration of what you can learn from profiling your Haskell code, here's the result of some minor modifications to your code. First, I've replaced mylist with [0..10000000] to make sure it takes a while to compute the maximum.
Here are some lines from the profiling output after running that version:
COST CENTRE MODULE %time %alloc
ismaxl Main 55.8 0.0
main Main 44.2 100.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main_c5 Main 225 1 0.0 0.0 15.6 0.0
main Main 249 0 0.0 0.0 15.6 0.0
ismaxl Main 250 1 15.6 0.0 15.6 0.0
CAF:main_c3 Main 224 1 0.0 0.0 15.6 0.0
main Main 246 0 0.0 0.0 15.6 0.0
ismaxl Main 247 1 15.6 0.0 15.6 0.0
CAF:main_c2 Main 223 1 0.0 0.0 14.3 0.0
main Main 243 0 0.0 0.0 14.3 0.0
ismaxl Main 244 1 14.3 0.0 14.3 0.0
CAF:main_c1 Main 222 1 0.0 0.0 10.4 0.0
main Main 239 0 0.0 0.0 10.4 0.0
ismaxl Main 240 1 10.4 0.0 10.4 0.0
CAF:main8 Main 221 1 0.0 0.0 44.2 100.0
main Main 241 0 44.2 100.0 44.2 100.0
It's pretty obviously recomputing the maximum here.
Now, replacing ismaxl with this:
ismaxl :: (Ord a) => [a] -> a -> Bool
ismaxl l = let maxel = maximum l in (== maxel)
...and profiling again:
COST CENTRE MODULE %time %alloc
main Main 60.5 100.0
ismaxl Main 39.5 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main_c5 Main 227 1 0.0 0.0 0.0 0.0
main Main 252 0 0.0 0.0 0.0 0.0
ismaxl Main 253 1 0.0 0.0 0.0 0.0
CAF:main_c3 Main 226 1 0.0 0.0 0.0 0.0
main Main 249 0 0.0 0.0 0.0 0.0
ismaxl Main 250 1 0.0 0.0 0.0 0.0
CAF:main_c2 Main 225 1 0.0 0.0 0.0 0.0
main Main 246 0 0.0 0.0 0.0 0.0
ismaxl Main 247 1 0.0 0.0 0.0 0.0
CAF:main_c1 Main 224 1 0.0 0.0 0.0 0.0
CAF:main_ismax Main 223 1 0.0 0.0 39.5 0.0
main Main 242 0 0.0 0.0 39.5 0.0
ismaxl Main 243 2 39.5 0.0 39.5 0.0
CAF:main8 Main 222 1 0.0 0.0 60.5 100.0
main Main 244 0 60.5 100.0 60.5 100.0
...this time it's spending most of its time in one single call to ismaxl, the others being too fast to even notice, so it must be computing the maximum only once here.
Here's a modified version of your code that will allow you to see whether or not maxel is reused:
import Debug.Trace

ismaxl :: (Ord a) => [a] -> a -> Bool
ismaxl l x = x == maxel
    where maxel = trace "Hello" $ maximum l

main = do
    let mylist = [1, 2, 3, 5]
    let ismax = ismaxl mylist
    --Is each call O(1)? Does each call remember maxel?
    let c1 = ismax 1
    let c2 = ismax 2
    let c3 = ismax 3
    let c5 = ismax 5
    putStrLn (show [c1, c2, c3, c5])
You'll see that maxel is not 'remembered' between applications.
In general, you shouldn't expect Haskell to start doing reductions until all of the arguments have been supplied to a function.
On the other hand, if you have aggressive optimisation turned on then it's hard to predict what a particular compiler would actually do. But you probably ought not to rely on any part of the compiler that's hard to predict when you can easily rewrite the code to make what you want explicit.
Building off other good answers, GHC hasn't been eager to perform this sort of optimization in my experience. If I can't easily make something point-free, I've often resorted to writing with a mix of bound vars on the LHS and a lambda:
ismaxl :: (Ord a) => [a] -> a -> Bool
ismaxl l = \x -> x == maxel
    where maxel = maximum l
I don't particularly like this style, but it does ensure that maxel is shared between calls to a partially applied ismaxl.
I haven't been able to find any such requirement in the Haskell Report, and in fact GHC doesn't seem to perform this optimization by default.
I changed your main function to
main = do
    let mylist = [1..99999]
    let ismax = ismaxl mylist
    let c1 = ismax 1
    let c2 = ismax 2
    let c3 = ismax 3
    let c5 = ismax 5
    putStrLn (show [c1, c2, c3, c5])
Simple profiling shows (on my old Pentium 4):
$ ghc a.hs
$ time ./a.out
[False,False,False,False]
real 0m0.313s
user 0m0.220s
sys 0m0.044s
But when I change the definition of c2, c3 and c5 to let c2 = 2 == 99999 etc. (leaving c1 as it is), I get
$ ghc a.hs
$ time ./a.out
[False,False,False,False]
real 0m0.113s
user 0m0.060s
sys 0m0.028s
