Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow? - performance

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where
import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )
prng :: Word -> Word
prng w = w' where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
foldV :: (Unbox a, Unbox b)
=> (a -> b -> a) -- componentwise function to fold
-> Vector a -- initial accumulator value
-> [Vector b] -- data vectors
-> Vector a -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
liftV f acc = fV where
fV v = go 0 where
n = min (U.length v) (UM.length acc)
go i | i < n = step >> go (i + 1)
| otherwise = return ()
where
step = {-# SCC "fV_step" #-} do
a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
{-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b
kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
acc = U.map (\z -> (z, 0.0)) v
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
!y = x - c
!s' = s + y
!c' = (s' - s) - y
{-# NOINLINE kahanStep #-}
zero :: U.Vector Double
zero = U.replicate 100 0.0
myLoop n = kahan $ map mkVect [1..n]
main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks # 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahanStep (double *s, double *c, double x) {
double y, t;
y = x - *c;
t = *s + y;
*c = (t - *s) - y;
*s = t;
}
void kahan(double s[], double c[]) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
kahanStep(&s[j], &c[j], w);
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As
inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed, so I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector.

Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
double fst, snd;
} DPair;
DPair kahanStep(DPair pr, double x);
kahanStep.c:
#include "kahan.h"
DPair kahanStep (DPair pr, double x) {
double y, t;
y = x - pr.snd;
t = pr.fst + y;
pr.snd = (t - pr.fst) - y;
pr.fst = t;
return pr;
}
main.c:
#include <stdint.h>
#include <stdio.h>
#include "kahan.h"
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
DPair pr;
pr.fst = s[j];
pr.snd = c[j];
pr = fun(pr,w);
s[j] = pr.fst;
c[j] = pr.snd;
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err,kahanStep);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
prng :: Word -> Word
prng w = w'
where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
!cj <- unsafeRead c j
!sj <- unsafeRead s j
let !y = fromIntegral w - cj
!t = sj + y
!w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backed of GHC, using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
main :: IO ()
main = print . elems $ runSTUArray calc

There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s It looks like all of the top level functions have this problem except prng and zero

This came up on the mailing lists and I discovered that there's a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version, which works around the bug, is as fast as the C code on my machine:
{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word
import GHC.Exts
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner !w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
| otherwise = return ()
outer (1 :: Int)
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
{- I originally used this function, which isn't quite correct.
We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
Other than working around the Word->Double bug, I've also removed extra lists to match the C version better.

I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one made to the higher-order argument of foldV, can be expensive when done frequently in a loop. In particular, it will inhibit unboxing of the function arguments, leading to increased allocation. The reason it inhibits argument unboxing is that we don't know that the function we're calling is strict in those arguments and thus we pass the arguments as e.g. (Double, Double), instead of as Double# -> Double#.
The compiler can figure out the strictness information if the loop (e.g. foldV) meets the loop body (e.g. kahanStep). For that reason I recommend that people INLINE higher-order functions. In this case, inlining foldV and removing the NOINLINE on kahanStep improves the runtime quite a bit for me.
This doesn't bring the performance on par with C in this case, as there are other things going on (as others have commented on), but it's a step in the right direction (and it's a step you can do without every having to look at profiling output).

Related

Haskell performance tuning

I'm quite new to Haskell, and to learn it better I started solving problems here and there and I ended up with this (project Euler 34).
145 is a curious number, as 1! + 4! + 5! = 1 + 24 + 120 = 145.
Find the sum of all numbers which are equal to the sum of the factorial >of their digits.
Note: as 1! = 1 and 2! = 2 are not sums they are not included.
I wrote a C and an Haskell brute force solution.
Could someone explain me the Haskell version is ~15x (~0.450 s vs ~6.5s )slower than the C implementation and how to possibly tune and speedup the Haskell solution?
unsigned int solve(){
unsigned int result = 0;
unsigned int i=10;
while(i<2540161){
unsigned int sumOfFacts = 0;
unsigned int number = i;
while (number > 0) {
unsigned int d = number % 10;
number /= 10;
sumOfFacts += factorial(d);
}
if (sumOfFacts == i)
result += i;
i++;
}
return result;
}
here the haskell solution
--BRUTE FORCE SOLUTION
solve:: Int
solve = sum (filter (\x-> sfc x 0 == x) [10..2540160])
--sum factorial of digits
sfc :: Int -> Int -> Int
sfc 0 acc = acc
sfc n acc = sfc n' (acc+fc r)
where
n' = div n 10
r = mod n 10 --n-(10*n')
fc 0 =1
fc 1 =1
fc 2 =2
fc 3 =6
fc 4 =24
fc 5 =120
fc 6 =720
fc 7 =5040
fc 8 =40320
fc 9 =362880
First, compile with optimizations. With ghc-7.10.1 -O2 -fllvm, the Haskell version runs in 0.54 secs for me. This is already pretty good.
If we want to do even better, we should first replace div with quot and mod with rem. div and mod do some extra work, because they handle the rounding of negative numbers differently. Since we only have positive numbers here, we should switch to the faster functions.
Second, we should replace the pattern matching in fc with an array lookup. GHC uses a branching construct for Int patterns, and uses binary search when the number of cases is large enough. We can do better here with a lookup.
The new code looks like this:
import qualified Data.Vector.Unboxed as V
facs :: V.Vector Int
facs =
V.fromList [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
--BRUTE FORCE SOLUTION
solve:: Int
solve = sum (filter (\x-> sfc x 0 == x) [10..2540160])
--sum factorial of digits
sfc :: Int -> Int -> Int
sfc 0 acc = acc
sfc n acc = sfc n' (acc + V.unsafeIndex facs r)
where
(n', r) = quotRem n 10
main = print solve
It runs in 0.095 seconds on my computer.

Creating image from function in parallel

I'm writting a parogram in Haskell that creates a fractal and writes to a PNG file. I have a function
f:: Int->Int->PixelRGB8
which calcualtes color of the pixel with given image coordinates. (The output color format, PixelRGB8, is not important, I can easilly change it to, say, RGB tuple or anything).
Using Codec.Picture, I can write
writePng "test.png" $ generateImage f width height
which indeed writes the desired image file. However, it works very slowly and I can see that my CPU load is low. I want to use parallel computations, since the computation of each pixel value does not depend on its neighbors. As far as I can see, Codec.Picture does not provide any means to do it. I understand how parMap works, but I can't see a way to apply it here. I think one possible solution is to use repa.DevIL, but I'm kinda lost in its multidimusional arrays notation which looks like an overkill in my case. So, the question is: how to construct an image file from given function using parallel?
UPDATE. Here's a complete code (function 'extract' is ommited because it's long and called only one time):
import Data.Complex
import System.IO
import Data.List.Split
import Codec.Picture
eval:: (Floating a) => [a] -> a -> a
eval [p] _ = p
eval (p:ps) z = p * z ** (fromIntegral (length ps) ) + (eval ps z)
type Comp = Complex Double
-- func, der, z, iter
convergesOrNot:: (Comp -> Comp) -> (Comp -> Comp) -> Comp->Int -> Int
convergesOrNot _ _ _ 0 = 0
convergesOrNot f d z iter | realPart (abs (f z) ) < 1e-6 = 1
| otherwise = convergesOrNot f d (z - (f z)/(d z)) (iter-1)
-- x, y, f,d, xMin, xMin, stepX, stepY
getPixel:: Int->Int->(Comp->Comp)->(Comp->Comp)->Double->Double->Double->Double->PixelRGB8
getPixel x y f d xMin yMin stepX stepY | convergesOrNot f d z 16 == 1 = PixelRGB8 255 255 255
| otherwise = PixelRGB8 0 0 0
where
real = xMin + (fromIntegral x)*stepX
imag = yMin + (fromIntegral y)*stepY
z = real :+ imag;
data Params = Params{f :: [Comp],
d :: [Comp],
xMin::Double,
yMin::Double,
stepX::Double,
stepY::Double,
width::Int,
height::Int
} deriving (Show)
getPixelParams:: Int->Int->Params->PixelRGB8
getPixelParams x y params = getPixel x y func derv (xMin params) (yMin params) (stepX params) (stepY params)
where
func = \z -> eval (f params) z
derv = \z -> eval (d params) z
main = do
handle <- openFile "config.txt" ReadMode
config <- hGetContents handle
let params = extract config
writePng "test.png" $ generateImage (\x y -> getPixelParams x y params) (width params) (height params)
hClose handle
The profiling shows that most of the time is spent in eval function. Result (the .prof file ) is as follows (it's only the top part of file, the rest is bunch of zeroes):
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 91 0 0.0 0.0 100.0 100.0
main Main 183 0 0.0 0.0 99.9 100.0
main.\ Main 244 0 0.0 0.0 0.0 0.0
getPixelParams Main 245 0 0.0 0.0 0.0 0.0
getPixelParams.derv Main 269 1 0.0 0.0 0.0 0.0
getPixelParams.func Main 246 1 0.0 0.0 0.0 0.0
generateImage Codec.Picture.Types 199 1 0.0 0.0 99.8 99.9
generateImage.generated Codec.Picture.Types 234 1 0.0 0.0 99.8 99.9
generateImage.generated.lineGenerator Codec.Picture.Types 238 257 0.0 0.0 99.8 99.9
generateImage.generated.lineGenerator.column Codec.Picture.Types 239 65792 0.5 0.8 99.8 99.9
unsafeWritePixel Codec.Picture.Types 275 65536 0.0 0.0 0.0 0.0
main.\ Main 240 65536 0.1 0.0 99.2 99.1
getPixelParams Main 241 65536 0.7 0.0 99.1 99.1
getPixelParams.derv Main 270 0 0.2 0.0 19.3 18.5
getPixelParams.derv.\ Main 271 463922 0.2 0.0 19.2 18.5
eval Main 272 1391766 18.9 18.5 18.9 18.5
getPixelParams.func Main 247 0 0.5 0.0 62.3 59.0
getPixelParams.func.\ Main 248 993380 0.4 0.0 61.8 59.0
eval Main 249 3973520 61.4 59.0 61.4 59.0
getPixel Main 242 65536 0.2 0.0 16.7 21.5
getPixel.imag Main 262 256 0.0 0.0 0.0 0.0
getPixel.z Main 261 65536 0.1 0.1 0.1 0.1
getPixel.real Main 251 65536 0.2 0.1 0.2 0.1
convergesOrNot Main 243 531889 16.3 21.3 16.3 21.3
UPDATE 2 After a number of changes from #Cirdec and #Jedai, the code looks like this:
import Data.Complex
import System.IO
import Data.List.Split
import qualified Data.List as DL
import Codec.Picture
import Codec.Picture.Types
import Control.Parallel
import Data.Array
import Control.Parallel.Strategies
import GHC.Conc (numCapabilities)
class Ix a => Partitionable a where
partition :: Int -> (a, a) -> [(a, a)]
default partition :: (Num a) => Int -> (a, a) -> [(a, a)]
partition n r#(l,_) = zipWith (\x y -> (x, x+y-1)) starts steps
where
(span, longerSpans) = rangeSize r `quotRem` n
steps = zipWith (+) (replicate (min (rangeSize r) n) (fromIntegral span)) (replicate longerSpans 1 ++ repeat 0)
starts = scanl (+) l steps
instance Partitionable Int
instance (Partitionable a, Partitionable b) => Partitionable (a, b) where
partition n ((x0,y0), (x1, y1)) = do
xr'#(x0', x1') <- partition n (x0, x1)
let n' = n * rangeSize xr' `div` rangeSize (x0, x1)
(y0', y1') <- partition n' (y0, y1)
return ((x0', y0'), (x1', y1'))
mkArrayPar :: (Partitionable i) => Int -> Strategy e -> (i, i) -> (i -> e) -> Array i e
mkArrayPar n s bounds f = listArray bounds (concat workUnits)
where
partitions = partition n bounds
workUnits = parMap (evalList s) (map f . range) partitions
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = mkArrayPar numCapabilities rseq bounds (uncurry f)
f' = curry (pixels !)
--
-- Newton
--
eval:: (Floating a) => [a] -> a -> a
eval cs z = DL.foldl1' (\acc c -> acc * z + c) cs
diff:: (Floating a) => [a] -> [a]
diff [p] = []
diff (p:ps) = [(fromIntegral (length ps) )*p] ++ diff ps
type Comp = Complex Double
convergesOrNot:: (Comp -> Comp) -> (Comp -> Comp) -> Comp->Int -> Int
convergesOrNot _ _ _ 0 = 0
convergesOrNot f d z iter | realPart (abs (f z) ) < 1e-6 = 1
| otherwise = convergesOrNot f d (z - (f z)/(d z)) (iter-1)
-- x, y, f,d, xMin, xMin, stepX, stepY
getPixel:: Int->Int->(Comp->Comp)->(Comp->Comp)->Double->Double->Double->Double->PixelRGB8
getPixel x y f d xMin yMin stepX stepY | convergesOrNot f d z 16 == 1 = PixelRGB8 255 255 255
| otherwise = PixelRGB8 0 0 0
where
real = xMin + (fromIntegral x)*stepX
imag = yMin + (fromIntegral y)*stepY
z = real :+ imag;
data Params = Params{f :: [Comp],
d :: [Comp],
xMin::Double,
yMin::Double,
stepX::Double,
stepY::Double,
width::Int,
height::Int
} deriving (Show)
extract:: String -> Params
extract config = Params poly deriv xMin yMin stepX stepY width height
where
lines = splitOn "\n" config
wh = splitOn " " (lines !! 0)
width = read (wh !! 0) :: Int
height = read (wh !! 1) :: Int
bottomLeft = splitOn " " (lines !! 1)
upperRight = splitOn " " (lines !! 2)
xMin = read $ (bottomLeft !! 0) :: Double
yMin = read $ (bottomLeft !! 1) :: Double
xMax = read $ (upperRight !! 0) :: Double
yMax = read $ (upperRight !! 1) :: Double
stepX = (xMax - xMin)/(fromIntegral width)
stepY = (yMax - yMin)/(fromIntegral height)
poly = map (\x -> (read x :: Double) :+ 0) (splitOn " " (lines !! 3))
deriv = diff poly
getPixelParams:: Int->Int->Params->PixelRGB8
getPixelParams x y params = getPixel x y func derv (xMin params) (yMin params) (stepX params) (stepY params)
where
func = \z -> eval (f params) z
derv = \z -> eval (d params) z
main = do
handle <- openFile "config.txt" ReadMode
config <- hGetContents handle
let params = extract config
writePng "test.png" $ generateImagePar (\x y -> getPixelParams x y params) (width params) (height params)
hClose handle
I compile it with
ghc O2 -threaded -rtsopts -XDefaultSignatures -XExistentialQuantification partNewton.hs -o newton
and I run it with ./newton +RTS -N. But when I run it on config
2048 2048
-1 -1
1 1
1 0 0 1
it results in error
Stack space overflow: current size 8388608 bytes.
You can calculate the pixels in parallel before generating the image. To make the pixel lookup for generateImage simple, we'll stuff all of the pixels into an Array.
{-# LANGUAGE RankNTypes #-}
import Data.Array
import Control.Parallel.Strategies
To generate the image in parallel, we'll calculate the pixels in parallel for each boint within the range of the bounds of the image. We'll build a temporary Array to hold all the pixels. The array's lookup function, ! will provide an efficient lookup function to pass to generateImage.
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = parMap rseq (uncurry f) (range bounds)
pixelArray = listArray bounds pixels
f' = curry (pixelArray !)
We can then write your example in terms of generateImagePar.
writePng "test.png" $ generateImagePar f width height
This may be no faster and may in fact be slower than using generateImage. It's important to profile your code to understand why it is slow before attempting to improve its performance. For example, if your program is memory starved or is thrashing resources, using generateImagePar will certainly be slower than using generateImage.
Partitioning the work
We can partition the work into chunks to reduce the number of sparks without resorting to any sort of mutable data structure. First we'll define the class of indexes whose ranges can be divided into partitions. We'll define a default for dividing up numeric ranges.
class Ix a => Partitionable a where
partition :: Int -> (a, a) -> [(a, a)]
default partition :: (Num a) => Int -> (a, a) -> [(a, a)]
partition n r#(l,_) = zipWith (\x y -> (x, x+y-1)) starts steps
where
(span, longerSpans) = rangeSize r `quotRem` n
steps = zipWith (+) (replicate (min (rangeSize r) n) (fromIntegral span)) (replicate longerSpans 1 ++ repeat 0)
starts = scanl (+) l steps
Ints (and any other Num) can be made Partitionable using the default implementation.
instance Partitionable Int
Index products can be partitioned by first partitioning the first dimension, and then partitioning the second dimension if there aren't enough possible divisions in the first dimension.
instance (Partitionable a, Partitionable b) => Partitionable (a, b) where
partition n ((x0,y0), (x1, y1)) = do
xr'#(x0', x1') <- partition n (x0, x1)
let n' = n * rangeSize xr' `div` rangeSize (x0, x1)
(y0', y1') <- partition n' (y0, y1)
return ((x0', y0'), (x1', y1'))
We can build an array in parallel by partitioning the work into units and sparking each work unit.
mkArrayPar :: (Partitionable i) => Int -> Strategy e -> (i, i) -> (i -> e) -> Array i e
mkArrayPar n s bounds f = listArray bounds (concat workUnits)
where
partitions = partition n bounds
workUnits = parMap (evalList s) (map f . range) partitions
Now we can define generateImagePar in terms of making an array in parallel. A good number of partitions is a small multiple of the number of actual processors, numCapabilities; we'll start up to 1 partition per processor.
import GHC.Conc (numCapabilities)
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = mkArrayPar numCapabilities rseq bounds (uncurry f)
f' = curry (pixels !)

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out of memory error. I do not understand why this happens (the same strategy in java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
public static double calc(double p) {
double[] u = new double[10001];
double[] s = new double[10001];
u[0] = 0.0;
u[1] = 1.0;
s[0] = 0.0;
s[1] = 0.0;
for (int n=2;n<10001;n++) {
double q = 1.0;
double denom = 1.0;
for (int k = 1; k <= n; k++ ) {
denom = denom * (1.0 - p);
}
denom = 1.0 - denom;
s[n] = (double) n;
u[n] = (double) n;
for (int k = 1; k <= n; k++ ) {
q = (1.0 - p) * q;
if (k<n) {
double qs = (1.0-q)/denom;
double bs = (1.0-qs)*s[n-k] + qs*(s[k]+ u[n-k]) + 1.0;
if (bs < s[n]) {
s[n] = bs;
}
}
double bu = (1.0-q)*s[k] + 1.0 + u[n-k];
if (bu < u[n]) {
u[n] = bu;
}
}
}
return u[10000];
}
public static void main(String[] args) {
double s = 0.0;
int i = 2;
//for (int i = 1; i<51; i++) {
s = s + calc(i*0.01);
//}
System.out.println("result = " + s);
}
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment
main = do
(arg:_) <- getArgs
let n = read arg
print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time the proportion of the time spent in garbage collection and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)
main = do
(arg:_) <- getArgs
let n = read arg
forM_ [1..n] $ \i -> mua i `seq` return ()
print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs) but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order so you would know exactly which computations need to be performed. It would interesting to see your Java code to see which computations it is performing.

Haskell: unnecessary reevaluations of constant expressions

I am going to demonstrate the problem using the following example program
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b
in b * (q - fromIntegral (floor q :: Int))
standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k) $ (Point 0.15 0.25)
where k = (cos (pi/3)) - (sin (pi/3))
Here standardMap k is the parametrized function and k=(cos (pi/3))-(sin (pi/3)) is a parameter. If i compile this program with ghc -O3 -fllvm the execution time on my machine is approximately 42s, however, if I write k in the form 0.5 - (sin (pi/3)) the execution time equals 21s and if I write k = 0.5 - 0.5 * (sqrt 3) it will take only 12s.
The conclusion is that k is reevaluated on each call of standardMap k.
Why this is not optimized?
P.S. compiler ghc 7.6.3 on archlinux
EDIT
For those who are concerned with the weird properties of standardMap here is a simpler and more intuitive example, which exhibits the same problem
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
rotate :: Double -> Point -> Point
rotate k (Point q p) =
Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (rotate k) $ (Point 0.15 0.25)
where --k = (cos (pi/3)) - (sin (pi/3))
k = 0.5 - 0.5 * (sqrt 3)
EDIT
Before I asked the question I have tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with the all possible versions of the program.
So what is the difference between -O3 and -O2 that affects such cases?
Should I prefer -O2 in general?
EDIT
As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?
As always, check the core.
With ghc -O2, k is inlined into the loop body, which is floated out as a top level function:
Main.main7 :: Main.Point -> Main.Point
Main.main7 =
\ (ds_dAa :: Main.Point) ->
case ds_dAa of _ { Main.Point q_alG p_alH ->
case q_alG of _ { GHC.Types.D# x_s1bt ->
case p_alH of _ { GHC.Types.D# y_s1bw ->
case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
of ww_s1bi { __DEFAULT ->
case Main.$wfmod
(GHC.Prim.+##
y_s1bw
(GHC.Prim.*##
(GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976))
(GHC.Prim.sinDouble# x_s1bt)))
6.283185307179586
of ww1_X1bZ { __DEFAULT ->
Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
Indicating that the sin and cos calls aren't evaluated at compile time.
The result is that a bit more math is going to occur:
$ time ./A
3.1430515093368085
real 0m15.590s
If you make it strict, it is at least not recalculated each time:
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k) $ (Point 0.15 0.25)
where
k :: Double
!k = (cos (pi/3)) - (sin (pi/3))
Resulting in:
ipv_sEq =
GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976) } in
And a running time of:
$ time ./A
6.283185307179588
real 0m7.859s
Which I think is good enough for now. I'd also add unpack pragmas to the Point type.
If you want to reason about numeric performance under different code arrangements, you must inspect the Core.
Using your revised example. It suffers the same issue. k is inlined rotate. GHC thinks it is really cheap, when in this benchmark it is more expensive.
Naively, ghc-7.2.3 -O2
$ time ./A
0.1470480616244365
real 0m22.897s
And k is evaluated each time rotate is called.
Make k strict: one way to force it to be not shared.
$ time ./A
0.14704806100839019
real 0m2.360s
Using UNPACK pragmas on the Point constructor:
$ time ./A
0.14704806100839019
real 0m1.860s
I don't think it is repeated evaluation.
First, I switched to "do" notation and used a "let" on the definition of "k" which I figured should help. No - still slow.
Then I added a trace call - just being evaluated once. Even checked that the fast variant was in fact producing a Double.
Then I printed out both variations. There is a small difference in the starting values.
Tweaking the value of the "slow" variant makes it the same speed. I've no idea what your algorithm is for - would it be very sensitive to starting values?
import Debug.Trace (trace)
...
main = do
-- is -0.3660254037844386
let k0 = (0.5 - 0.5 * (sqrt 3))::Double
-- was -0.3660254037844385
let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001;
putStrLn (show k0)
putStrLn (show k1)
putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k1) $ (Point 0.15 0.25)
EDIT: this is the version with numeric literals. It's displaying runtimes of 23sec vs 7sec for me. I compiled two separate versions of the code to make sure I wasn't doing something stupid like not recompiling.
main = do
-- -0.3660254037844386
-- -0.3660254037844385
let k2 = -0.3660254037844385
putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k2) $ (Point 0.15 0.25)
EDIT2: I don't know how to get the opcodes from ghc, but comparing the hexdumps for the two .o files shows they differ by a single byte - presumably the literal. So it can't be the runtime.
EDIT3: Tried turning profiling on, and that's just puzzled me even more. unless I'm missing something the only difference is a small discrepancy in the number of calls to fmod (fmod.q to be precise).
The "5" profile is for the constant ending "5", same with "6".
Fri Sep 6 12:37 2013 Time and Allocation Profiling Report (Final)
constant-timings-5 +RTS -p -RTS
total time = 38.34 secs (38343 ticks # 1000 us, 1 processor)
total alloc = 12,000,105,184 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 71.0 0.0
iterate' Main 21.2 93.3
fmod Main 6.3 6.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.0 0.0
main Main 102 0 1.0 0.0 1.0 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 99.0 100.0
main Main 103 0 0.0 0.0 99.0 100.0
iterate' Main 104 100000001 21.2 93.3 99.0 100.0
standardMap Main 105 100000000 71.0 0.0 77.9 6.7
fmod Main 106 200000001 6.3 6.7 6.9 6.7
fmod.q Main 109 49999750 0.6 0.0 0.6 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Fri Sep 6 12:38 2013 Time and Allocation Profiling Report (Final)
constant-timings-6 +RTS -p -RTS
total time = 22.17 secs (22167 ticks # 1000 us, 1 processor)
total alloc = 11,999,947,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 48.4 0.0
iterate' Main 38.2 93.3
fmod Main 10.9 6.7
main Main 1.4 0.0
fmod.q Main 1.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.4 0.0
main Main 102 0 1.4 0.0 1.4 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 98.6 100.0
main Main 103 0 0.0 0.0 98.6 100.0
iterate' Main 104 100000001 38.2 93.3 98.6 100.0
standardMap Main 105 100000000 48.4 0.0 60.4 6.7
fmod Main 106 200000001 10.9 6.7 12.0 6.7
fmod.q Main 109 49989901 1.0 0.0 1.0 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
EDIT4: Link below is to the two opcode dumps (thanks to #Tom Ellis). Although I can't read them, they seem to have the same "shape". Presumably the long random-char strings are internal identifiers. I've just recompiled both with -O2 -fforce-recomp and the time differences are real.
https://gist.github.com/anonymous/6462797

Optimizing numerical array performance in Haskell

I'm working on a terrain generation algorithm for a MineCraft-like world. Currently, I'm using simplex noise based on the implementation in the paper 'Simplex Noise Demystified' [PDF], since simplex noise is supposed to be faster and to have fewer artifacts than Perlin noise. This looks fairly decent (see image), but so far it's also pretty slow.
Running the noise function 10 times (I need noise with different wavelengths for things like terrain height, temperature, tree location, etc.) with 3 octaves of noise for each block in a chunk (16x16x128 blocks), or about 1 million calls to the noise function in total, takes about 700-800 ms. This is at least an order of magnitude too slow for the purposes of generating terrain with any decent kind of speed, despite the fact that there are no obvious expensive operations in the algorithm (at least to me). Just floor, modulo, some array lookups and basic arithmetic. The algorithm (written in Haskell) is listed below. The SCC comments are for profiling. I've omitted the 2D noise functions, since they work the same way.
g3 :: (Floating a, RealFrac a) => a
g3 = 1/6
{-# INLINE int #-}
int :: (Integral a, Num b) => a -> b
int = fromIntegral
grad3 :: (Floating a, RealFrac a) => V.Vector (a,a,a)
grad3 = V.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),
(1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),
(0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]
{-# INLINE dot3 #-}
dot3 :: Num a => (a, a, a) -> a -> a -> a -> a
dot3 (a,b,c) x y z = a * x + b * y + c * z
{-# INLINE fastFloor #-}
fastFloor :: RealFrac a => a -> Int
fastFloor x = truncate (if x > 0 then x else x - 1)
--Generate a random permutation for use in the noise functions
perm :: Int -> Permutation
perm seed = V.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed
--Generate 3D noise between -0.5 and 0.5
simplex3D :: (Floating a, RealFrac a) => Permutation -> a -> a -> a -> a
simplex3D p x y z = {-# SCC "out" #-} 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where
(i,j,k) = {-# SCC "ijk" #-} (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)
(x0,y0,z0) = {-# SCC "x0-z0" #-} (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3
(i1,j1,k1,i2,j2,k2) = {-# SCC "i1-k2" #-} if x0 >= y0
then if y0 >= z0 then (1,0,0,1,1,0) else
if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)
else if y0 < z0 then (0,0,1,0,1,1) else
if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)
xyz1 = {-# SCC "xyz1" #-} (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3)
xyz2 = {-# SCC "xyz2" #-} (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)
xyz3 = {-# SCC "xyz3" #-} (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3)
(ii,jj,kk) = {-# SCC "iijjkk" #-} (i .&. 255, j .&. 255, k .&. 255)
gi0 = {-# SCC "gi0" #-} mod (p V.! (ii + p V.! (jj + p V.! kk ))) 12
gi1 = {-# SCC "gi1" #-} mod (p V.! (ii + i1 + p V.! (jj + j1 + p V.! (kk + k1)))) 12
gi2 = {-# SCC "gi2" #-} mod (p V.! (ii + i2 + p V.! (jj + j2 + p V.! (kk + k2)))) 12
gi3 = {-# SCC "gi3" #-} mod (p V.! (ii + 1 + p V.! (jj + 1 + p V.! (kk + 1 )))) 12
{-# INLINE n #-}
n gi (x',y',z') = {-# SCC "n" #-} (\a -> if a < 0 then 0 else
a*a*a*a*dot3 (grad3 V.! gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'
harmonic :: (Num a, Fractional a) => Int -> (a -> a) -> a
harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where
f 0 = 0
f o = let r = int $ 2 ^ (o - 1) in noise r / r + f (o - 1)
--Generate harmonic 3D noise between -0.5 and 0.5
harmonicNoise3D :: (RealFrac a, Floating a) => Permutation -> Int -> a -> a -> a -> a -> a
harmonicNoise3D p octaves l x y z = harmonic octaves
(\f -> simplex3D p (x * f / l) (y * f / l) (z * f / l))
For profiling, I used the following code,
q _ = let p = perm 0 in
sum [harmonicNoise3D p 3 l x y z :: Float | l <- [1..10], y <- [0..127], x <- [0..15], z <- [0..15]]
main = do start <- getCurrentTime
print $ q ()
end <- getCurrentTime
print $ diffUTCTime end start
which produces the following information:
COST CENTRE MODULE %time %alloc
simplex3D Main 18.8 21.0
n Main 18.0 19.6
out Main 10.1 9.2
harmonicNoise3D Main 9.8 4.5
harmonic Main 6.4 5.8
int Main 4.0 2.9
gi3 Main 4.0 3.0
xyz2 Main 3.5 5.9
gi1 Main 3.4 3.4
gi0 Main 3.4 2.7
fastFloor Main 3.2 0.6
xyz1 Main 2.9 5.9
ijk Main 2.7 3.5
gi2 Main 2.7 3.3
xyz3 Main 2.6 4.1
iijjkk Main 1.6 2.5
dot3 Main 1.6 0.7
To compare, I also ported the algorithm to C#. Performance there was about 3 to 4 times faster, so I imagine I must be doing something wrong. But even then it's not nearly as fast as I would like. So my question is this: can anyone tell me if there are any ways to speed up my implementation and/or the algorithm in general or does anyone know of a different noise algorithm that has better performance characteristics but a similar look?
Update:
After following some of the suggestions offered below, the code now looks as follows:
module Noise ( Permutation, perm
, noise3D, simplex3D
) where
import Data.Bits
import qualified Data.Vector.Unboxed as UV
import System.Random
import System.Random.Shuffle
type Permutation = UV.Vector Int
g3 :: Double
g3 = 1/6
{-# INLINE int #-}
int :: Int -> Double
int = fromIntegral
grad3 :: UV.Vector (Double, Double, Double)
grad3 = UV.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),
(1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),
(0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]
{-# INLINE dot3 #-}
dot3 :: (Double, Double, Double) -> Double -> Double -> Double -> Double
dot3 (a,b,c) x y z = a * x + b * y + c * z
{-# INLINE fastFloor #-}
fastFloor :: Double -> Int
fastFloor x = truncate (if x > 0 then x else x - 1)
--Generate a random permutation for use in the noise functions
perm :: Int -> Permutation
perm seed = UV.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed
--Generate 3D noise between -0.5 and 0.5
noise3D :: Permutation -> Double -> Double -> Double -> Double
noise3D p x y z = 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where
(i,j,k) = (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)
(x0,y0,z0) = (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3
(i1,j1,k1,i2,j2,k2) = if x0 >= y0
then if y0 >= z0 then (1,0,0,1,1,0) else
if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)
else if y0 < z0 then (0,0,1,0,1,1) else
if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)
xyz1 = (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3)
xyz2 = (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)
xyz3 = (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3)
(ii,jj,kk) = (i .&. 255, j .&. 255, k .&. 255)
gi0 = rem (UV.unsafeIndex p (ii + UV.unsafeIndex p (jj + UV.unsafeIndex p kk ))) 12
gi1 = rem (UV.unsafeIndex p (ii + i1 + UV.unsafeIndex p (jj + j1 + UV.unsafeIndex p (kk + k1)))) 12
gi2 = rem (UV.unsafeIndex p (ii + i2 + UV.unsafeIndex p (jj + j2 + UV.unsafeIndex p (kk + k2)))) 12
gi3 = rem (UV.unsafeIndex p (ii + 1 + UV.unsafeIndex p (jj + 1 + UV.unsafeIndex p (kk + 1 )))) 12
{-# INLINE n #-}
n gi (x',y',z') = (\a -> if a < 0 then 0 else
a*a*a*a*dot3 (UV.unsafeIndex grad3 gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'
harmonic :: Int -> (Double -> Double) -> Double
harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where
f 0 = 0
f o = let r = 2 ^^ (o - 1) in noise r / r + f (o - 1)
--3D simplex noise
--syntax: simplex3D permutation number_of_octaves wavelength x y z
simplex3D :: Permutation -> Int -> Double -> Double -> Double -> Double -> Double
simplex3D p octaves l x y z = harmonic octaves
(\f -> noise3D p (x * f / l) (y * f / l) (z * f / l))
Together with reducing my chunk size to 8x8x128, generating new terrain chunks now occurs at about 10-20 fps, which means moving around is now not nearly as problematic as before. Of course, any other performance improvements are still welcome.
The thing that stands out initially is that your code is highly polymorphic. You should specialize your floating point type uniformly to Double, so GHC (and LLVM) have a chance of applying more aggressive optimizations.
Note, for those trying to reproduce, this code imports:
import qualified Data.Vector as V
import Data.Bits
import Data.Time.Clock
import System.Random
import System.Random.Shuffle
type Permutation = V.Vector Int
Ok. There's lots of things you can try to improve this code.
Improvements
Data representation
Specialize to a concrete floating point type, instead of polymorphic floating point functions
Replace tuple (a,a,a) with unboxed triple T !Double !Double !Double
Switch from Data.Array to Data.Array.Unboxed for Permutations
Replace use of boxed array of triples with multidimensional unboxed array from repa package
Compiler flags
Compile with -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native (or equivalent with -fllvm)
Increase spec constr threshold -- -fspec-constr-count=16
More efficient library functions
Use mersenne-random instead of StdGen to generate randoms
Replace mod with rem
Replace V.! indexing with unchecked indexing VU.unsafeIndex (after moving to Data.Vector.Unboxed
Runtime settings
Increase the default allocation area: -A20M or -H
Also, check your algorithm is identical to the C# one, and you're using the same data structures.

Resources