Creating image from function in parallel - image

I'm writing a program in Haskell that creates a fractal and writes it to a PNG file. I have a function
f :: Int -> Int -> PixelRGB8
which calculates the color of the pixel with the given image coordinates. (The output color format, PixelRGB8, is not important; I can easily change it to, say, an RGB tuple or anything else.)
Using Codec.Picture, I can write
writePng "test.png" $ generateImage f width height
which indeed writes the desired image file. However, it works very slowly and I can see that my CPU load is low. I want to use parallel computation, since the computation of each pixel value does not depend on its neighbors. As far as I can see, Codec.Picture does not provide any means to do this. I understand how parMap works, but I can't see a way to apply it here. I think one possible solution is to use repa (e.g. repa-devil), but I'm a bit lost in its multidimensional array notation, which looks like overkill in my case. So, the question is: how do I construct an image file from a given function using parallel computation?
UPDATE. Here's the complete code (the function 'extract' is omitted because it's long and called only once):
import Data.Complex
import System.IO
import Data.List.Split
import Codec.Picture
eval:: (Floating a) => [a] -> a -> a
eval [p] _ = p
eval (p:ps) z = p * z ** (fromIntegral (length ps) ) + (eval ps z)
type Comp = Complex Double
-- func, der, z, iter
convergesOrNot:: (Comp -> Comp) -> (Comp -> Comp) -> Comp->Int -> Int
convergesOrNot _ _ _ 0 = 0
convergesOrNot f d z iter | realPart (abs (f z) ) < 1e-6 = 1
| otherwise = convergesOrNot f d (z - (f z)/(d z)) (iter-1)
-- x, y, f,d, xMin, xMin, stepX, stepY
getPixel:: Int->Int->(Comp->Comp)->(Comp->Comp)->Double->Double->Double->Double->PixelRGB8
getPixel x y f d xMin yMin stepX stepY | convergesOrNot f d z 16 == 1 = PixelRGB8 255 255 255
| otherwise = PixelRGB8 0 0 0
where
real = xMin + (fromIntegral x)*stepX
imag = yMin + (fromIntegral y)*stepY
z = real :+ imag;
data Params = Params{f :: [Comp],
d :: [Comp],
xMin::Double,
yMin::Double,
stepX::Double,
stepY::Double,
width::Int,
height::Int
} deriving (Show)
getPixelParams:: Int->Int->Params->PixelRGB8
getPixelParams x y params = getPixel x y func derv (xMin params) (yMin params) (stepX params) (stepY params)
where
func = \z -> eval (f params) z
derv = \z -> eval (d params) z
main = do
handle <- openFile "config.txt" ReadMode
config <- hGetContents handle
let params = extract config
writePng "test.png" $ generateImage (\x y -> getPixelParams x y params) (width params) (height params)
hClose handle
The profiling shows that most of the time is spent in the eval function. The result (the .prof file) is as follows (only the top part of the file is shown; the rest is a bunch of zeroes):
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 91 0 0.0 0.0 100.0 100.0
main Main 183 0 0.0 0.0 99.9 100.0
main.\ Main 244 0 0.0 0.0 0.0 0.0
getPixelParams Main 245 0 0.0 0.0 0.0 0.0
getPixelParams.derv Main 269 1 0.0 0.0 0.0 0.0
getPixelParams.func Main 246 1 0.0 0.0 0.0 0.0
generateImage Codec.Picture.Types 199 1 0.0 0.0 99.8 99.9
generateImage.generated Codec.Picture.Types 234 1 0.0 0.0 99.8 99.9
generateImage.generated.lineGenerator Codec.Picture.Types 238 257 0.0 0.0 99.8 99.9
generateImage.generated.lineGenerator.column Codec.Picture.Types 239 65792 0.5 0.8 99.8 99.9
unsafeWritePixel Codec.Picture.Types 275 65536 0.0 0.0 0.0 0.0
main.\ Main 240 65536 0.1 0.0 99.2 99.1
getPixelParams Main 241 65536 0.7 0.0 99.1 99.1
getPixelParams.derv Main 270 0 0.2 0.0 19.3 18.5
getPixelParams.derv.\ Main 271 463922 0.2 0.0 19.2 18.5
eval Main 272 1391766 18.9 18.5 18.9 18.5
getPixelParams.func Main 247 0 0.5 0.0 62.3 59.0
getPixelParams.func.\ Main 248 993380 0.4 0.0 61.8 59.0
eval Main 249 3973520 61.4 59.0 61.4 59.0
getPixel Main 242 65536 0.2 0.0 16.7 21.5
getPixel.imag Main 262 256 0.0 0.0 0.0 0.0
getPixel.z Main 261 65536 0.1 0.1 0.1 0.1
getPixel.real Main 251 65536 0.2 0.1 0.2 0.1
convergesOrNot Main 243 531889 16.3 21.3 16.3 21.3
UPDATE 2. After a number of changes from @Cirdec and @Jedai, the code looks like this:
import Data.Complex
import System.IO
import Data.List.Split
import qualified Data.List as DL
import Codec.Picture
import Codec.Picture.Types
import Control.Parallel
import Data.Array
import Control.Parallel.Strategies
import GHC.Conc (numCapabilities)
class Ix a => Partitionable a where
partition :: Int -> (a, a) -> [(a, a)]
default partition :: (Num a) => Int -> (a, a) -> [(a, a)]
partition n r@(l,_) = zipWith (\x y -> (x, x+y-1)) starts steps
where
(span, longerSpans) = rangeSize r `quotRem` n
steps = zipWith (+) (replicate (min (rangeSize r) n) (fromIntegral span)) (replicate longerSpans 1 ++ repeat 0)
starts = scanl (+) l steps
instance Partitionable Int
instance (Partitionable a, Partitionable b) => Partitionable (a, b) where
partition n ((x0,y0), (x1, y1)) = do
xr'@(x0', x1') <- partition n (x0, x1)
let n' = n * rangeSize xr' `div` rangeSize (x0, x1)
(y0', y1') <- partition n' (y0, y1)
return ((x0', y0'), (x1', y1'))
mkArrayPar :: (Partitionable i) => Int -> Strategy e -> (i, i) -> (i -> e) -> Array i e
mkArrayPar n s bounds f = listArray bounds (concat workUnits)
where
partitions = partition n bounds
workUnits = parMap (evalList s) (map f . range) partitions
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = mkArrayPar numCapabilities rseq bounds (uncurry f)
f' = curry (pixels !)
--
-- Newton
--
eval:: (Floating a) => [a] -> a -> a
eval cs z = DL.foldl1' (\acc c -> acc * z + c) cs
diff:: (Floating a) => [a] -> [a]
diff [p] = []
diff (p:ps) = [(fromIntegral (length ps) )*p] ++ diff ps
type Comp = Complex Double
convergesOrNot:: (Comp -> Comp) -> (Comp -> Comp) -> Comp->Int -> Int
convergesOrNot _ _ _ 0 = 0
convergesOrNot f d z iter | realPart (abs (f z) ) < 1e-6 = 1
| otherwise = convergesOrNot f d (z - (f z)/(d z)) (iter-1)
-- x, y, f,d, xMin, xMin, stepX, stepY
getPixel:: Int->Int->(Comp->Comp)->(Comp->Comp)->Double->Double->Double->Double->PixelRGB8
getPixel x y f d xMin yMin stepX stepY | convergesOrNot f d z 16 == 1 = PixelRGB8 255 255 255
| otherwise = PixelRGB8 0 0 0
where
real = xMin + (fromIntegral x)*stepX
imag = yMin + (fromIntegral y)*stepY
z = real :+ imag;
data Params = Params{f :: [Comp],
d :: [Comp],
xMin::Double,
yMin::Double,
stepX::Double,
stepY::Double,
width::Int,
height::Int
} deriving (Show)
extract:: String -> Params
extract config = Params poly deriv xMin yMin stepX stepY width height
where
lines = splitOn "\n" config
wh = splitOn " " (lines !! 0)
width = read (wh !! 0) :: Int
height = read (wh !! 1) :: Int
bottomLeft = splitOn " " (lines !! 1)
upperRight = splitOn " " (lines !! 2)
xMin = read $ (bottomLeft !! 0) :: Double
yMin = read $ (bottomLeft !! 1) :: Double
xMax = read $ (upperRight !! 0) :: Double
yMax = read $ (upperRight !! 1) :: Double
stepX = (xMax - xMin)/(fromIntegral width)
stepY = (yMax - yMin)/(fromIntegral height)
poly = map (\x -> (read x :: Double) :+ 0) (splitOn " " (lines !! 3))
deriv = diff poly
getPixelParams:: Int->Int->Params->PixelRGB8
getPixelParams x y params = getPixel x y func derv (xMin params) (yMin params) (stepX params) (stepY params)
where
func = \z -> eval (f params) z
derv = \z -> eval (d params) z
main = do
handle <- openFile "config.txt" ReadMode
config <- hGetContents handle
let params = extract config
writePng "test.png" $ generateImagePar (\x y -> getPixelParams x y params) (width params) (height params)
hClose handle
I compile it with
ghc -O2 -threaded -rtsopts -XDefaultSignatures -XExistentialQuantification partNewton.hs -o newton
and run it with ./newton +RTS -N. But when I run it on this config
2048 2048
-1 -1
1 1
1 0 0 1
it results in this error:
Stack space overflow: current size 8388608 bytes.

You can calculate the pixels in parallel before generating the image. To make the pixel lookup for generateImage simple, we'll stuff all of the pixels into an Array.
{-# LANGUAGE RankNTypes #-}
import Data.Array
import Control.Parallel.Strategies
To generate the image in parallel, we'll calculate the pixels in parallel for each point within the range of the bounds of the image. We'll build a temporary Array to hold all the pixels. The array's lookup operator, !, will provide an efficient lookup function to pass to generateImage.
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = parMap rseq (uncurry f) (range bounds)
pixelArray = listArray bounds pixels
f' = curry (pixelArray !)
We can then write your example in terms of generateImagePar.
writePng "test.png" $ generateImagePar f width height
This may be no faster and may in fact be slower than using generateImage. It's important to profile your code to understand why it is slow before attempting to improve its performance. For example, if your program is memory starved or is thrashing resources, using generateImagePar will certainly be slower than using generateImage.
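For reference, a typical way to produce the kind of cost-centre profile shown in the question is to build with profiling enabled and run with the -p RTS flag (the file names below are placeholders):
ghc -O2 -prof -fprof-auto -rtsopts program.hs -o program
./program +RTS -p
When the program is also built with -threaded, running with +RTS -N -s prints spark statistics, which is a quick way to check whether the work handed to parMap is actually being converted into parallel evaluation.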
Partitioning the work
We can partition the work into chunks to reduce the number of sparks without resorting to any sort of mutable data structure. First we'll define the class of indexes whose ranges can be divided into partitions. We'll define a default for dividing up numeric ranges.
class Ix a => Partitionable a where
partition :: Int -> (a, a) -> [(a, a)]
default partition :: (Num a) => Int -> (a, a) -> [(a, a)]
partition n r@(l,_) = zipWith (\x y -> (x, x+y-1)) starts steps
where
(span, longerSpans) = rangeSize r `quotRem` n
steps = zipWith (+) (replicate (min (rangeSize r) n) (fromIntegral span)) (replicate longerSpans 1 ++ repeat 0)
starts = scanl (+) l steps
Ints (and any other Num) can be made Partitionable using the default implementation.
instance Partitionable Int
Index products can be partitioned by first partitioning the first dimension, and then partitioning the second dimension if there aren't enough possible divisions in the first dimension.
instance (Partitionable a, Partitionable b) => Partitionable (a, b) where
partition n ((x0,y0), (x1, y1)) = do
xr'@(x0', x1') <- partition n (x0, x1)
let n' = n * rangeSize xr' `div` rangeSize (x0, x1)
(y0', y1') <- partition n' (y0, y1)
return ((x0', y0'), (x1', y1'))
We can build an array in parallel by partitioning the work into units and sparking each work unit.
mkArrayPar :: (Partitionable i) => Int -> Strategy e -> (i, i) -> (i -> e) -> Array i e
mkArrayPar n s bounds f = listArray bounds (concat workUnits)
where
partitions = partition n bounds
workUnits = parMap (evalList s) (map f . range) partitions
Now we can define generateImagePar in terms of making an array in parallel. A good number of partitions is a small multiple of the number of actual processors, numCapabilities; we'll start with one partition per processor.
import GHC.Conc (numCapabilities)
generateImagePar :: forall a . Pixel a => (Int -> Int -> a) -> Int -> Int -> Image a
generateImagePar f w h = generateImage f' w h
where
bounds = ((0, 0), (w-1,h-1))
pixels = mkArrayPar numCapabilities rseq bounds (uncurry f)
f' = curry (pixels !)

Related

Competitive programming using Haskell

I am currently trying to refresh my Haskell knowledge by solving some Hackerrank problems.
For example:
https://www.hackerrank.com/challenges/maximum-palindromes/problem
I've already implemented an imperative solution in C++ which got accepted for all test cases. Now I am trying to come up with a pure functional solution in (reasonably idiomatic) Haskell.
My current code is
module Main where
import Control.Monad
import qualified Data.ByteString.Char8 as C
import Data.Bits
import Data.List
import qualified Data.Map.Strict as Map
import qualified Data.IntMap.Strict as IntMap
import Debug.Trace
-- precompute factorials
compFactorials :: Int -> Int -> IntMap.IntMap Int
compFactorials n m = go 0 1 IntMap.empty
where
go a acc map
| a < 0 = map
| a < n = go a' acc' map'
| otherwise = map'
where
map' = IntMap.insert a acc map
a' = a + 1
acc' = (acc * a') `mod` m
-- precompute invs
compInvs :: Int -> Int -> IntMap.IntMap Int -> IntMap.IntMap Int
compInvs n m facts = go 0 IntMap.empty
where
go a map
| a < 0 = map
| a < n = go a' map'
| otherwise = map'
where
map' = IntMap.insert a v map
a' = a + 1
v = (modExp b (m-2) m) `mod` m
b = (IntMap.!) facts a
modExp :: Int -> Int -> Int -> Int
modExp b e m = go b e 1
where
go b e r
| (.&.) e 1 == 1 = go b' e' r'
| e > 0 = go b' e' r
| otherwise = r
where
r' = (r * b) `mod` m
b' = (b * b) `mod` m
e' = shift e (-1)
-- precompute frequency table
initFreqMap :: C.ByteString -> Map.Map Char (IntMap.IntMap Int)
initFreqMap inp = go 1 map1 map2 inp
where
map1 = Map.fromList $ zip ['a'..'z'] $ repeat 0
map2 = Map.fromList $ zip ['a'..'z'] $ repeat IntMap.empty
go idx m1 m2 inp
| C.null inp = m2
| otherwise = go (idx+1) m1' m2' $ C.tail inp
where
m1' = Map.update (\v -> Just $ v+1) (C.head inp) m1
m2' = foldl' (\m w -> Map.update (\v -> liftM (\c -> IntMap.insert idx c v) $ Map.lookup w m1') w m)
m2 ['a'..'z']
query :: Int -> Int -> Int -> Map.Map Char (IntMap.IntMap Int)
-> IntMap.IntMap Int -> IntMap.IntMap Int -> Int
query l r m freqMap facts invs
| x > 1 = (x * y) `mod` m
| otherwise = y
where
calcCnt cs = cr - cl
where
cl = IntMap.findWithDefault 0 (l-1) cs
cr = IntMap.findWithDefault 0 r cs
f1 acc cs
| even cnt = acc
| otherwise = acc + 1
where
cnt = calcCnt cs
f2 (acc1,acc2) cs
| cnt < 2 = (acc1 ,acc2)
| otherwise = (acc1',acc2')
where
cnt = calcCnt cs
n = cnt `div` 2
acc1' = acc1 + n
r = choose acc1' n
acc2' = (acc2 * r) `mod` m
-- calc binomial coefficient using Fermat's little theorem
choose n k
| n < k = 0
| otherwise = (f1 * t) `mod` m
where
f1 = (IntMap.!) facts n
i1 = (IntMap.!) invs k
i2 = (IntMap.!) invs (n-k)
t = (i1 * i2) `mod` m
x = Map.foldl' f1 0 freqMap
y = snd $ Map.foldl' f2 (0,1) freqMap
main :: IO()
main = do
inp <- C.getLine
q <- readLn :: IO Int
let modulo = 1000000007
let facts = compFactorials (C.length inp) modulo
let invs = compInvs (C.length inp) modulo facts
let freqMap = initFreqMap inp
forM_ [1..q] $ \_ -> do
line <- getLine
let [s1, s2] = words line
let l = (read s1) :: Int
let r = (read s2) :: Int
let result = query l r modulo freqMap facts invs
putStrLn $ show result
It passes all small and medium test cases, but I am getting timeouts on the large test cases.
The key to solving this problem is to precompute some things once at the beginning and use them to answer the individual queries efficiently.
Now, the main points where I need help are:
The initial profiling shows that the IntMap lookup operation seems to be the main bottleneck. Is there a better alternative to IntMap for memoization? Or should I look at Vector or Array, which I believe would lead to more "ugly" code?
Even in its current state, the code doesn't look nice (by functional standards) and is about as verbose as my C++ solution. Any tips for making it more idiomatic? Other than the IntMap usage for memoization, do you spot any other obvious issues that could lead to performance problems?
And are there any good sources where I can learn how to use Haskell more effectively for competitive programming?
A sample large test case where the current code times out:
input.txt
output.txt
For comparison my C++ solution:
#include <vector>
#include <iostream>
#define MOD 1000000007L
long mod_exp(long b, long e) {
long r = 1;
while (e > 0) {
if ((e & 1) == 1) {
r = (r * b) % MOD;
}
b = (b * b) % MOD;
e >>= 1;
}
return r;
}
long n_choose_k(int n, int k, const std::vector<long> &fact_map, const std::vector<long> &inv_map) {
if (n < k) {
return 0;
}
long l1 = fact_map[n];
long l2 = (inv_map[k] * inv_map[n-k]) % MOD;
return (l1 * l2) % MOD;
}
int main() {
std::string s;
int q;
std::cin >> s >> q;
std::vector<std::vector<long>> freq_map;
std::vector<long> fact_map(s.size()+1);
std::vector<long> inv_map(s.size()+1);
for (int i = 0; i < 26; i++) {
freq_map.emplace_back(std::vector<long>(s.size(), 0));
}
std::vector<long> acc_map(26, 0);
for (int i = 0; i < s.size(); i++) {
acc_map[s[i]-'a']++;
for (int j = 0; j < 26; j++) {
freq_map[j][i] = acc_map[j];
}
}
fact_map[0] = 1;
inv_map[0] = 1;
for (int i = 1; i <= s.size(); i++) {
fact_map[i] = (i * fact_map[i-1]) % MOD;
inv_map[i] = mod_exp(fact_map[i], MOD-2) % MOD;
}
while (q--) {
int l, r;
std::cin >> l >> r;
std::vector<long> x(26, 0);
long t = 0;
long acc = 0;
long result = 1;
for (int i = 0; i < 26; i++) {
auto cnt = freq_map[i][r-1] - (l > 1 ? freq_map[i][l-2] : 0);
if (cnt % 2 != 0) {
t++;
}
long n = cnt / 2;
if (n > 0) {
acc += n;
result *= n_choose_k(acc, n, fact_map, inv_map);
result = result % MOD;
}
}
if (t > 0) {
result *= t;
result = result % MOD;
}
std::cout << result << std::endl;
}
}
UPDATE:
DanielWagner's answer confirmed my suspicion that the main problem in my code was the use of IntMap for memoization. Replacing IntMap with an Array made my code perform similarly to DanielWagner's solution.
module Main where
import Control.Monad
import Data.Array (Array)
import qualified Data.Array as A
import qualified Data.ByteString.Char8 as C
import Data.Bits
import Data.List
import qualified Data.Map.Strict as Map
import Debug.Trace
-- precompute factorials
compFactorials :: Int -> Int -> Array Int Int
compFactorials n m = A.listArray (0,n) $ scanl' f 1 [1..n]
where
f acc a = (acc * a) `mod` m
-- precompute invs
compInvs :: Int -> Int -> Array Int Int -> Array Int Int
compInvs n m facts = A.listArray (0,n) $ map f [0..n]
where
f a = (modExp ((A.!) facts a) (m-2) m) `mod` m
modExp :: Int -> Int -> Int -> Int
modExp b e m = go b e 1
where
go b e r
| (.&.) e 1 == 1 = go b' e' r'
| e > 0 = go b' e' r
| otherwise = r
where
r' = (r * b) `mod` m
b' = (b * b) `mod` m
e' = shift e (-1)
-- precompute frequency table
initFreqMap :: C.ByteString -> Map.Map Char (Array Int Int)
initFreqMap inp = Map.fromList $ map f ['a'..'z']
where
n = C.length inp
f c = (c, A.listArray (0,n) $ scanl' g 0 [0..n-1])
where
g x j
| C.index inp j == c = x+1
| otherwise = x
query :: Int -> Int -> Int -> Map.Map Char (Array Int Int)
-> Array Int Int -> Array Int Int -> Int
query l r m freqMap facts invs
| x > 1 = (x * y) `mod` m
| otherwise = y
where
calcCnt freqMap = cr - cl
where
cl = (A.!) freqMap (l-1)
cr = (A.!) freqMap r
f1 acc cs
| even cnt = acc
| otherwise = acc + 1
where
cnt = calcCnt cs
f2 (acc1,acc2) cs
| cnt < 2 = (acc1 ,acc2)
| otherwise = (acc1',acc2')
where
cnt = calcCnt cs
n = cnt `div` 2
acc1' = acc1 + n
r = choose acc1' n
acc2' = (acc2 * r) `mod` m
-- calc binomial coefficient using Fermat's little theorem
choose n k
| n < k = 0
| otherwise = (f1 * t) `mod` m
where
f1 = (A.!) facts n
i1 = (A.!) invs k
i2 = (A.!) invs (n-k)
t = (i1 * i2) `mod` m
x = Map.foldl' f1 0 freqMap
y = snd $ Map.foldl' f2 (0,1) freqMap
main :: IO()
main = do
inp <- C.getLine
q <- readLn :: IO Int
let modulo = 1000000007
let facts = compFactorials (C.length inp) modulo
let invs = compInvs (C.length inp) modulo facts
let freqMap = initFreqMap inp
replicateM_ q $ do
line <- getLine
let [s1, s2] = words line
let l = (read s1) :: Int
let r = (read s2) :: Int
let result = query l r modulo freqMap facts invs
putStrLn $ show result
I think you've shot yourself in the foot by trying to be too clever. Below I'll show a straightforward implementation of a slightly different algorithm that is about 5x faster than your Haskell code.
Here's the core combinatoric computation. Given a character frequency count for a substring, we can compute the number of maximum-length palindromes this way:
Divide all the frequencies by two, rounding down; call this the div2-frequencies. We'll also want the mod2-frequencies, which is the set of letters for which we had to round down.
Sum the div2-frequencies to get the total length of the palindrome prefix; its factorial gives an overcount of the number of possible prefixes for the palindrome.
Take the product of the factorials of the div2-frequencies. This gives the factor by which we overcounted above.
Take the size of the mod2-frequencies, or choose 1 if there are none. We can extend any of the palindrome prefixes by one of the values in this set, if there are any, so we have to multiply by this size.
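For example (a small case worked by hand, not from the original answer): for a substring with letter frequencies a:3 and b:2, the div2-frequencies are a:1, b:1 and the mod2-frequencies are {a}. The prefix length is 1+1 = 2, so 2! = 2 overcounts the prefixes; dividing by 1!*1! = 1 leaves 2 distinct prefixes, and multiplying by the single odd letter gives 2 maximum-length palindromes, namely "ababa" and "baaab".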
For the overcounting step, it's not super obvious to me whether it would be faster to store precomputed inverses for factorials, and take their product, or whether it's faster to just take the product of all the factorials and do one inverse operation at the very end. I'll do the latter, because it just intuitively seems faster to do one inversion per query than one lookup per repeated letter, but what do I know? Should be easy to test if you want to try to adapt the code yourself.
There's only one other quick insight I had vs. your code, which is that we can cache the frequency counts for prefixes of the input; then computing the frequency count for a substring is just pointwise subtraction of two cached counts. Your precomputation on the input I find to be a bit excessive in comparison.
Without further ado, let's see some code. As usual there's some preamble.
module Main where
import Control.Monad
import Data.Array (Array)
import qualified Data.Array as A
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import Data.Monoid
Like you, I want to do all my computations on cheap Ints and bake in the modular operations where possible. I'll make a newtype to make sure this happens for me.
newtype Mod1000000007 = Mod Int deriving (Eq, Ord)
instance Num Mod1000000007 where
fromInteger = Mod . (`mod` 1000000007) . fromInteger
Mod l + Mod r = Mod ((l+r) `rem` 1000000007)
Mod l * Mod r = Mod ((l*r) `rem` 1000000007)
negate (Mod v) = Mod ((1000000007 - v) `rem` 1000000007)
abs = id
signum = id
instance Integral Mod1000000007 where
toInteger (Mod n) = toInteger n
quotRem a b = (a * b^1000000005, 0)
I baked in the base of 1000000007 in several places, but it's easy to generalize by giving Mod a phantom parameter and making a HasBase class to pick the base. Ask a fresh question if you're not sure how and are interested; I'll be happy to do a more thorough writeup. There's a few more instances for Mod that are basically uninteresting and primarily needed because of Haskell's wacko numeric class hierarchy:
instance Show Mod1000000007 where show (Mod n) = show n
instance Real Mod1000000007 where toRational (Mod n) = toRational n
instance Enum Mod1000000007 where
toEnum = Mod . (`mod` 1000000007)
fromEnum (Mod n) = n
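As an aside, here is a minimal sketch of the phantom-parameter generalization mentioned above, written as a fresh module; HasBase, B1000000007 and the parameterized Mod are hypothetical names, not part of this answer.
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Proxy (Proxy (..))

-- The modulus is chosen by a phantom type parameter via a HasBase class.
class HasBase b where
  base :: Proxy b -> Int

data B1000000007
instance HasBase B1000000007 where
  base _ = 1000000007

newtype Mod b = Mod Int deriving (Eq, Ord, Show)

instance HasBase b => Num (Mod b) where
  fromInteger n  = Mod (fromInteger (n `mod` toInteger (base (Proxy :: Proxy b))))
  Mod l + Mod r  = Mod ((l + r) `rem` base (Proxy :: Proxy b))
  Mod l * Mod r  = Mod ((l * r) `rem` base (Proxy :: Proxy b))
  negate (Mod v) = Mod ((base (Proxy :: Proxy b) - v) `rem` base (Proxy :: Proxy b))
  abs            = id
  signum         = id
The remaining instances generalize the same way, and the concrete type used below becomes a synonym such as type Mod1000000007 = Mod B1000000007.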
Here's the precomputation we want to do for factorials...
type FactMap = Array Int Mod1000000007
factMap :: Int -> FactMap
factMap n = A.listArray (0,n) (scanl (*) 1 [1..])
...and for precomputing frequency maps for each prefix, plus getting a frequency map given a start and end point.
type FreqMap = Map Char Int
freqMaps :: String -> Array Int FreqMap
freqMaps s = go where
go = A.listArray (0, length s)
(M.empty : [M.insertWith (+) c 1 (go A.! i) | (i, c) <- zip [0..] s])
substringFreqMap :: Array Int FreqMap -> Int -> Int -> FreqMap
substringFreqMap maps l r = M.unionWith (-) (maps A.! r) (maps A.! (l-1))
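As an aside, here is a sketch of the precomputed-inverse alternative discussed earlier; invFactMap is a hypothetical helper that reuses the FactMap just defined, computing each inverse with Fermat's little theorem exactly as the quotRem definition above does.
-- Sketch only: per-query code would then take a product of these inverses
-- instead of doing one division at the very end.
invFactMap :: Int -> FactMap -> Array Int Mod1000000007
invFactMap n facts = A.listArray (0, n) [facts A.! i ^ (1000000005 :: Int) | i <- [0 .. n]]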
Implementing the core computation described above is just a few lines of code, now that we have suitable Num and Integral instances for Mod1000000007:
palindromeCount :: FactMap -> FreqMap -> Mod1000000007
palindromeCount facts freqs
= toEnum (max 1 mod2Freqs)
* (facts A.! sum div2Freqs)
`div` product (map (facts A.!) div2Freqs)
where
(div2Freqs, Sum mod2Freqs) = foldMap (\n -> ([n `quot` 2], Sum (n `rem` 2))) freqs
Now we just need a short driver to read stuff and pass it around to the appropriate functions.
main :: IO ()
main = do
inp <- getLine
q <- readLn
let freqs = freqMaps inp
facts = factMap (length inp)
replicateM_ q $ do
[l,r] <- map read . words <$> getLine
print . palindromeCount facts $ substringFreqMap freqs l r
That's it. Notably I made no attempt to be fancy about bitwise operations and didn't do anything fancy with accumulators; everything is in what I would consider idiomatic purely-functional style. The final count is about half as much code that runs about 5x faster.
P.S. Just for fun, I replaced the last line with print (l+r :: Int)... and discovered that about half the time is spent in read. Ouch! Seems there's still plenty of low-hanging fruit if this isn't fast enough yet.

Haskell performance of memoization implement of dynamic programming is poor [duplicate]

I'm trying to memoize the following function:
gridwalk x y
| x == 0 = 1
| y == 0 = 1
| otherwise = (gridwalk (x - 1) y) + (gridwalk x (y - 1))
Looking at this I came up with the following solution:
gw :: (Int -> Int -> Int) -> Int -> Int -> Int
gw f x y
| x == 0 = 1
| y == 0 = 1
| otherwise = (f (x - 1) y) + (f x (y - 1))
gwlist :: [Int]
gwlist = map (\i -> gw fastgw (i `mod` 20) (i `div` 20)) [0..]
fastgw :: Int -> Int -> Int
fastgw x y = gwlist !! (x + y * 20)
Which I then can call like this:
gw fastgw 20 20
Is there an easier, more concise and general way (notice how I had to hardcode the max grid dimensions in the gwlist function in order to convert from 2D to 1D space so I can access the memoizing list) to memoize functions with multiple parameters in Haskell?
You can use a list of lists to memoize the function result for both parameters:
memo :: (Int -> Int -> a) -> [[a]]
memo f = map (\x -> map (f x) [0..]) [0..]
gw :: Int -> Int -> Int
gw 0 _ = 1
gw _ 0 = 1
gw x y = (fastgw (x - 1) y) + (fastgw x (y - 1))
gwstore :: [[Int]]
gwstore = memo gw
fastgw :: Int -> Int -> Int
fastgw x y = gwstore !! x !! y
Use the data-memocombinators package from Hackage. It provides easy-to-use memoization techniques and a brief, simple way to apply them:
import Data.MemoCombinators (memo2,integral)
gridwalk = memo2 integral integral gridwalk' where
gridwalk' x y
| x == 0 = 1
| y == 0 = 1
| otherwise = (gridwalk (x - 1) y) + (gridwalk x (y - 1))
Here is a version using Data.MemoTrie from the MemoTrie package to memoize the function:
import Data.MemoTrie(memo2)
gridwalk :: Int -> Int -> Int
gridwalk = memo2 gw
where
gw 0 _ = 1
gw _ 0 = 1
gw x y = gridwalk (x - 1) y + gridwalk x (y - 1)
If you want maximum generality, you can memoize a memoizing function.
memo :: (Num a, Enum a) => (a -> b) -> [b]
memo f = map f (enumFrom 0)
gwvals = fmap memo (memo gw)
fastgw :: Int -> Int -> Int
fastgw x y = gwvals !! x !! y
This technique will work with functions that have any number of arguments.
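For instance, a three-argument function can be memoized by nesting one more level of memo (a sketch; f3 here is just a stand-in for an expensive function):
f3 :: Int -> Int -> Int -> Int
f3 x y z = x * 100 + y * 10 + z   -- stand-in for something expensive

f3vals :: [[[Int]]]
f3vals = fmap (fmap memo) (fmap memo (memo f3))

fastf3 :: Int -> Int -> Int -> Int
fastf3 x y z = f3vals !! x !! y !! z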
Edit: thanks to Philip K. for pointing out a bug in the original code. Originally memo had a "Bounded" constraint instead of "Num" and began the enumeration at minBound, which would only be valid for natural numbers.
Lists aren't a good data structure for memoizing, though, because they have linear lookup complexity. You might be better off with a Map or IntMap. Or look on Hackage.
Note that this particular code does rely on laziness, so if you wanted to switch to using a Map you would need to take a bounded amount of elements from the list, as in:
gwByMap :: Int -> Int -> Int -> Int -> Int
gwByMap maxX maxY x y = fromMaybe (gw x y) $ M.lookup (x,y) memomap
where
memomap = M.fromList $ concat [[((x',y'),z) | (y',z) <- zip [0..maxY] ys]
| (x',ys) <- zip [0..maxX] gwvals]
fastgw2 :: Int -> Int -> Int
fastgw2 = gwByMap 20 20
I think GHC may be stupid about sharing in this case; you may need to lift out the x and y parameters, like this:
gwByMap maxX maxY = \x y -> fromMaybe (gw x y) $ M.lookup (x,y) memomap

Netwire 5 - A box cannot bounce

I am trying to convert challenge 3 ( https://ocharles.org.uk/blog/posts/2013-08-01-getting-started-with-netwire-and-sdl.html ) from netwire 4.0 to netwire 5.0, using OpenGL. Unfortunately, the box does not bounce. The entire code follows. It seems to me that the velocity function does not work: when the box collides with a wall, it does not bounce but stops. How do I correct my program? Thanks in advance.
{-# LANGUAGE Arrows #-}
import Graphics.Rendering.OpenGL
import Graphics.UI.GLFW
import Data.IORef
import Prelude hiding ((.))
import Control.Monad.Fix (MonadFix)
import Control.Wire
import FRP.Netwire
isKeyDown :: (Enum k, Monoid e) => k -> Wire s e IO a e
isKeyDown k = mkGen_ $ \_ -> do
s <- getKey k
return $ case s of
Press -> Right mempty
Release -> Left mempty
acceleration :: (Monoid e) => Wire s e IO a Double
acceleration = pure ( 0) . isKeyDown (CharKey 'A') . isKeyDown (CharKey 'D')
<|> pure (-0.5) . isKeyDown (CharKey 'A')
<|> pure ( 0.5) . isKeyDown (CharKey 'D')
<|> pure ( 0)
velocity :: (Monad m, HasTime t s, Monoid e) => Wire s e m (Double, Bool) Double
velocity = integralWith bounce 0
where bounce c v
| c = (-v)
| otherwise = v
collided :: (Ord a, Fractional a) => (a, a) -> a -> (a, Bool)
collided (a, b) x
| x < a = (a, True)
| x > b = (b, True)
| otherwise = (x, False)
position' :: (Monad m, HasTime t s) => Wire s e m Double (Double, Bool)
position' = integral 0 >>> (arr $ collided (-0.8, 0.8))
challenge3 :: (HasTime t s) => Wire s () IO a Double
challenge3 = proc _ -> do
rec a <- acceleration -< ()
v <- velocity -< (a, c)
(p, c) <- position' -< v
returnA -< p
s :: Double
s = 0.05
y :: Double
y = 0.0
renderPoint :: (Double, Double) -> IO ()
renderPoint (x, y) = vertex $ Vertex2 (realToFrac x :: GLfloat) (realToFrac y :: GLfloat)
generatePoints :: Double -> Double -> Double -> [(Double, Double)]
generatePoints x y s =
[ (x - s, y - s)
, (x + s, y - s)
, (x + s, y + s)
, (x - s, y + s) ]
runNetwork :: (HasTime t s) => IORef Bool -> Session IO s -> Wire s e IO a Double -> IO ()
runNetwork closedRef session wire = do
pollEvents
closed <- readIORef closedRef
if closed
then return ()
else do
(st , session') <- stepSession session
(wt', wire' ) <- stepWire wire st $ Right undefined
case wt' of
Left _ -> return ()
Right x -> do
clear [ColorBuffer]
renderPrimitive Quads $
mapM_ renderPoint $ generatePoints x y s
swapBuffers
runNetwork closedRef session' wire'
main :: IO ()
main = do
initialize
openWindow (Size 1024 512) [DisplayRGBBits 8 8 8, DisplayAlphaBits 8, DisplayDepthBits 24] Window
closedRef <- newIORef False
windowCloseCallback $= do
writeIORef closedRef True
return True
runNetwork closedRef clockSession_ challenge3
closeWindow
From experience, I think the trick here is that you technically have to bounce a few pixels before the actual collision: if you only detect it when it happens, inertia puts your square a little bit "in" the wall, and so the velocity is constantly reversed, causing your square to be blocked.
ocharles actually nods to this in the blog post:
If this position falls outside the world bounds, we adjust the square (with a small epsilon to stop it getting stuck in the wall) and return the collision information.
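A minimal sketch of that adjustment, applied to the collided function from the question (the epsilon value is arbitrary, and this is not necessarily the exact fix the poster used):
collided :: (Ord a, Fractional a) => (a, a) -> a -> (a, Bool)
collided (a, b) x
  | x < a     = (a + epsilon, True)   -- push the box back just inside the wall
  | x > b     = (b - epsilon, True)
  | otherwise = (x, False)
  where epsilon = 1e-3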
Good luck with Netwire 5; I'm playing with it too, and I'm just beginning to like it. ;)

Implementing Phase Unwrapping Algorithm with Haskell Repa Array

I'm trying to implement a phase unwrapping algorithm for three-phase structured light scanning in Haskell using a Repa array. I want to implement a flood-fill-based unwrapping algorithm recursing outward from the point (width / 2, height / 2). Unfortunately, using that method of recursion, I'm getting an out-of-memory exception. I'm new to Haskell and the Repa library, so I was wondering whether it looks like I'm doing anything glaringly wrong. Any help with this would be greatly appreciated!
Update (#leventov):
I am now considering implementing the following path-following algorithm using mutable arrays in Yarr. (Publication: K. Chen, J. Xi, Y. Yu & J. F. Chicharo, "Fast quality-guided flood-fill phase unwrapping algorithm for three-dimensional fringe pattern profilometry," in Optical Metrology and Inspection for Industrial Applications, 2010, pp. 1-9.)
{-# OPTIONS_GHC -Odph -rtsopts -fno-liberate-case -fllvm -optlo-O3 -XTypeOperators -XNoMonomorphismRestriction #-}
module Scanner where
import Data.Word
import Data.Fixed
import Data.Array.Repa.Eval
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.Repr.Unboxed as U
import qualified Data.Array.Repa.Repr.ForeignPtr as P
import Codec.BMP
import Data.Array.Repa.IO.BMP
import Control.Monad.Identity (runIdentity)
import System.Environment( getArgs )
type ImRead = Either Error Image
type Avg = P.Array R.U R.DIM2 (ImageT, ImageT, ImageT)
type ImageT = (Word8, Word8, Word8)
type PhaseT = (Float, Float, Float)
type WrapT = (Float, Int)
type Image = P.Array R.U R.DIM2 (Word8, Word8, Word8)
type Phase = P.Array R.U R.DIM2 (Float, Float, Float)
type Wrap = P.Array R.U R.DIM2 (Float, Int)
type UWrapT = (Float, Int, [(Int, Int)], String)
type DepthT = (Float, Int, String)
{-# INLINE noise #-}
{-# INLINE zskew #-}
{-# INLINE zscale #-}
{-# INLINE compute #-}
{-# INLINE main #-}
{-# INLINE doMain #-}
{-# INLINE zipImg #-}
{-# INLINE mapWrap #-}
{-# INLINE avgPhase #-}
{-# INLINE doAvg #-}
{-# INLINE doWrap #-}
{-# INLINE doPhase #-}
{-# INLINE isPhase #-}
{-# INLINE diffPhase #-}
{-# INLINE shape #-}
{-# INLINE countM #-}
{-# INLINE inArr #-}
{-# INLINE idx #-}
{-# INLINE getElem #-}
{-# INLINE start #-}
{-# INLINE unwrap #-}
{-# INLINE doUnwrap #-}
{-# INLINE doDepth #-}
{-# INLINE write #-}
noise :: Float
noise = 0.1
zskew :: Float
zskew = 24
zscale :: Float
zscale = 130
compute :: (R.Shape sh, U.Unbox e) => P.Array R.D sh e -> P.Array R.U sh e
compute a = runIdentity (R.computeP a)
main :: IO ()
main = do
commandArguments <- getArgs
case commandArguments of
(file1 : file2 : file3 : _ ) -> do
image1 <- readImageFromBMP file1
image2 <- readImageFromBMP file2
image3 <- readImageFromBMP file3
doMain image1 image2 image3
_ -> putStrLn "Not enough arguments"
doMain :: ImRead -> ImRead -> ImRead -> IO()
doMain (Right i1) (Right i2) (Right i3) = write
where
write = writeFile "out.txt" str
(p, m, d, str) = start $ mapWrap i1 i2 i3
doMain _ _ _ = putStrLn "Error loading image"
zipImg :: Image -> Image -> Image -> Avg
zipImg i1 i2 i3 = U.zip3 i1 i2 i3
mapWrap :: Image -> Image -> Image -> Wrap
mapWrap i1 i2 i3 = compute $ R.map wrap avg
where
wrap = (doWrap . avgPhase)
avg = zipImg i1 i2 i3
avgPhase :: (ImageT, ImageT, ImageT) -> PhaseT
avgPhase (i1, i2, i3) = (doAvg i1, doAvg i2, doAvg i3)
doAvg :: ImageT -> Float
doAvg (r, g, b) = (r1 + g1 + b1) / d1
where
r1 = fromIntegral r
g1 = fromIntegral g
b1 = fromIntegral b
d1 = fromIntegral 765
doWrap :: PhaseT -> WrapT
doWrap (p1, p2, p3) = (wrap, mask)
where
wrap = isPhase $ doPhase (p1, p2, p3)
mask = isNoise $ diffPhase [p1, p2, p3]
doPhase :: PhaseT -> (Float, Float)
doPhase (p1, p2, p3) = (x1, x2)
where
x1 = sqrt 3 * (p1 - p3)
x2 = 2 * p2 - p1 - p3
isPhase :: (Float, Float) -> Float
isPhase (x1, x2) = atan2 x1 x2 / (2 * pi)
diffPhase :: [Float] -> Float
diffPhase phases = maximum phases - minimum phases
isNoise :: Float -> Int
isNoise phase = fromEnum $ phase <= noise
shape :: Wrap -> [Int]
shape wrap = R.listOfShape $ R.extent wrap
countM :: Wrap -> (Float, Int)
countM wrap = R.foldAllS count (0,0) wrap
where count = (\(x, y) (i, j) -> (x, y))
start :: Wrap -> UWrapT
start wrap = unwrap wrap (x, y) (ph, m, [], "")
where
[x0, y0] = shape wrap
x = quot x0 2
y = quot y0 2
(ph, m) = getElem wrap (x0, y0)
inArr :: Wrap -> (Int, Int) -> Bool
inArr wrap (x,y) = x >= 0 && y >= 0 && x < x0 && y < y0
where
[x0, y0] = shape wrap
idx :: (Int, Int) -> (R.Z R.:. Int R.:. Int)
idx (x, y) = (R.Z R.:. x R.:. y)
getElem :: Wrap -> (Int, Int) -> WrapT
getElem wrap (x, y) = wrap R.! idx (x, y)
unwrap :: Wrap -> (Int, Int) -> UWrapT -> UWrapT
unwrap wrap (x, y) (ph, m, done, str) =
if
not $ inArr wrap (x, y) ||
(x, y) `elem` done ||
toEnum m::Bool
then
(ph, m, done, str)
else
up
where
unwrap' = doUnwrap wrap (x, y) (ph, m, done, str)
right = unwrap wrap (x+1, y) unwrap'
left = unwrap wrap (x-1, y) right
down = unwrap wrap (x, y+1) left
up = unwrap wrap (x, y-1) down
doUnwrap :: Wrap -> (Int, Int) -> UWrapT -> UWrapT
doUnwrap wrap (x, y) (ph, m, done, str) = unwrapped
where
unwrapped = (nph, m, (x, y):done, out)
(phase, mask) = getElem wrap (x, y)
rph = fromIntegral $ round ph
off = phase - (ph - rph)
nph = ph + (mod' (off + 0.5) 1) - 0.5
out = doDepth wrap (x, y) (nph, m, str)
doDepth :: Wrap -> (Int, Int) -> DepthT -> String
doDepth wrap (x, y) (ph, m, str) = write (x, ys, d, str)
where
[x0, y0] = shape wrap
ys = y0 - y
ydiff = fromIntegral (y - (quot y0 2))
plane = 0.5 - ydiff / zskew
d = (ph - plane) * zscale
write :: (Int, Int, Float, String) -> String
write (x, y, depth, str) = str ++ vertex
where
vertex = xstr ++ ystr ++ zstr
xstr = show x ++ " "
ystr = show y ++ " "
zstr = show depth ++ "\n"
Sorry for wasting some of your time with my first, misleading advice.
You should use another 2-dimensional array of pixel states (already visited or not) instead of
(x, y) `elem` done
because the latter takes linear time.
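A minimal sketch of that idea, using a mutable STUArray of Bool flags in ST so the visited check is a constant-time array read; floodFrom and its explicit worklist are hypothetical and not tied to the Repa code above:
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

floodFrom :: Int -> Int -> (Int, Int) -> [(Int, Int)]
floodFrom w h start = runST $ do
  visited <- newArray ((0, 0), (w - 1, h - 1)) False :: ST s (STUArray s (Int, Int) Bool)
  let go acc [] = return (reverse acc)
      go acc (p@(x, y) : rest)
        | x < 0 || y < 0 || x >= w || y >= h = go acc rest
        | otherwise = do
            seen <- readArray visited p
            if seen
              then go acc rest
              else do
                writeArray visited p True
                -- real code would also unwrap the phase at p here
                go (p : acc) ((x + 1, y) : (x - 1, y) : (x, y + 1) : (x, y - 1) : rest)
  go [] [start]
Using an explicit worklist instead of deep recursion also sidesteps the stack growth of the original four-way recursive unwrap.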
Examples of solving almost the same task: for repa and vector, and for yarr.
Perhaps you get the out-of-memory exception because you build a string by appending to the end (in the write function) - the worst solution, with linear-time appends and heavy memory consumption. You would do better to aggregate results using cons (:) and write them to the output file at the end, in reverse order. Even better, write the results to another unboxed Vector of (Int, Int, Float) elements (allocate a vector of width*height size, as an upper bound on the possible size).
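A tiny sketch of the first suggestion (renderVertices is a hypothetical helper): collect the (x, y, depth) triples with (:) while unwrapping, then render them once at the end.
type Vertex = (Int, Int, Float)

-- Render in one pass; the reverse undoes the order produced by consing.
renderVertices :: [Vertex] -> String
renderVertices = unlines . map line . reverse
  where
    line (x, y, d) = unwords [show x, show y, show d]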

Optimizing numerical array performance in Haskell

I'm working on a terrain generation algorithm for a MineCraft-like world. Currently, I'm using simplex noise based on the implementation in the paper 'Simplex Noise Demystified' [PDF], since simplex noise is supposed to be faster and to have fewer artifacts than Perlin noise. This looks fairly decent (see image), but so far it's also pretty slow.
Running the noise function 10 times (I need noise with different wavelengths for things like terrain height, temperature, tree location, etc.) with 3 octaves of noise for each block in a chunk (16x16x128 blocks), or about 1 million calls to the noise function in total, takes about 700-800 ms. This is at least an order of magnitude too slow for the purposes of generating terrain with any decent kind of speed, despite the fact that there are no obvious expensive operations in the algorithm (at least to me). Just floor, modulo, some array lookups and basic arithmetic. The algorithm (written in Haskell) is listed below. The SCC comments are for profiling. I've omitted the 2D noise functions, since they work the same way.
g3 :: (Floating a, RealFrac a) => a
g3 = 1/6
{-# INLINE int #-}
int :: (Integral a, Num b) => a -> b
int = fromIntegral
grad3 :: (Floating a, RealFrac a) => V.Vector (a,a,a)
grad3 = V.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),
(1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),
(0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]
{-# INLINE dot3 #-}
dot3 :: Num a => (a, a, a) -> a -> a -> a -> a
dot3 (a,b,c) x y z = a * x + b * y + c * z
{-# INLINE fastFloor #-}
fastFloor :: RealFrac a => a -> Int
fastFloor x = truncate (if x > 0 then x else x - 1)
--Generate a random permutation for use in the noise functions
perm :: Int -> Permutation
perm seed = V.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed
--Generate 3D noise between -0.5 and 0.5
simplex3D :: (Floating a, RealFrac a) => Permutation -> a -> a -> a -> a
simplex3D p x y z = {-# SCC "out" #-} 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where
(i,j,k) = {-# SCC "ijk" #-} (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)
(x0,y0,z0) = {-# SCC "x0-z0" #-} (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3
(i1,j1,k1,i2,j2,k2) = {-# SCC "i1-k2" #-} if x0 >= y0
then if y0 >= z0 then (1,0,0,1,1,0) else
if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)
else if y0 < z0 then (0,0,1,0,1,1) else
if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)
xyz1 = {-# SCC "xyz1" #-} (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3)
xyz2 = {-# SCC "xyz2" #-} (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)
xyz3 = {-# SCC "xyz3" #-} (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3)
(ii,jj,kk) = {-# SCC "iijjkk" #-} (i .&. 255, j .&. 255, k .&. 255)
gi0 = {-# SCC "gi0" #-} mod (p V.! (ii + p V.! (jj + p V.! kk ))) 12
gi1 = {-# SCC "gi1" #-} mod (p V.! (ii + i1 + p V.! (jj + j1 + p V.! (kk + k1)))) 12
gi2 = {-# SCC "gi2" #-} mod (p V.! (ii + i2 + p V.! (jj + j2 + p V.! (kk + k2)))) 12
gi3 = {-# SCC "gi3" #-} mod (p V.! (ii + 1 + p V.! (jj + 1 + p V.! (kk + 1 )))) 12
{-# INLINE n #-}
n gi (x',y',z') = {-# SCC "n" #-} (\a -> if a < 0 then 0 else
a*a*a*a*dot3 (grad3 V.! gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'
harmonic :: (Num a, Fractional a) => Int -> (a -> a) -> a
harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where
f 0 = 0
f o = let r = int $ 2 ^ (o - 1) in noise r / r + f (o - 1)
--Generate harmonic 3D noise between -0.5 and 0.5
harmonicNoise3D :: (RealFrac a, Floating a) => Permutation -> Int -> a -> a -> a -> a -> a
harmonicNoise3D p octaves l x y z = harmonic octaves
(\f -> simplex3D p (x * f / l) (y * f / l) (z * f / l))
For profiling, I used the following code:
q _ = let p = perm 0 in
sum [harmonicNoise3D p 3 l x y z :: Float | l <- [1..10], y <- [0..127], x <- [0..15], z <- [0..15]]
main = do start <- getCurrentTime
print $ q ()
end <- getCurrentTime
print $ diffUTCTime end start
which produces the following information:
COST CENTRE MODULE %time %alloc
simplex3D Main 18.8 21.0
n Main 18.0 19.6
out Main 10.1 9.2
harmonicNoise3D Main 9.8 4.5
harmonic Main 6.4 5.8
int Main 4.0 2.9
gi3 Main 4.0 3.0
xyz2 Main 3.5 5.9
gi1 Main 3.4 3.4
gi0 Main 3.4 2.7
fastFloor Main 3.2 0.6
xyz1 Main 2.9 5.9
ijk Main 2.7 3.5
gi2 Main 2.7 3.3
xyz3 Main 2.6 4.1
iijjkk Main 1.6 2.5
dot3 Main 1.6 0.7
To compare, I also ported the algorithm to C#. Performance there was about 3 to 4 times faster, so I imagine I must be doing something wrong. But even then it's not nearly as fast as I would like. So my question is this: can anyone tell me if there are any ways to speed up my implementation and/or the algorithm in general or does anyone know of a different noise algorithm that has better performance characteristics but a similar look?
Update:
After following some of the suggestions offered below, the code now looks as follows:
module Noise ( Permutation, perm
, noise3D, simplex3D
) where
import Data.Bits
import qualified Data.Vector.Unboxed as UV
import System.Random
import System.Random.Shuffle
type Permutation = UV.Vector Int
g3 :: Double
g3 = 1/6
{-# INLINE int #-}
int :: Int -> Double
int = fromIntegral
grad3 :: UV.Vector (Double, Double, Double)
grad3 = UV.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),
(1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),
(0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]
{-# INLINE dot3 #-}
dot3 :: (Double, Double, Double) -> Double -> Double -> Double -> Double
dot3 (a,b,c) x y z = a * x + b * y + c * z
{-# INLINE fastFloor #-}
fastFloor :: Double -> Int
fastFloor x = truncate (if x > 0 then x else x - 1)
--Generate a random permutation for use in the noise functions
perm :: Int -> Permutation
perm seed = UV.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed
--Generate 3D noise between -0.5 and 0.5
noise3D :: Permutation -> Double -> Double -> Double -> Double
noise3D p x y z = 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where
(i,j,k) = (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)
(x0,y0,z0) = (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3
(i1,j1,k1,i2,j2,k2) = if x0 >= y0
then if y0 >= z0 then (1,0,0,1,1,0) else
if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)
else if y0 < z0 then (0,0,1,0,1,1) else
if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)
xyz1 = (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3)
xyz2 = (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)
xyz3 = (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3)
(ii,jj,kk) = (i .&. 255, j .&. 255, k .&. 255)
gi0 = rem (UV.unsafeIndex p (ii + UV.unsafeIndex p (jj + UV.unsafeIndex p kk ))) 12
gi1 = rem (UV.unsafeIndex p (ii + i1 + UV.unsafeIndex p (jj + j1 + UV.unsafeIndex p (kk + k1)))) 12
gi2 = rem (UV.unsafeIndex p (ii + i2 + UV.unsafeIndex p (jj + j2 + UV.unsafeIndex p (kk + k2)))) 12
gi3 = rem (UV.unsafeIndex p (ii + 1 + UV.unsafeIndex p (jj + 1 + UV.unsafeIndex p (kk + 1 )))) 12
{-# INLINE n #-}
n gi (x',y',z') = (\a -> if a < 0 then 0 else
a*a*a*a*dot3 (UV.unsafeIndex grad3 gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'
harmonic :: Int -> (Double -> Double) -> Double
harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where
f 0 = 0
f o = let r = 2 ^^ (o - 1) in noise r / r + f (o - 1)
--3D simplex noise
--syntax: simplex3D permutation number_of_octaves wavelength x y z
simplex3D :: Permutation -> Int -> Double -> Double -> Double -> Double -> Double
simplex3D p octaves l x y z = harmonic octaves
(\f -> noise3D p (x * f / l) (y * f / l) (z * f / l))
Together with reducing my chunk size to 8x8x128, generating new terrain chunks now occurs at about 10-20 fps, which means moving around is now not nearly as problematic as before. Of course, any other performance improvements are still welcome.
The thing that stands out initially is that your code is highly polymorphic. You should specialize your floating point type uniformly to Double, so GHC (and LLVM) have a chance of applying more aggressive optimizations.
Note, for those trying to reproduce, this code imports:
import qualified Data.Vector as V
import Data.Bits
import Data.Time.Clock
import System.Random
import System.Random.Shuffle
type Permutation = V.Vector Int
OK. There are lots of things you can try to improve this code.
Improvements
Data representation
Specialize to a concrete floating point type, instead of polymorphic floating point functions
Replace the tuple (a,a,a) with an unboxed strict triple T !Double !Double !Double (see the sketch after this list)
Switch from Data.Array to Data.Array.Unboxed for Permutations
Replace use of boxed array of triples with multidimensional unboxed array from repa package
Compiler flags
Compile with -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native (or equivalent with -fllvm)
Increase spec constr threshold -- -fspec-constr-count=16
More efficient library functions
Use mersenne-random instead of StdGen to generate randoms
Replace mod with rem
Replace V.! indexing with unchecked indexing VU.unsafeIndex (after moving to Data.Vector.Unboxed)
Runtime settings
Increase the default allocation area: -A20M or -H
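As a sketch of the strict-triple suggestion in the list above (T is whatever name you like; a boxed Vector T is assumed here, since an unboxed vector would need an Unbox instance):
data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double {-# UNPACK #-} !Double

dot3 :: T -> Double -> Double -> Double -> Double
dot3 (T a b c) x y z = a * x + b * y + c * z

grad3 :: V.Vector T
grad3 = V.fromList [ T 1 1 0, T (-1) 1 0, T 1 (-1) 0, T (-1) (-1) 0
                   , T 1 0 1, T (-1) 0 1, T 1 0 (-1), T (-1) 0 (-1)
                   , T 0 1 1, T 0 (-1) 1, T 0 1 (-1), T 0 (-1) (-1) ]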
Also, check your algorithm is identical to the C# one, and you're using the same data structures.
