Unexplainably incredible performance with Pythagorean triples in Haskell

Let's say we have this simple Haskell function that produces Pythagorean triples:
pytha :: [(Int, Int, Int)]
pytha = [ (x, y, z)
        | z <- [0..]
        , x <- [1..z]
        , y <- [x..z]
        , x * x + y * y == z * z
        ]
and we'd like to benchmark how long it takes to produce, say, the first 100 triples. So (using the criterion library and assuming import Criterion.Main) we have this benchmark:
main :: IO ()
main = do
  countStr <- readFile "count.txt"
  defaultMain [ bgroup "pytha" [ bench countStr $ nf (`take` pytha) (read countStr) ] ]
where we even read the count from a file to make sure GHC does not try to evaluate pytha at compile time!
Doing echo 100 > count.txt, compiling the benchmark with -O2 and running on my machine (a 4.0 GHz Sandy Bridge CPU) shows some interesting numbers:
time 967.4 ns (957.6 ns .. 979.3 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 979.6 ns (967.9 ns .. 995.6 ns)
std dev 45.34 ns (33.96 ns .. 60.29 ns)
Slightly modifying this program to show how many triples were considered overall (by producing all candidate triples first, zipping the list with [0..], then filtering out the non-Pythagorean ones and looking at the indices of those that remain) shows that almost 900000 triples were considered.
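For concreteness, the counting experiment might look like this (a sketch of my reconstruction, not the original code):

allCandidates :: [(Int, Int, Int)]
allCandidates = [(x, y, z) | z <- [0..], x <- [1..z], y <- [x..z]]

-- Index, among all candidates considered, of the 100th Pythagorean triple.
consideredFor100 :: Int
consideredFor100 =
  fst . last . take 100 $
    [ (i, t)
    | (i, t@(x, y, z)) <- zip [0 ..] allCandidates
    , x * x + y * y == z * z
    ]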
All this naturally raises the question: how does the code above manage to achieve 1000 triples/ns on a single core of a pretty standard CPU? Or is it just that my benchmark is wrong?

You need to use a function rather than a value that will be memoized.
pytha :: Int -> [(Int, Int, Int)]
pytha z_max =
  [ (x, y, z)
  | z <- [0..z_max]
  , x <- [1..z]
  , y <- [x..z]
  , x * x + y * y == z * z
  ]
GHC isn't going to get clever enough to factor this into takeWhile from a constant list, so it should give a meaningful benchmark. Just make sure Criterion is in charge of passing z_max, which you can reasonably set to maxBound :: Int or some such.
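For instance, a harness along these lines (my sketch; the bench name and the count are arbitrary) lets Criterion supply the argument on every run, so the result cannot be cached between iterations:

main :: IO ()
main = defaultMain
  [ bgroup "pytha"
      [ bench "100" $ nf (take 100 . pytha) (maxBound :: Int) ]
  ]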
By the way: you can make your implementation much less slow by using floating point operations to calculate much tighter bounds for y.
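The answer leaves that as a hint; one way to read it (a sketch of mine, built on the reformulated pytha above): for fixed x and z, y must satisfy y * y == z * z - x * x, so a floating-point square root pins y down to a tiny window instead of the whole range [x..z]:

pythaFast :: Int -> [(Int, Int, Int)]
pythaFast zMax =
  [ (x, y, z)
  | z <- [0 .. zMax]
  , x <- [1 .. z]
  , let d  = z * z - x * x
        y0 = floor (sqrt (fromIntegral d :: Double))
  -- a small window around the estimate absorbs any rounding error
  , y <- [max x (y0 - 1) .. min z (y0 + 1)]
  , y * y == d
  ]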

Related

How do I add parallel computation to this example?

I have an algorithm for sequential calculation of a certain integral on a given segment. I want to use the Control.Parallel library, or rather par :: a -> b -> b, to add parallel computation to this algorithm.
How can I do this?
integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b =
  let
    step     = (b - a) / 1000
    segments = [a + x * step | x <- [0..999]]
    area x   = step * (f x + f (x + step)) / 2
  in sum $ map area segments
From the looks of it, you are trying to approximate the integral of a function f over the region from a to b using the trapezoidal rule. You are right to try to parallelize the code, but there are a couple of issues with the attempt:
First of all, you need a work-stealing scheduler in order to get any benefit, since plain par is unlikely to give you a speedup here.
Secondly, the way it is implemented, each intermediate point f(x) is computed twice, except for the border points f(a) and f(b) (see the sketch after this list).
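To illustrate the second point with plain lists (a sketch of mine, not from this answer): sample f once per grid point into a shared list and pair neighbouring samples, so no interior point is evaluated twice:

integrateShared :: (Double -> Double) -> Double -> Double -> Double
integrateShared f a b =
  let n    = 1000 :: Int
      step = (b - a) / fromIntegral n
      -- n + 1 samples, each computed exactly once
      ys   = [ f (a + fromIntegral i * step) | i <- [0 .. n] ]
  in sum (zipWith (\y0 y1 -> step * (y0 + y1) / 2) ys (tail ys))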
A few months ago I needed this functionality, so I added it to the massiv library: trapezoidRule, which conveniently solves both of the above problems and avoids using lists.
Here is an out-of-the-box solution. It will not automatically parallelize the computation, since only one element of the array is being computed (the function was designed to estimate integrals over many regions):
integrate' :: (Double -> Double) -> Double -> Double -> Double
integrate' f a b = trapezoidRule Seq P (\scale x -> f (scale x)) a d (Sz1 1) n ! 0
  where
    n = 1000
    d = b - a
As a sanity check:
λ> integrate (\x -> x * x) 10 20 -- implementation from the question
2333.3335
λ> integrate' (\x -> x * x) 10 20
2333.3335
Here is a solution that will do the automatic parallelization and will avoid redundant evaluation:
integrateA :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateA n f a b =
  let step       = (b - a) / fromIntegral n
      sz         = size segments - 1
      segments   = computeAs P $ A.map f (enumFromStepN Par a step (Sz (n + 1)))
      area y0 y1 = step * (y0 + y1) / 2
      areas      = A.zipWith area (extract' 0 sz segments) (extract' 1 sz segments)
  in A.sum areas
Thanks to list fusion, your list-based solution performs no allocation, so for simple cases it will be very fast. In the solution above, an array of size n+1 is allocated in order to promote sharing and avoid evaluating the function twice per point. There is also extra cost incurred by scheduling, since forking off threads does not come for free. But in the end, for really expensive functions and very large n, it is possible to get a speedup of roughly 3x on a quad-core processor.
Below are some benchmarks of integrating a Gaussian function with n = 100000:
benchmarking Gaussian1D/list
time 3.657 ms (3.623 ms .. 3.687 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.627 ms (3.604 ms .. 3.658 ms)
std dev 80.50 μs (63.62 μs .. 115.4 μs)
benchmarking Gaussian1D/array Seq
time 3.408 ms (3.304 ms .. 3.523 ms)
0.987 R² (0.979 R² .. 0.994 R²)
mean 3.670 ms (3.578 ms .. 3.839 ms)
std dev 408.0 μs (293.8 μs .. 627.6 μs)
variance introduced by outliers: 69% (severely inflated)
benchmarking Gaussian1D/array Par
time 1.340 ms (1.286 ms .. 1.393 ms)
0.980 R² (0.967 R² .. 0.989 R²)
mean 1.393 ms (1.328 ms .. 1.485 ms)
std dev 263.3 μs (160.1 μs .. 385.6 μs)
variance introduced by outliers: 90% (severely inflated)
Side note: switching to Simpson's rule will give you a better approximation. An implementation is available in massiv ;)
Edit
This is such a fun problem that I decided to see what it would take to implement it without any array allocations. Here is what I came up with:
integrateS :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateS n f a b =
  let step     = (b - a) / fromIntegral n
      segments = A.map f (enumFromStepN Seq (a + step) step (Sz n))
      area y0 y1 = step * (y0 + y1) / 2
      sumWith (acc, y0) y1 =
        let acc' = acc + area y0 y1
        in acc' `seq` (acc', y1)
  in fst $ A.foldlS sumWith (0, f a) segments
The above approach runs in constant memory, since the few arrays that do get created aren't real arrays backed by memory but delayed arrays. With a bit of trickery around the fold accumulator we can share the results, thus avoiding double function evaluation. This results in an astonishing speedup:
benchmarking Gaussian1D/array Seq no-alloc
time 1.788 ms (1.777 ms .. 1.799 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.787 ms (1.781 ms .. 1.795 ms)
std dev 23.85 μs (17.19 μs .. 31.96 μs)
The downside to the above approach is that it is not easily parallelizable, though not impossible. Brace yourself: here is a monstrosity that can run on 8 capabilities (hardcoded; in my case 4 cores with hyperthreading):
-- | Will not produce correct results if `n` is not divisible by 8
integrateN8 :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateN8 n f a b =
  let k = 8
      n' = n `div` k
      step = (b - a) / fromIntegral n
      segments =
        makeArrayR D (ParN (fromIntegral k)) (Sz1 k) $ \i ->
          let start = a + step * fromIntegral n' * fromIntegral i + step
          in (f start, A.map f (enumFromStepN Seq (start + step) step (Sz (n' - 1))))
      area y0 y1 = step * (y0 + y1) / 2
      sumWith (acc, y0) y1 =
        let acc' = acc + area y0 y1
        in acc' `seq` (acc', y1)
      partialResults =
        computeAs U $ A.map (\(y0, arr) -> (y0, A.foldlS sumWith (0, y0) arr)) segments
      combine (acc, y0) (y1, (acci, yn)) =
        let acc' = acc + acci + area y0 y1
        in acc' `seq` (acc', yn)
  in fst $ foldlS combine (0, f a) partialResults
The only real array allocated is the one keeping partialResults, which has a total of 16 Double elements. The speed improvement is not as drastic, but it is nevertheless there:
benchmarking Gaussian1D/array Par no-alloc
time 960.1 μs (914.3 μs .. 1.020 ms)
0.968 R² (0.944 R² .. 0.990 R²)
mean 931.8 μs (900.8 μs .. 976.3 μs)
std dev 129.2 μs (84.20 μs .. 198.8 μs)
variance introduced by outliers: 84% (severely inflated)
My default go-to for any map computation would be parMap from the Strategies API (http://hackage.haskell.org/package/parallel-3.2.2.0/docs/Control-Parallel-Strategies.html#g:7). I'll add an example once I'm at my PC.
Edit:
You'd use parMap in the following way,
module Main where

import Control.Parallel.Strategies

main = putStrLn $ show $ integrate f 1.1 1.2

f :: Double -> Double
f x = x

integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b =
  let
    step     = (b - a) / 1000
    segments = [a + x * step | x <- [0..999]]
    area x   = step * (f x + f (x + step)) / 2
  in sum $ parMap rpar area segments
Then compile with ghc -O2 -threaded -rtsopts Main.hs and run using the RTS -N flag to control the parallelism: ./Main +RTS -N -RTS. -N can take an argument, e.g. -N6 to run on 6 threads, or can be left bare to use all available threads.
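One caveat worth adding (my note, not from the answer above): parMap rpar creates one spark per list element, which is very fine-grained for something as cheap as area; chunking the list usually helps. A sketch under that assumption:

import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- Evaluate the areas in chunks of 100 elements per spark, amortizing the
-- scheduling overhead over many cheap computations.
integrateChunked :: (Double -> Double) -> Double -> Double -> Double
integrateChunked f a b =
  let step     = (b - a) / 1000
      segments = [a + x * step | x <- [0..999]]
      area x   = step * (f x + f (x + step)) / 2
  in sum $ withStrategy (parListChunk 100 rseq) $ map area segments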

Haskell; performance of where clause

I was analyzing the effect of where clauses on performance of Haskell programs.
In Haskell, The Craft of Functional Programming, Thompson, chapter 20.4, I found the following example:
exam1 :: Int -> [Int]
exam1 n = [1 .. n] ++ [1 .. n]

exam2 :: Int -> [Int]
exam2 n = list ++ list
  where list = [1 .. n]
and, I quote,
The time taken to calculate [exam1] will be O(n), and the space used will be O(1), but we will have to calculate the expression [1 .. n] twice.
...
The effect [of exam2] is to compute the list [1 .. n] once, so that we save its value after calculating it in order to be able to use it again.
...
If we save something by referring to it in a where clause, we have to pay the penalty of the space that it occupies.
So I go wild and think that the -O2 flag must handle this and choose the best behavior for me. I benchmark the two examples using Criterion:
import Criterion.Main

exam1 :: Int -> [Int]
exam1 n = [1 .. n] ++ [1 .. n]

exam2 :: Int -> [Int]
exam2 n = list ++ list
  where list = [1 .. n]

m :: Int
m = 1000000

main :: IO ()
main = defaultMain [ bench "exam1" $ nf exam1 m
                   , bench "exam2" $ nf exam2 m
                   ]
I compile with -O2, and find:
benchmarking exam1
time 15.11 ms (15.03 ms .. 15.16 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 15.11 ms (15.08 ms .. 15.14 ms)
std dev 83.20 μs (53.18 μs .. 122.6 μs)
benchmarking exam2
time 76.27 ms (72.84 ms .. 82.75 ms)
0.987 R² (0.963 R² .. 0.997 R²)
mean 74.79 ms (70.20 ms .. 77.70 ms)
std dev 6.204 ms (3.871 ms .. 9.233 ms)
variance introduced by outliers: 26% (moderately inflated)
What a difference! Why would that be? I thought that exam2 should be faster but memory-inefficient (according to the quote above). But no, it is actually much slower (and probably more memory-inefficient too, but that needs to be tested).
Maybe it is slower because [1 .. 1e6] has to be stored in memory, and this takes a lot of time. What do you think?
PS: I found a possibly related question, but not really.
You can inspect GHC Core using -ddump-simpl and observe the optimized code GHC produced. Core is not as readable as Haskell, but usually one can still get an idea of what is going on.
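For example, a typical invocation (my example; the module name PerfList is taken from the dump below, and the optional -dsuppress-all flag trims the annotations if the full dump is too noisy):

ghc -O2 -ddump-simpl PerfList.hs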
For exam2 we get plain boring code:
exam2
  = \ (n_aX5 :: Int) ->
      case n_aX5 of { GHC.Types.I# y_a1lJ ->
      let {
        list_s1nF [Dmd=<S,U>] :: [Int]
        [LclId]
        list_s1nF = GHC.Enum.eftInt 1# y_a1lJ } in
      ++ @ Int list_s1nF list_s1nF
      }
Roughly, this defines list_s1nF as [1..n] (eftInt = enum from to) and calls ++. No inlining happened here. GHC was afraid to inline list_s1nF since it is used twice, and inlining a definition in such a case can be harmful. Indeed, if let x = expensive in x+x is inlined, expensive might get recomputed twice, which is bad. Here GHC trusts the programmer: if they used a let / where, they presumably want that value to be computed only once. Failing to inline list_s1nF prevents further optimization.
So this code allocates list = [1..n] and then copies it, resulting in 1:2:...:n:list, where the tail pointer of the copy is made to point to the original list.
Copying an arbitrary list requires following a pointer chain and allocating cells for the new list, which is intuitively more expensive than [1..n], which only needs to allocate the cells for the new list and keep a counter around.
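For reference, (++) is essentially the following standard definition (renamed here to avoid clashing with the Prelude), which makes the copy explicit:

-- The first list is rebuilt cell by cell; the second is shared as the tail.
(+++) :: [a] -> [a] -> [a]
[]     +++ ys = ys
(x:xs) +++ ys = x : (xs +++ ys)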
By contrast, exam1 is optimized much further: after some minor unboxing
exam1
  = \ (w_s1os :: Int) ->
      case w_s1os of { GHC.Types.I# ww1_s1ov ->
      PerfList.$wexam1 ww1_s1ov
      }
we get to the actual worker function.
PerfList.$wexam1
  = \ (ww_s1ov :: GHC.Prim.Int#) ->
      let {
        n_a1lT :: [Int]
        [LclId]
        n_a1lT = GHC.Enum.eftInt 1# ww_s1ov } in
      case GHC.Prim.># 1# ww_s1ov of {
        __DEFAULT ->
          letrec {
            go_a1lX [Occ=LoopBreaker] :: GHC.Prim.Int# -> [Int]
            [LclId, Arity=1, Str=<L,U>, Unf=OtherCon []]
            go_a1lX
              = \ (x_a1lY :: GHC.Prim.Int#) ->
                  GHC.Types.:
                    @ Int
                    (GHC.Types.I# x_a1lY)
                    (case GHC.Prim.==# x_a1lY ww_s1ov of {
                       __DEFAULT -> go_a1lX (GHC.Prim.+# x_a1lY 1#);
                       1# -> n_a1lT
                     }); } in
          go_a1lX 1#;
        1# -> n_a1lT
      }
Here, the first "enum from to" [1..n] was inlined, and that also triggered the inlining of ++. The resulting recursive function go_a1lX relies only on (:) and basic arithmetic. When the recursion is over, n_a1lT is returned, which is the second "enum from to" [1..n]. This is not inlined, since it would trigger no further optimization.
Here, no list is generated and then copied, so we get better performance.
Note that this also produces optimized code:
exam3 :: Int -> [Int]
exam3 n = list1 ++ list2
  where list1 = [1 .. n]
        list2 = [1 .. n]
as does this, since GHC won't automatically cache the results of functions, so they can be inlined:
exam4 :: Int -> [Int]
exam4 n = list () ++ list ()
  where list () = [1 .. n]

Efficient summation in OCaml

Please note I am almost a complete newbie in OCaml. In order to learn a bit, and test its performance, I tried to implement a module that approximates Pi using the Leibniz series.
My first attempt led to a stack overflow (the actual error, not this site). Knowing from Haskell that this may come from too many "thunks", or promises to compute something, while recursing over the addends, I looked for some way of keeping just the last result while summing with the next. I found the following tail-recursive implementations of sum and map in the notes of an OCaml course, here and here, and expected the compiler to produce an efficient result.
However, the resulting executable, compiled with ocamlopt, is much slower than a C++ version compiled with clang++. Is this code as efficient as possible? Is there some optimization flag I am missing?
My complete code is:
let (--) i j =
  let rec aux n acc =
    if n < i then acc else aux (n-1) (n :: acc)
  in aux j [];;

let sum_list_tr l =
  let rec helper a l = match l with
    | [] -> a
    | h :: t -> helper (a +. h) t
  in helper 0. l

let rec tailmap f l a = match l with
  | [] -> a
  | h :: t -> tailmap f t (f h :: a);;

let rev l =
  let rec helper l a = match l with
    | [] -> a
    | h :: t -> helper t (h :: a)
  in helper l [];;

let efficient_map f l = rev (tailmap f l []);;

let summand n =
  let m = float_of_int n
  in (-1.) ** m /. (2. *. m +. 1.);;

let pi_approx n =
  4. *. sum_list_tr (efficient_map summand (0 -- n));;

let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (pi_approx n);;
Just for reference, here are the measured times on my machine:
❯❯❯ time ocaml/main 10000000
3.14159275359
ocaml/main 10000000 3,33s user 0,30s system 99% cpu 3,625 total
❯❯❯ time cpp/main 10000000
3.14159
cpp/main 10000000 0,17s user 0,00s system 99% cpu 0,174 total
For completeness, let me state that the first helper function, an equivalent to Python's range, comes from this SO thread, and that this is run using OCaml version 4.01.0, installed via MacPorts on a Darwin 13.1.0.
As I noted in a comment, OCaml's floats are boxed, which puts OCaml at a disadvantage compared to Clang.
However, I may be noticing another typical rough edge when trying OCaml after Haskell: if I look at what your program is doing, you are creating a list of stuff, then mapping a function over that list, and finally folding it into a result.
In Haskell, you could more or less expect such a program to be automatically “deforested” at compile-time, so that the resulting generated code was an efficient implementation of the task at hand.
In OCaml, the fact that functions can have side effects, in particular functions passed to higher-order functions such as map and fold, means that it would be much harder for the compiler to deforest automatically. The programmer has to do it by hand.
In other words: stop building huge short-lived data structures such as 0 -- n and (efficient_map summand (0 -- n)). When your program decides to tackle a new summand, make it do all it wants to do with that summand in a single pass. You can see this as an exercise in applying the principles in Wadler's article (again, by hand, because for various reasons the compiler will not do it for you despite your program being pure).
Here are some results:
$ ocamlopt v2.ml
$ time ./a.out 1000000
3.14159165359
real 0m0.020s
user 0m0.013s
sys 0m0.003s
$ ocamlopt v1.ml
$ time ./a.out 1000000
3.14159365359
real 0m0.238s
user 0m0.204s
sys 0m0.029s
v1.ml is your version. v2.ml is what you might consider an idiomatic OCaml version:
let rec q_pi_approx p n acc =
  if n = p
  then acc
  else q_pi_approx (succ p) n (acc +. (summand p))

let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx 0 n 0.));;
(reusing summand from your code)
It might be more accurate to sum from the last terms to the first, instead of from the first to the last. This is orthogonal to your question, but you may consider it as an exercise in modifying a function that has been forcefully made tail-recursive. Besides, the (-1.) ** m expression in summand is mapped by the compiler to a call to the pow() function on the host, and that's a bag of hurt you may want to avoid.
I've also tried several variants; here are my conclusions:
Using arrays
Using recursion
Using an imperative loop
The recursive function is about 30% more efficient than the array implementation. The imperative loop is approximately as efficient as recursion (maybe even a little slower).
Here are my implementations:
Array:
open Core.Std

let pi_approx n =
  let f m = (-1.) ** m /. (2. *. m +. 1.) in
  let qpi = Array.init n ~f:Float.of_int |>
            Array.map ~f |>
            Array.reduce_exn ~f:(+.) in
  qpi *. 4.0
Recursion:
let pi_approx n =
  let rec loop n acc m =
    if m = n
    then acc *. 4.0
    else
      let acc = acc +. (-1.) ** m /. (2. *. m +. 1.) in
      loop n acc (m +. 1.0) in
  let n = float_of_int n in
  loop n 0.0 0.0
This can be further optimized by moving the local function loop outside, so that the compiler can inline it.
Imperative loop:
let pi_approx n =
  let sum = ref 0. in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum := !sum +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. !sum
But in the code above, creating a ref for the sum will incur boxing/unboxing on each step, so we can further optimize this code by using the float_ref trick:
type float_ref = { mutable value : float }

let pi_approx n =
  let sum = {value = 0.} in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum.value <- sum.value +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. sum.value
Scoreboard
for-loop (with float_ref) : 1.0
non-local recursion : 0.89
local recursion : 0.86
Pascal's version : 0.77
for-loop (with float ref) : 0.62
array : 0.47
original : 0.08
Update
I've updated the answer, as I've found a way to get a 40% speedup (or 33% in comparison with @Pascal's answer).
I would like to add that although floats are boxed in OCaml, float arrays are unboxed. Here is a program that builds a float array corresponding to the Leibniz sequence and uses it to approximate π:
open Array

let q_pi_approx n =
  let summand n =
    let m = float_of_int n
    in (-1.) ** m /. (2. *. m +. 1.) in
  let a = Array.init n summand in
  Array.fold_left (+.) 0. a

let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx n));;
Obviously, it is still slower than code that doesn't build any data structure at all. Execution times (the version with the array is the last one):
time ./v1 10000000
3.14159275359
real 0m2.479s
user 0m2.380s
sys 0m0.104s
time ./v2 10000000
3.14159255359
real 0m0.402s
user 0m0.400s
sys 0m0.000s
time ./a 10000000
3.14159255359
real 0m0.453s
user 0m0.432s
sys 0m0.020s

Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

I took a bit of a long break from playing with Haskell, and I'm starting to get back into it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but similarly, slow code will have to be dealt with eventually, and I just can't get rid of my preconceived notions about languages that are so high-level being super slow.
So, in that vein, I started playing with test cases. One of them that I was working on was a naïve, straightforward implementation of the classical 4th-order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y, y(0) = 1, which gives y = e^-t. I wrote a completely straightforward implementation in both Haskell and C (which I'll post in a moment). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.
I realize that it isn't 100% fair to compare the performance of two different languages, and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I were more cognizant of what I was doing, then I could eke a bit more speed out of this particular Haskell implementation.
The Haskell version was compiled with -O2 under GHC 7.6.3 on OS X 10.8.4; the C version was compiled with Clang, and I gave it no flags. The Haskell version averaged around 0.016 seconds when tracked with time, and the C version around 0.006 seconds.
These timings take into account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead. Still, I did some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p, and also looked at the GC stats with +RTS -s. I didn't really understand much of what I saw, but it seemed that my GC wasn't out of control and could probably be reined in a little bit (GC at ~5%, productivity at ~93% user, ~85% total elapsed), and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it, but it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty in my usage of a list, both in the constant consing and in the laziness of dumping the results to stdout.
What exactly am I doing wrong? What library functions or Monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?
RK.hs
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK y' h t0 y0 = y0 : iterateRK y' h t1 y1
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  (putStrLn . show) (take 1000 results)
RK.c
#include <stdio.h>

#define ITERATIONS 1000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
    double k1, k2, k3, k4;
    k1 = f(tn, yn);
    k2 = f((tn + h/2), yn + (h/2 * k1));
    k3 = f((tn + h/2), yn + (h/2 * k2));
    k4 = f(tn + h, yn + h * k3);
    return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
    return -x;
}

int main()
{
    double t0, y0, tn, yn, h, results[ITERATIONS];
    int i;
    h = 1e-3;
    y0 = 1.0;
    t0 = 0.0;
    yn = y0;
    tn = t0; /* initialize tn before its first use */
    for (i = 0; i < ITERATIONS; i++)
    {
        results[i] = yn;
        yn = rk4(expDot, h, tn, yn);
        tn += h;
    }
    for (i = 0; i < ITERATIONS; i++)
    {
        printf("%.10lf", results[i]);
        if (i != ITERATIONS - 1)
            printf(", ");
        else
            printf("\n");
    }
    return 0;
}
Running your program with an increased problem size gives a stack overflow:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
This is probably caused by too much laziness.
Looking at the heap profile broken down by type, you get the following:
(Note: I modified your program as leftaroundabout pointed out)
This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding on to your Double values longer than required. Making them strict solves the issue:
{-# LANGUAGE BangPatterns #-}

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0 : iterateRK y' h t1 y1
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0
With this modification, the new heap profile looks like this:
This looks much better; the memory usage is much lower. -sstderr also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:
%GC time 2.5% (2.9% elapsed)
Now, the Haskell version is still about 40% slower than the C one (using user time):
$ time ./tesths; time ./testc
2.47e-321
./tesths 0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc 0,51s user 0,01s system 95% cpu 0,549 total
Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:
time ./tesths; time ./testc
2.47e-321
./tesths 18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc 16,98s user 0,14s system 98% cpu 17,458 total
This is only a difference of about 9%.
But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:
{-# LANGUAGE BangPatterns #-}

import qualified Data.List.Stream as S

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))

main :: IO ()
main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  print $ S.head $ S.drop (pred 10000000) results
I compiled with:
$ ghc -O2 ./test.hs -o tesths -fllvm
Here are the timings:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc 16,97s user 0,18s system 97% cpu 17,538 total
Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and lose the nice abstraction. Even then, it's only as fast as Haskell:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc 15,88s user 0,02s system 98% cpu 16,175 total
I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings, which are lists of Chars, and this makes output quite slow. Also, Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use the criterion library for that:
import Criterion.Main

-- ...

benchmarkIRK n =
  let y' t y = -y
      h = 1e-3
      y0 = 1.0
      t0 = 0
  in take n (iterateRK y' h t0 y0)

benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK

main = defaultMain
  [ bench "rk" $ nf benchmarkIRK 1000
  , bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
  ]
My measurements show that the actual computation takes something around 27 us, computing and printing takes around 350 us and running the whole program (without criterion) takes around 30 ms. So the actual computation takes just 1/1000 of the whole time and printing it just 1/100.
You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.
The timings of your programs have very little to do with the languages' performance, and everything to do with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡ print) from your Haskell program, and you'll get
$ time RK-hs
1.0
real 0m0.004s
user 0m0.000s
sys 0m0.000s
... which isn't really significant, though – 1000 steps is far too little. With
main :: IO ()
main = do
  let y' t y = -y
      h = 1e-7
      y0 = 1.0
      t0 = 0
      results = iterateRK y' h t0 y0
  print . head $ drop 10000000 results
you get
$ time RK-hs +RTS -K100M
0.36787944117145965
real 0m0.653s
user 0m0.572s
sys 0m0.076s
while the equivalent in C has
$ time RK-c
Segmentation fault (core dumped)
Oh great... but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have
$ time RK-c
0.3678794412
real 0m0.152s
user 0m0.148s
sys 0m0.000s
so this is indeed faster, significantly now, than the Haskell version.
When even C has memory problems storing a whole lot of intermediate results (if you put them on the stack), this is worse in Haskell: each list node has to be heap-allocated separately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.
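For completeness, one more option (my addition, not from this thread): if the intermediate results must be stored, an unboxed vector keeps the Doubles in one flat array instead of one heap cell per list node. A sketch, reusing the rk4 defined earlier:

import qualified Data.Vector.Unboxed as VU

-- Unfold the RK4 step n times into a flat array of Doubles.
rkVector :: Int -> VU.Vector Double
rkVector n = VU.unfoldrN n step (0, 1.0)
  where
    h = 1e-3
    step (t, y) = Just (y, (t + h, rk4 (\_ v -> negate v) h t y))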

What's the way to determine if an Int is a perfect square in Haskell?

I need a simple function
is_square :: Int -> Bool
which determines if an Int N is a perfect square (is there an integer x such that x*x = N).
Of course I can just write something like
is_square n = sq * sq == n
  where sq = floor $ sqrt $ (fromIntegral n::Double)
but it looks terrible! Maybe there is a common simple way to implement such a predicate?
Think of it this way: if you have a positive int n, then you're basically doing a binary search on the range of numbers from 1 .. n to find the first number n' where n' * n' = n.
I don't know Haskell, but this F# should be easy to convert:
let is_perfect_square n =
    let rec binary_search low high =
        let mid = (high + low) / 2
        let midSquare = mid * mid
        if low > high then false
        elif n = midSquare then true
        else if n < midSquare then binary_search low (mid - 1)
        else binary_search (mid + 1) high
    binary_search 1 n
Guaranteed to be O(log n). Easy to modify for perfect cubes and higher powers.
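Since the question asks for Haskell, a direct transliteration of that search might be (my sketch):

isPerfectSquare :: Int -> Bool
isPerfectSquare n = go 1 n
  where
    -- Binary search over [low .. high] for an exact integer square root.
    go low high
      | low > high = False
      | otherwise  =
          let mid       = (low + high) `div` 2
              midSquare = mid * mid
          in case compare n midSquare of
               EQ -> True
               LT -> go low (mid - 1)
               GT -> go (mid + 1) high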
There is a wonderful library for most number theory related problems in Haskell included in the arithmoi package.
Use the Math.NumberTheory.Powers.Squares library.
Specifically the isSquare' function.
is_square :: Int -> Bool
is_square = isSquare' . fromIntegral
The library is optimized and well vetted by people much more dedicated to efficiency than you or I. While it currently doesn't have this kind of shenanigans going on under the hood, it could in the future as the library evolves and gets more optimized. View the source code to understand how it works!
Don't reinvent the wheel, always use a library when available.
I think the code you provided is the fastest that you are going to get:
is_square n = sq * sq == n
  where sq = floor $ sqrt $ (fromIntegral n::Double)
The cost of this code is: one sqrt, one double multiplication, one cast (Double -> Int), and one comparison. You could try to use other computation methods to replace the sqrt and the multiplication with just integer arithmetic and shifts, but chances are it is not going to be faster than one sqrt and one multiplication.
The only place where it might be worth using another method is if the CPU on which you are running does not support floating point arithmetic. In this case the compiler will probably have to implement sqrt and double multiplication in software, and you could get an advantage by optimizing for your specific application.
As pointed out by another answer, there is still a limitation with big integers, but unless you are going to run into those numbers, it is probably better to take advantage of the floating-point hardware support than to write your own algorithm.
In a comment on another answer to this question, you discussed memoization. Keep in mind that this technique helps when your probe patterns exhibit good density. In this case, that would mean testing the same integers over and over. How likely is your code to repeat the same work and thus benefit from caching answers?
You didn't give us an idea of the distribution of your inputs, so consider a quick benchmark that uses the excellent criterion package:
module Main where

import Criterion.Main
import Random

is_square n = sq * sq == n
  where sq = floor $ sqrt $ (fromIntegral n::Double)

is_square_mem =
  let check n = sq * sq == n
        where sq = floor $ sqrt $ (fromIntegral n :: Double)
  in (map check [0..] !!)

main = do
  g <- newStdGen
  let rs = take 10000 $ randomRs (0,1000::Int) g
      direct = map is_square
      memo = map is_square_mem
  defaultMain [ bench "direct" $ whnf direct rs
              , bench "memo" $ whnf memo rs
              ]
This workload may or may not be a fair representative of what you're doing, but as written, the cache miss rate appears too high.
Wikipedia's article on integer square roots has algorithms that can be adapted to suit your needs. Newton's method is nice because it converges quadratically, i.e., you get twice as many correct digits each step.
I would advise you to stay away from Double if the input might be bigger than 2^53, after which not all integers can be exactly represented as Double.
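For illustration, here is the integer-only Newton iteration in Haskell (my sketch of the standard algorithm), which avoids Double entirely and therefore has no 2^53 limit:

-- Integer Newton's method for floor (sqrt n), assuming n >= 0;
-- each step roughly doubles the number of correct digits.
isqrt :: Integer -> Integer
isqrt n
  | n < 2     = n
  | otherwise = go n
  where
    go x =
      let x' = (x + n `div` x) `div` 2
      in if x' < x then go x' else x

isSquareBig :: Integer -> Bool
isSquareBig n = n >= 0 && isqrt n ^ 2 == n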
Oh, today I needed to determine if a number is a perfect cube, and a similar solution was VERY slow.
So, I came up with a pretty clever alternative:
cubes = map (\x -> x*x*x) [1..]
is_cube n = n == (head $ dropWhile (<n) cubes)
Very simple. I think I need to use a tree for faster lookups, but for now I'll try this solution; maybe it will be fast enough for my task. If not, I'll edit the answer with a proper data structure.
Sometimes you shouldn't divide problems into parts that are too small (like checks with is_square):
intersectSorted [] _ = []
intersectSorted _ [] = []
intersectSorted xs (y:ys) | head xs > y = intersectSorted xs ys
intersectSorted (x:xs) ys | head ys > x = intersectSorted xs ys
intersectSorted (x:xs) (y:ys) | x == y = x : intersectSorted xs ys
squares = [x*x | x <- [ 1..]]
weird = [2*x+1 | x <- [ 1..]]
perfectSquareWeird = intersectSorted squares weird
There's a very simple way to test for a perfect square - quite literally, you check if the square root of the number has anything other than zero in its fractional part.
I'm assuming a square root function that returns a floating point, in which case you can do (pseudocode):
func IsSquare(N)
    sq = sqrt(N)
    return (sq modulus 1.0) equals 0.0
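In Haskell, that idea might look like this (my rendering; it inherits Double's precision limits for large inputs, as noted elsewhere in this thread):

-- Check whether sqrt n has a zero fractional part.
isSquareFrac :: Int -> Bool
isSquareFrac n = n >= 0 && snd (properFraction sq :: (Integer, Double)) == 0
  where sq = sqrt (fromIntegral n :: Double)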
It's not particularly pretty or fast, but here's a cast-free, FPA-free version based on Newton's method that works (slowly) for arbitrarily large integers:
import Control.Applicative ((<*>))
import Control.Monad (join)
import Data.Ratio ((%))

isSquare = (==) =<< (^2) . floor . (join g <*> join f) . (%1)
  where
    f n x = (x + n / x) / 2
    g n x y | abs (x - y) > 1 = g n y $ f n y
            | otherwise = y
It could probably be sped up with some additional number theory trickery.
