Performance Improvements in Haskell

Performance Improvements in Haskell - performance

I want to improve my haskell skills of writing really performant code (coming from a C/C++ Background this is important for my ego :D).
So I have written two functions to calculate Pi by the Leibnitz Formula (its not about calculation pi, it was just an example):
calcPiStep k = (if even k then 4.0 else -4.0) / (2.0*fromIntegral k + 1.0)
calcPiN n = foldl (\p x -> p + calcPiStep x) 0.0 [0..n]
calcPiInf = toinf [0..] 0
where
toinf = \(x:xs) l -> do
let curr = l + calcPiStep x
curr:toinf xs curr
calcPiInf constructs a infinite List by recursion. calcPiN with a foldl and a lambda with n iterations.
I have found that calcPiInf is faster than calcPiN AND does not run into a stack overflow for too large numbers.
First question: is this just because of lazy evaluation?
Secondly I wrote a corresponding C++ Program:
using namespace std;
double calcPi(int n){
double pi = 0;
for(size_t i = 0; i < n; i++){
pi += 4.0*(i%2 == 0?1:-1)/(2.0*i + 1.0);
}
return pi;
}
int main(){
std::cout.precision(10);
cout << calcPi(5000000) << endl;
}
Which is far faster than my Haskell Solution. Is it theoretically possible to rewrite my Haskell Code to achieve a similar performance as in C++?

Use foldl' (from Data.List) instead of foldl (and prefer that variant compared to a lazily generated list)
Use explicit type signatures, or you end up with Integer.
Use optimizations (-O2)
The following code takes ~3.599s on my system (GHC 8.0.2, no optimizations)
calcPiStep k = (if even k then 4.0 else -4.0) / (2.0*fromIntegral k + 1.0)
calcPiN n = foldl (\p x -> p + calcPiStep x) 0.0 [0..n]
main = print $ calcPiN 5000000
Using foldl' instead of foldl yields ~1.7s (only ~40% of the original time).
import Data.List
calcPiStep k = (if even k then 4.0 else -4.0) / (2.0*fromIntegral k + 1.0)
calcPiN n = foldl' (\p x -> p + calcPiStep x) 0.0 [0..n]
main = print $ calcPiN 5000000
Using type signatures yields ~0.8s, or another 50% reduction. If we now add optimizations, we end up with 0.066s, which is still around twice as slow as the C++ variant (0.033s on my machine with -O3, gcc), but it's almost there.
Note that we could also have used -O2 immediately to get below a single second, but any improvement before adding -O2 often (but not necessarily!) also leads to an improvement afterwards.
Here are all times depending on whether type signatures, foldl' or optimization flags were used. Note that type signatures together with -O2 already bring us to close to C++'s speed. However, that behaviour might not hold in general, and we need to change some functions depending on the lazyness:
Type annotations
foldl'
-O2
Runtime [s]
yes
no
yes
0.063
yes
yes
yes
0.063
no
yes
yes
0.180
no
no
yes
0.190
yes
yes
no
0.825
no
yes
no
1.700
yes
no
no
2.477
no
no
no
3.599

Related

Haskell State monad vs state as parameter performance test

I start to learn a State Monad and one idea bother me. Instead of passing accumulator as parameter, we can wrap everything to the state monad.
So I wanted to compare performance between using State monad vs passing it as parameter.
So I created two functions:
sum1 :: Int -> [Int] -> Int
sum1 x [] = x
sum1 x (y:xs) = sum1 (x + y) xs
and
sumState:: [Int] -> Int
sumState xs = execState (traverse f xs) 0
where f n = modify (n+)
I compared them on the input array [1..1000000000].
sumState running time was around 15s
sum1 around 5s
We can see clear winner, but the I realised that sumState can be optimised as:
We can use strict version of modify
We do not need necessary the map list output, so we can use traverse_ instead
So the new optimised state function is:
sumState:: [Int] -> Int
sumState xs = execState (traverse_ f xs) 0
where f n = modify' (n+)
which has running time around 350ms. This is a huge improvement. It was shocking.
Why the modified sumState has better performance then sum1? Can sum1 be optimised to match or even be better then sumState?
I also tried other different implementation of sum as
using built in sum function, which gives me around 240ms ((sum [1..x] ::Int))
using strict foldl', which gives me the same result around 240ms (with implicit [Int] -> Int)
Does it actually mean that it is better to use foldl function or State monad to pass accumulator instead of passing it as argument to the function?
Thank you for help.
EDIT:
Each function was in separate file with own main function and compiled with "-O2" flag.
main = do
x <- (read . head ) <$> getArgs
print $ <particular sum function> [1..x]
Runtime was measured via time command on linux.

To give a bit more explanation as to why traverse is slower: traverse f xs has has type State [()] and that [()] (list of unit tuples) is built up during the summation. This prevents further optimizations and would cause a memory leak if you were not using lazy state.
Update: I think GHC should have been able to notice that that list of unit tuples is never used, so I opened a GHC issue.
In both cases, To get the best performance we want to combine (or fuse) the summation with the enumeration [1..x] into a tight recursive loop which simply increments and adds until it reaches x. The resulting code would look something like this:
sumFromTo :: Int -> Int -> Int -> Int
sumFromTo s x y
| x == y = s + x
| otherwise = sumFromTo (s + x) (x + 1) y
This avoids allocations for the list [1..x].
The base library achieves this optimization using foldr/build fusion, also known as short cut fusion. The sum, foldl' and traverse (for lists) functions are implemented using the foldr function and [1..x] is implemented using the build function. The foldr and build function have special optimization rules so that they can be fused. Your custom sum1 function doesn't use foldr and so it can never be fused with [1..x] in this way.

Ironically, the same problem that plagued your implementation of sumState is also the problem with sum1. You don't have strict accumulation, so you build up thunks like so:
sum 0 [1, 2, 3]
sum (0 + 1) [2, 3]
sum ((0 + 1) + 2) [3]
sum (((0 + 1) + 2) + 3) []
(((0 + 1) + 2) + 3)
((1 + 2) + 3)
(3 + 3)
6
If you add strictness to sum1, you should see a dramatic improvement in efficiency because you eliminate the non-tail-recursive evaluation of the thunk (((0 + 1) + 2) + 3), which is the costly part of sum1. Using strict accumulation makes this much more efficient:
sum1 x [] = []
sum1 x (y : xs) = x `seq` sum1 (x + y) xs
should give you comparable performance to sum (although as noted in another answer, GHC may not be able to use fusion properly to give you the truly magical performance of sum on the list [1..x]).

Efficient summation in OCaml

Please note I am almost a complete newbie in OCaml. In order to learn a bit, and test its performance, I tried to implement a module that approximates Pi using the Leibniz series.
My first attempt led to a stack overflow (the actual error, not this site). Knowing from Haskell that this may come from too many "thunks", or promises to compute something, while recursing over the addends, I looked for some way of keeping just the last result while summing with the next. I found the following tail-recursive implementations of sum and map in the notes of an OCaml course, here and here, and expected the compiler to produce an efficient result.
However, the resulting executable, compiled with ocamlopt, is much slower than a C++ version compiled with clang++. Is this code as efficient as possible? Is there some optimization flag I am missing?
My complete code is:
let (--) i j =
let rec aux n acc =
if n < i then acc else aux (n-1) (n :: acc)
in aux j [];;
let sum_list_tr l =
let rec helper a l = match l with
| [] -> a
| h :: t -> helper (a +. h) t
in helper 0. l
let rec tailmap f l a = match l with
| [] -> a
| h :: t -> tailmap f t (f h :: a);;
let rev l =
let rec helper l a = match l with
| [] -> a
| h :: t -> helper t (h :: a)
in helper l [];;
let efficient_map f l = rev (tailmap f l []);;
let summand n =
let m = float_of_int n
in (-1.) ** m /. (2. *. m +. 1.);;
let pi_approx n =
4. *. sum_list_tr (efficient_map summand (0 -- n));;
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (pi_approx n);;
Just for reference, here are the measured times on my machine:
❯❯❯ time ocaml/main 10000000
3.14159275359
ocaml/main 10000000 3,33s user 0,30s system 99% cpu 3,625 total
❯❯❯ time cpp/main 10000000
3.14159
cpp/main 10000000 0,17s user 0,00s system 99% cpu 0,174 total
For completeness, let me state that the first helper function, an equivalent to Python's range, comes from this SO thread, and that this is run using OCaml version 4.01.0, installed via MacPorts on a Darwin 13.1.0.

As I noted in a comment, OCaml's float are boxed, which puts OCaml to a disadvantage compared to Clang.
However, I may be noticing another typical rough edge trying OCaml after Haskell:
if I see what your program is doing, you are creating a list of stuff, to then map a function on that list and finally fold it into a result.
In Haskell, you could more or less expect such a program to be automatically “deforested” at compile-time, so that the resulting generated code was an efficient implementation of the task at hand.
In OCaml, the fact that functions can have side-effects, and in particular functions passed to high-order functions such as map and fold, means that it would be much harder for the compiler to deforest automatically. The programmer has to do it by hand.
In other words: stop building huge short-lived data structures such as 0 -- n and (efficient_map summand (0 -- n)). When your program decides to tackle a new summand, make it do all it wants to do with that summand in a single pass. You can see this as an exercise in applying the principles in Wadler's article (again, by hand, because for various reasons the compiler will not do it for you despite your program being pure).
Here are some results:
$ ocamlopt v2.ml
$ time ./a.out 1000000
3.14159165359
real 0m0.020s
user 0m0.013s
sys 0m0.003s
$ ocamlopt v1.ml
$ time ./a.out 1000000
3.14159365359
real 0m0.238s
user 0m0.204s
sys 0m0.029s
v1.ml is your version. v2.ml is what you might consider an idiomatic OCaml version:
let rec q_pi_approx p n acc =
if n = p
then acc
else q_pi_approx (succ p) n (acc +. (summand p))
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx 0 n 0.));;
(reusing summand from your code)
It might be more accurate to sum from the last terms to the first, instead of from the first to the last. This is orthogonal to your question, but you may consider it as an exercise in modifying a function that has been forcefully made tail-recursive. Besides, the (-1.) ** m expression in summand is mapped by the compiler to a call to the pow() function on the host, and that's a bag of hurt you may want to avoid.

I've also tried several variants, here are my conclusions:
Using arrays
Using recursion
Using imperative loop
Recursive function is about 30% more effective than array implementation. Imperative loop is approximately as much effective as a recursion (maybe even little slower).
Here're my implementations:
Array:
open Core.Std
let pi_approx n =
let f m = (-1.) ** m /. (2. *. m +. 1.) in
let qpi = Array.init n ~f:Float.of_int |>
Array.map ~f |>
Array.reduce_exn ~f:(+.) in
qpi *. 4.0
Recursion:
let pi_approx n =
let rec loop n acc m =
if m = n
then acc *. 4.0
else
let acc = acc +. (-1.) ** m /. (2. *. m +. 1.) in
loop n acc (m +. 1.0) in
let n = float_of_int n in
loop n 0.0 0.0
This can be further optimized, by moving local function loop outside, so that compiler can inline it.
Imperative loop:
let pi_approx n =
let sum = ref 0. in
for m = 0 to n -1 do
let m = float_of_int m in
sum := !sum +. (-1.) ** m /. (2. *. m +. 1.)
done;
4.0 *. !sum
But, in the code above creating a ref to the sum will incur boxing/unboxing on each step, that we can further optimize this code by using float_ref trick:
type float_ref = { mutable value : float}
let pi_approx n =
let sum = {value = 0.} in
for m = 0 to n - 1 do
let m = float_of_int m in
sum.value <- sum.value +. (-1.) ** m /. (2. *. m +. 1.)
done;
4.0 *. sum.value
Scoreboard
for-loop (with float_ref) : 1.0
non-local recursion : 0.89
local recursion : 0.86
Pascal's version : 0.77
for-loop (with float ref) : 0.62
array : 0.47
original : 0.08
Update
I've updated the answer, as I've found a way to give 40% speedup (or 33% in comparison with #Pascal's answer.

I would like to add that although floats are boxed in OCaml, float arrays are unboxed. Here is a program that builds a float array corresponding to the Leibnitz sequence and uses it to approximate π:
open Array
let q_pi_approx n =
let summand n =
let m = float_of_int n
in (-1.) ** m /. (2. *. m +. 1.) in
let a = Array.init n summand in
Array.fold_left (+.) 0. a
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx n));;
Obviously, it is still slower than a code that doesn't build any data structure at all. Execution times (the version with array is the last one):
time ./v1 10000000
3.14159275359
real 0m2.479s
user 0m2.380s
sys 0m0.104s
time ./v2 10000000
3.14159255359
real 0m0.402s
user 0m0.400s
sys 0m0.000s
time ./a 10000000
3.14159255359
real 0m0.453s
user 0m0.432s
sys 0m0.020s

Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

I took a bit of a long break from playing with Haskell, and I'm starting to get back in to it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but similarly slow code will have to be dealt with eventually and the I just can't get rid of my preconceived notions about languages that are so high-level being super slow.
So, in that vein, I started playing with test cases. One of them that I was working on was a naïve, straight-forward implementation of the classical 4th Order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y; y(0) = 1, which gives y = e^-t. I wrote a completely straight forward implementation in both Haskell and C (which I'll post in a moment). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.
I realize that it isn't 100% fair to compare the performance of 2 different languages; and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I was more cognizant of what I was doing then I could eek a bit more speed out of this particular Haskell implementation.
The Haskell version was compiled with -02 under GHC 7.6.3 on OS X 10.8.4, the C version was compiled with Clang and I gave it no flags. The Haskell version averaged around 0.016 seconds when tracked with time, and the C version around 0.006 seconds.
These timings take in to account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead, but I did do some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p and also looking at the GC stats with +RTS -s. I didn't really understand all that much of what I saw, but it seemed to be that my GC wasn't out of control though could probably get reined in a little bit (5%, Productivity at ~93% User, ~85% total elapsed) and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it but it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty in my usage of a List, both in the constant consing and the laziness in dumping the results to stdout.
What exactly am I doing wrong? What library functions or Monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?
RK.hs
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
where k1 = y' t y
k2 = y' (t + h/2) (y + ((h/2) * k1))
k3 = y' (t + h/2) (y + ((h/2) * k2))
k4 = y' (t + h) (y + (h * k3))
iterateRK y' h t0 y0 = y0:(iterateRK y' h t1 y1)
where t1 = t0 + h
y1 = rk4 y' h t0 y0
main = do
let y' t y = -y
let h = 1e-3
let y0 = 1.0
let t0 = 0
let results = iterateRK y' h t0 y0
(putStrLn . show) (take 1000 results)
RK.c
#include<stdio.h>
#define ITERATIONS 1000
double rk4(double f(double t, double x), double h, double tn, double yn)
{
double k1, k2, k3, k4;
k1 = f(tn, yn);
k2 = f((tn + h/2), yn + (h/2 * k1));
k3 = f((tn + h/2), yn + (h/2 * k2));
k4 = f(tn + h, yn + h * k3);
return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}
double expDot(double t, double x)
{
return -x;
}
int main()
{
double t0, y0, tn, yn, h, results[ITERATIONS];
int i;
h = 1e-3;
y0 = 1.0;
t0 = 0.0;
yn = y0;
for(i = 0; i < ITERATIONS; i++)
{
results[i] = yn;
yn = rk4(expDot, h, tn, yn);
tn += h;
}
for(i = 0; i < ITERATIONS; i++)
{
printf("%.10lf", results[i]);
if(i != ITERATIONS - 1)
printf(", ");
else
printf("\n");
}
return 0;
}

Using your program with increased size gives a stack overflow:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
This is probably caused by too much laziness.
Looking at the heap profile broken down by type, you get the following:
(Note: I modified your program as leftaroundabout pointed out)
This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding your Double values longer than required. Makeing the strict solves the issue:
{-# LANGUAGE BangPatterns #-}
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
where t1 = t0 + h
y1 = rk4 y' h t0 y0
With this modification, the new heap profile looks like this:
This looks much better, the memory usage is much lower. -sstderr` also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:
%GC time 2.5% (2.9% elapsed)
Now, the haskell version is still about 40% slower than the C one (using user time):
$ time ./tesths; time ./testc
2.47e-321
./tesths 0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc 0,51s user 0,01s system 95% cpu 0,549 total
Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:
time ./tesths; time ./testc
2.47e-321
./tesths 18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc 16,98s user 0,14s system 98% cpu 17,458 total
This is only a difference of about 9%.
But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:
{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
where k1 = y' t y
k2 = y' (t + h/2) (y + ((h/2) * k1))
k3 = y' (t + h/2) (y + ((h/2) * k2))
k4 = y' (t + h) (y + (h * k3))
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))
main :: IO ()
main = do
let y' t y = -y
let h = 1e-3
let y0 = 1.0
let t0 = 0
let results = iterateRK y' h t0 y0
print $ S.head $ (S.drop (pred 10000000) results)
I comiled with:
$ ghc -O2 ./test.hs -o tesths -fllvm
Here are the timings:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc 16,97s user 0,18s system 97% cpu 17,538 total
Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and loose the nice abstraction. Even then, it's only as fast as haskell:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc 15,88s user 0,02s system 98% cpu 16,175 total

I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings which are lists of Chars and this makes output quite slow. Also Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use criterion library for that:
import Criterion.Main
-- ...
benchmarkIRK n =
let y' t y = -y
h = 1e-3
y0 = 1.0
t0 = 0
in take n (iterateRK y' h t0 y0)
benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK
main = defaultMain
[ bench "rk" $ nf benchmarkIRK 1000
, bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
]
My measurements show that the actual computation takes something around 27 us, computing and printing takes around 350 us and running the whole program (without criterion) takes around 30 ms. So the actual computation takes just 1/1000 of the whole time and printing it just 1/100.
You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.

The timings of your programs have very little to do with the languages' performance, and everything with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡≡ print) from your Haskell program, and you'll get
$ time RK-hs
1.0
real 0m0.004s
user 0m0.000s
sys 0m0.000s
... which isn't really significant, though – 1000 steps is far to little. With
main :: IO ()
main = do
let y' t y = -y
h = 1e-7
y0 = 1.0
t0 = 0
results = iterateRK y' h t0 y0
print . head $ drop 10000000 results
you get
$ time RK-hs +RTS -K100M
0.36787944117145965
real 0m0.653s
user 0m0.572s
sys 0m0.076s
while the equivalent in C has
$ time RK-c
Segmentation fault (core dumped)
oh great... ...but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have
$ time RK-c
0.3678794412
real 0m0.152s
user 0m0.148s
sys 0m0.000s
so this is indeed faster, significantly now, than the Haskell version.
When even C has memory problems storing a whole lot of intermediate results (if you put it on the stack), this is worse in Haskell: each list node has to be heap-allocated seperately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
-- pascal n returns the nth entry on the main diagonal of pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n = p ! (n,n) where
p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
f :: (Int,Int) -> Int
f (_,0) = 1
f (0,_) = 1
f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST Monad, which I'm sure could be made to run the program only slighter slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array

Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)
pascal :: Int -> Int
pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
go !i !prevRow
| i <= n = go (i+1) (scanl f 1 (tail prevRow))
| otherwise = prevRow ! n
f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.

1 Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem, the values are written, and read directly without requiring further evaluation. As it would happen with an STUArray. Does
p = runSTUArray $ do
arr <- newArray ((0,0),(n,n)) 1
forM_ [1 .. n] $ \i ->
forM_ [1 .. n] $ \j -> do
a <- readArray arr (i,j-1)
b <- readArray arr (i-1,j)
writeArray arr (i,j) $! (a+b) `rem` 1000000
return arr
really look so bad?
2 Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)
gets the Haskell versions on par with C.

The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
import qualified Data.Vector.Unboxed as V
combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000
pascal n = V.last $ go n where
go 0 = V.replicate (n+1) 1
go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
pascal' :: Integer -> Integer
pascal :: Int -> Int
pascal' n = product [n+1..n*2] `div` product [2..n]
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral

What's the way to determine if an Int is a perfect square in Haskell?

I need a simple function
is_square :: Int -> Bool
which determines if an Int N a perfect square (is there an integer x such that x*x = N).
Of course I can just write something like
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
but it looks terrible! Maybe there is a common simple way to implement such a predicate?

Think of it this way, if you have a positive int n, then you're basically doing a binary search on the range of numbers from 1 .. n to find the first number n' where n' * n' = n.
I don't know Haskell, but this F# should be easy to convert:
let is_perfect_square n =
let rec binary_search low high =
let mid = (high + low) / 2
let midSquare = mid * mid
if low > high then false
elif n = midSquare then true
else if n < midSquare then binary_search low (mid - 1)
else binary_search (mid + 1) high
binary_search 1 n
Guaranteed to be O(log n). Easy to modify perfect cubes and higher powers.

There is a wonderful library for most number theory related problems in Haskell included in the arithmoi package.
Use the Math.NumberTheory.Powers.Squares library.
Specifically the isSquare' function.
is_square :: Int -> Bool
is_square = isSquare' . fromIntegral
The library is optimized and well vetted by people much more dedicated to efficiency then you or I. While it currently doesn't have this kind of shenanigans going on under the hood, it could in the future as the library evolves and gets more optimized. View the source code to understand how it works!
Don't reinvent the wheel, always use a library when available.

I think the code you provided is the fastest that you are going to get:
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
The complexity of this code is: one sqrt, one double multiplication, one cast (dbl->int), and one comparison. You could try to use other computation methods to replace the sqrt and the multiplication with just integer arithmetic and shifts, but chances are it is not going to be faster than one sqrt and one multiplication.
The only place where it might be worth using another method is if the CPU on which you are running does not support floating point arithmetic. In this case the compiler will probably have to generate sqrt and double multiplication in software, and you could get advantage in optimizing for your specific application.
As pointed out by other answer, there is still a limitation of big integers, but unless you are going to run into those numbers, it is probably better to take advantage of the floating point hardware support than writing your own algorithm.

In a comment on another answer to this question, you discussed memoization. Keep in mind that this technique helps when your probe patterns exhibit good density. In this case, that would mean testing the same integers over and over. How likely is your code to repeat the same work and thus benefit from caching answers?
You didn't give us an idea of the distribution of your inputs, so consider a quick benchmark that uses the excellent criterion package:
module Main
where
import Criterion.Main
import Random
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
is_square_mem =
let check n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n :: Double)
in (map check [0..] !!)
main = do
g <- newStdGen
let rs = take 10000 $ randomRs (0,1000::Int) g
direct = map is_square
memo = map is_square_mem
defaultMain [ bench "direct" $ whnf direct rs
, bench "memo" $ whnf memo rs
]
This workload may or may not be a fair representative of what you're doing, but as written, the cache miss rate appears too high:

Wikipedia's article on Integer Square Roots has algorithms can be adapted to suit your needs. Newton's method is nice because it converges quadratically, i.e., you get twice as many correct digits each step.
I would advise you to stay away from Double if the input might be bigger than 2^53, after which not all integers can be exactly represented as Double.

Oh, today I needed to determine if a number is perfect cube, and similar solution was VERY slow.
So, I came up with a pretty clever alternative
cubes = map (\x -> x*x*x) [1..]
is_cube n = n == (head $ dropWhile (<n) cubes)
Very simple. I think, I need to use a tree for faster lookups, but now I'll try this solution, maybe it will be fast enough for my task. If not, I'll edit the answer with proper datastructure

Sometimes you shouldn't divide problems into too small parts (like checks is_square):
intersectSorted [] _ = []
intersectSorted _ [] = []
intersectSorted xs (y:ys) | head xs > y = intersectSorted xs ys
intersectSorted (x:xs) ys | head ys > x = intersectSorted xs ys
intersectSorted (x:xs) (y:ys) | x == y = x : intersectSorted xs ys
squares = [x*x | x <- [ 1..]]
weird = [2*x+1 | x <- [ 1..]]
perfectSquareWeird = intersectSorted squares weird

There's a very simple way to test for a perfect square - quite literally, you check if the square root of the number has anything other than zero in the fractional part of it.
I'm assuming a square root function that returns a floating point, in which case you can do (Psuedocode):
func IsSquare(N)
sq = sqrt(N)
return (sq modulus 1.0) equals 0.0

It's not particularly pretty or fast, but here's a cast-free, FPA-free version based on Newton's method that works (slowly) for arbitrarily large integers:
import Control.Applicative ((<*>))
import Control.Monad (join)
import Data.Ratio ((%))
isSquare = (==) =<< (^2) . floor . (join g <*> join f) . (%1)
where
f n x = (x + n / x) / 2
g n x y | abs (x - y) > 1 = g n y $ f n y
| otherwise = y
It could probably be sped up with some additional number theory trickery.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Performance Improvements in Haskell - performance

Related

Haskell State monad vs state as parameter performance test

Efficient summation in OCaml

Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

How does one write efficient Dynamic Programming algorithms in Haskell?

What's the way to determine if an Int is a perfect square in Haskell?

Categories

Resources