How do I add parallel computation to this example? - algorithm

I have an algorithm for synchronous calculation of a certain integral on a given segment. I want to use the Control.Parallel library, or rather par :: a -> b -> b to add parallel computing to this algorithm.
How can I do this?
integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b =
let
step = (b - a) / 1000
segments = [a + x * step | x <- [0..999]]
area x = step * (f x + f (x + step)) / 2
in sum $ map area segments

From the looks of it, you are trying to approximate the integral of a function f over the interval from a to b using the trapezoidal rule. You are right to try to parallelize the code, but there are a couple of issues with the attempt:
First of all, you need a work-stealing scheduler in order to get any benefit, since par is unlikely to give you a speedup.
Secondly, the way it is implemented, each intermediate point f(x) is computed twice; only the border points f(a) and f(b) are computed once.
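The double evaluation in the second point can be avoided even without a library by regrouping the trapezoid sum so that every interior grid point is evaluated once: step * ((f a + f b) / 2 + sum of the interior values). A minimal sequential sketch of that regrouping follows; the name integrateOnce and the explicit segment count n are illustrative and not part of the question.

```haskell
-- Trapezoidal rule that evaluates f exactly once per grid point.
-- `integrateOnce` and its `n` parameter are illustrative names.
integrateOnce :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateOnce n f a b =
  let step = (b - a) / fromIntegral n
      -- interior grid points: a + step, ..., a + (n - 1) * step
      interior = sum [f (a + fromIntegral i * step) | i <- [1 .. n - 1]]
  in step * ((f a + f b) / 2 + interior)

main :: IO ()
main = print (integrateOnce 1000 (\x -> x * x) 10 20)
```

This only removes the redundant work; it is still sequential.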
A few months ago I needed this functionality, so I added it to the massiv library: trapezoidRule, which conveniently solves both of the above problems and avoids the use of lists.
Here is an out-of-the-box solution, but it will not automatically parallelize the computation, since only one element of the array is being computed (it was designed to estimate integrals over many regions):
integrate' :: (Double -> Double) -> Double -> Double -> Double
integrate' f a b = trapezoidRule Seq P (\scale x -> f (scale x)) a d (Sz1 1) n ! 0
where
n = 1000
d = b - a
As a sanity check:
λ> integrate (\x -> x * x) 10 20 -- implementation from the question
2333.3335
λ> integrate' (\x -> x * x) 10 20
2333.3335
Here is a solution that will do the automatic parallelization and will avoid redundant evaluation:
integrateA :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateA n f a b =
let step = (b - a) / fromIntegral n
sz = size segments - 1
segments = computeAs P $ A.map f (enumFromStepN Par a step (Sz (n + 1)))
area y0 y1 = step * (y0 + y1) / 2
areas = A.zipWith area (extract' 0 sz segments) (extract' 1 sz segments)
in A.sum areas
Because of list fusion, your list-based solution performs no allocation as such, so for simple cases it will be very fast. The solution above allocates an array of size n+1 in order to promote sharing and avoid evaluating the function twice. There is also extra cost due to scheduling, since forking off threads does not come for free. But in the end, for really expensive functions and very large n, it is possible to get a speedup of roughly 3x on a quad-core processor.
Below are some benchmarks of integrating a Gaussian function with n = 100000:
benchmarking Gaussian1D/list
time 3.657 ms (3.623 ms .. 3.687 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.627 ms (3.604 ms .. 3.658 ms)
std dev 80.50 μs (63.62 μs .. 115.4 μs)
benchmarking Gaussian1D/array Seq
time 3.408 ms (3.304 ms .. 3.523 ms)
0.987 R² (0.979 R² .. 0.994 R²)
mean 3.670 ms (3.578 ms .. 3.839 ms)
std dev 408.0 μs (293.8 μs .. 627.6 μs)
variance introduced by outliers: 69% (severely inflated)
benchmarking Gaussian1D/array Par
time 1.340 ms (1.286 ms .. 1.393 ms)
0.980 R² (0.967 R² .. 0.989 R²)
mean 1.393 ms (1.328 ms .. 1.485 ms)
std dev 263.3 μs (160.1 μs .. 385.6 μs)
variance introduced by outliers: 90% (severely inflated)
Side note suggestion. Switching to Simpson's rule will give you a better approximation. Implementation is available in massiv ;)
Edit
This is such a fun problem that I decided to see what it would take to implement it without any array allocations. Here is what I came up with:
integrateS :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateS n f a b =
let step = (b - a) / fromIntegral n
segments = A.map f (enumFromStepN Seq (a + step) step (Sz n))
area y0 y1 = step * (y0 + y1) / 2
sumWith (acc, y0) y1 =
let acc' = acc + area y0 y1
in acc' `seq` (acc', y1)
in fst $ A.foldlS sumWith (0, f a) segments
The above approach runs in constant memory, since the few arrays that do get created aren't real arrays backed by memory, but are delayed arrays instead. With a bit of trickery around the fold accumulator we can share the results, thus avoiding double function evaluation. This results in an astonishing speedup:
benchmarking Gaussian1D/array Seq no-alloc
time 1.788 ms (1.777 ms .. 1.799 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.787 ms (1.781 ms .. 1.795 ms)
std dev 23.85 μs (17.19 μs .. 31.96 μs)
The downside to the above approach is that it is not easily parallelizable, though it is not impossible. Brace yourself, here is a monstrosity that can run on 8 capabilities (hardcoded; in my case 4 cores with hyperthreading):
-- | Will not produce correct results if `n` is not divisible by 8
integrateN8 :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateN8 n f a b =
let k = 8
n' = n `div` k
step = (b - a) / fromIntegral n
segments =
makeArrayR D (ParN (fromIntegral k)) (Sz1 k) $ \i ->
let start = a + step * fromIntegral n' * fromIntegral i + step
in (f start, A.map f (enumFromStepN Seq (start + step) step (Sz (n' - 1))))
area y0 y1 = step * (y0 + y1) / 2
sumWith (acc, y0) y1 =
let acc' = acc + area y0 y1
in acc' `seq` (acc', y1)
partialResults =
computeAs U $ A.map (\(y0, arr) -> (y0, A.foldlS sumWith (0, y0) arr)) segments
combine (acc, y0) (y1, (acci, yn)) =
let acc' = acc + acci + area y0 y1
in acc' `seq` (acc', yn)
in fst $ foldlS combine (0, f a) partialResults
The only real array allocated is for keeping partialResults, which has a total of 16 Double elements. The speed improvement is not as drastic, but it is there nevertheless:
benchmarking Gaussian1D/array Par no-alloc
time 960.1 μs (914.3 μs .. 1.020 ms)
0.968 R² (0.944 R² .. 0.990 R²)
mean 931.8 μs (900.8 μs .. 976.3 μs)
std dev 129.2 μs (84.20 μs .. 198.8 μs)
variance introduced by outliers: 84% (severely inflated)

My default go-to for any map composition would be parMap from the Strategies API http://hackage.haskell.org/package/parallel-3.2.2.0/docs/Control-Parallel-Strategies.html#g:7 ; I'll add an example once I'm at my PC.
Edit:
You'd use parMap in the following way,
module Main where
import Control.Parallel.Strategies
main = putStrLn $ show $ integrate f 1.1 1.2
f :: Double -> Double
f x = x
integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b =
let
step = (b - a) / 1000
segments = [a + x * step | x <- [0..999]]
area x = step * (f x + f (x + step)) / 2
in sum $ parMap rpar area segments
Then compile with:
ghc -O2 -threaded -rtsopts Main.hs
and run using the +RTS -N flag to control the parallelism:
./Main +RTS -N -RTS
-N can be given a number, e.g. -N6 to run on 6 threads, or left without one to use all available cores.
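One thing to watch with parMap on 1000 tiny elements is that it creates one spark per element, and the bookkeeping can easily outweigh the work. A possible variation, assuming the same parallel package, groups the elements into chunks with parListChunk. The name integrateChunked and the chunk size of 100 are illustrative choices, not tuned values.

```haskell
module Main where

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Same trapezoid sum as above, but sparks are created per chunk of
-- 100 list elements instead of per element (chunk size is a guess).
integrateChunked :: (Double -> Double) -> Double -> Double -> Double
integrateChunked f a b =
  let step = (b - a) / 1000
      segments = [a + x * step | x <- [0 .. 999]]
      area x = step * (f x + f (x + step)) / 2
      areas = map area segments `using` parListChunk 100 rdeepseq
  in sum areas

main :: IO ()
main = print (integrateChunked (\x -> x * x) 10 20)
```

It is compiled and run with the same flags as the parMap version. Whether it actually beats the sequential code depends on how expensive f is.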

Related

Unexplainably incredible performance with Pythagorean triples in Haskell

Let's say we have this simple Haskell function that produces Pythagorean triples:
pytha :: [(Int, Int, Int)]
pytha = [(x, y, z)
| z <- [0..]
, x <- [1..z]
, y <- [x..z]
, x * x + y * y == z * z
]
and we'd like to benchmark how long it takes to produce, say, the first 100 triples. So (using the criterion library and assuming import Criterion.Main) we have this benchmark:
main :: IO ()
main = do
countStr <- readFile "count.txt"
defaultMain [ bgroup "pytha" [ bench countStr $ nf (`take` pytha) (read countStr) ] ]
where we even read the count from a file to make sure ghc does not try to evaluate pytha during compile time!
Doing echo 100 > count.txt, compiling the benchmark with -O2 and running on my machine (a 4.0 GHz Sandy Bridge CPU) shows some interesting numbers:
time 967.4 ns (957.6 ns .. 979.3 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 979.6 ns (967.9 ns .. 995.6 ns)
std dev 45.34 ns (33.96 ns .. 60.29 ns)
Slightly modifying this program to show how many triples were considered overall (by producing all the triples first, zipping the list with [0..] and then filtering out all non-Pythagorean triples and looking at the indices of the resulting ones) shows that almost 900000 triples were considered.
All this naturally raises the question: how does the code above manage to achieve 1000 triples/ns on a single core of a pretty standard CPU? Or is it just that my benchmark is wrong?
You need to use a function rather than a value that will be memoized.
pytha :: Int -> [(Int, Int, Int)]
pytha z_max =
[ (x, y, z)
| z <- [0..z_max]
, x <- [1..z]
, y <- [x..z]
, x * x + y * y == z * z
]
GHC isn't going to get clever enough to factor this into takeWhile from a constant list, so it should give a meaningful benchmark. Just make sure Criterion is in charge of passing z_max, which you can reasonably set to maxBound :: Int or some such.
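A benchmark along those lines could look like the sketch below, which assumes the criterion package. The upper bound is handed to nf as a separate argument, so each iteration applies the function afresh instead of reusing an already evaluated top-level list. The benchmark label and the parameter name zMax are illustrative.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)

-- Function version of the generator: nothing is shared between calls.
pytha :: Int -> [(Int, Int, Int)]
pytha zMax =
  [ (x, y, z)
  | z <- [0 .. zMax]
  , x <- [1 .. z]
  , y <- [x .. z]
  , x * x + y * y == z * z
  ]

main :: IO ()
main =
  defaultMain
    [ bgroup "pytha"
        -- criterion applies the function to maxBound on every iteration
        [ bench "first 100" $ nf (take 100 . pytha) maxBound ]
    ]
```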
By the way: you can make your implementation much less slow by using floating point operations to calculate much tighter bounds for y.
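Taking the tighter-bounds idea to its limit, y can be computed directly from x and z instead of being searched for. This is a sketch under the assumption that z * z stays well below 2^53, so that the Double square root of a perfect square is exact; pythaFast is an illustrative name.

```haskell
-- For each (z, x) there is at most one candidate y = sqrt (z*z - x*x),
-- so the innermost loop of the original disappears.
pythaFast :: Int -> [(Int, Int, Int)]
pythaFast zMax =
  [ (x, y, z)
  | z <- [1 .. zMax]
  , x <- [1 .. z]
  , let y2 = z * z - x * x
  , let y = floor (sqrt (fromIntegral y2 :: Double))
  , y >= x        -- keep the original ordering constraint x <= y
  , y * y == y2   -- reject candidates that are not perfect squares
  ]

main :: IO ()
main = print (take 5 (pythaFast 100))
```

The triples come out in the same order as in the original comprehension, because the loops over z and x are unchanged.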

Most efficient algorithm to find integer points within an ellipse

I'm trying to find all the integer lattice points within various 3D ellipses.
I would like my program to take an integer N, and count all the lattice points within the ellipses of the form ax^2 + by^2 + cz^2 = n, where a,b,c are fixed integers and n is between 1 and N. This program should then return N tuples of the form (n, numlatticePointsWithinEllipse n).
I'm currently doing it by counting the points on the ellipses ax^2 + by^2 + cz^2 = m, for m between 0 and n inclusive, and then summing over m. I'm also only looking at x, y and z all positive initially, and then adding in the negatives by permuting their signs later.
Ideally, I'd like to reach numbers of N = 1,000,000+ within the scale of hours
Taking a specific example of x^2 + y^2 + 3z^2 = N, here's the Haskell code I'm currently using:
import System.Environment
isqrt :: Int -> Int
isqrt 0 = 0
isqrt 1 = 1
isqrt n = head $ dropWhile (\x -> x*x > n) $ iterate (\x -> (x + n `div` x) `div` 2) (n `div` 2)
latticePointsWithoutNegatives :: Int -> [[Int]]
latticePointsWithoutNegatives 0 = [[0,0,0]]
latticePointsWithoutNegatives n = [[x,y,z] | x<-[0.. isqrt n], y<- [0.. isqrt (n - x^2)], z<-[max 0 (isqrt ((n-x^2 -y^2) `div` 3))], x^2 +y^2 + z^2 ==n]
latticePoints :: Int -> [[Int]]
latticePoints n = [ zipWith (*) [x1,x2,x3] y | [x1,x2,x3] <- (latticePointsWithoutNegatives n), y <- [[a,b,c] | a <- (if x1 == 0 then [0] else [-1,1]), b<-(if x2 == 0 then [0] else [-1,1]), c<-(if x3 == 0 then [0] else [-1,1])]]
latticePointsUpTo :: Int -> Int
latticePointsUpTo n = sum [length (latticePoints x) | x<-[0..n]]
listResults :: Int -> [(Int, Int)]
listResults n = [(x, latticePointsUpTo x) | x<- [1..n]]
main = do
args <- getArgs
let cleanArgs = read (head args)
print (listResults cleanArgs)
I've compiled this with
ghc -O2 latticePointsTest
but using the PowerShell "Measure-Command" command, I get the following results:
Measure-Command{./latticePointsTest 10}
TotalMilliseconds : 12.0901
Measure-Command{./latticePointsTest 100}
TotalMilliseconds : 12.0901
Measure-Command{./latticePointsTest 1000}
TotalMilliseconds : 31120.4503
and going any more orders of magnitude up takes us onto the scale of days, rather than hours or minutes.
Is there anything fundamentally wrong with the algorithm I'm using? Is there any core reason why my code isn't scaling well? Any guidance will be greatly appreciated. I may also want to process the data between "latticePoints" and "latticePointsUpTo", so I can't just rely entirely on clever number theoretic counting techniques - I need the underlying tuples preserved.
Some things I would try:
isqrt is not efficient for the range of values you are working with. Simply use the floating point sqrt function:
isqrt n = floor $ sqrt (fromIntegral n :: Double)
Alternatively, instead of computing integer square roots, use logic like this in your list comprehensions:
x <- takeWhile (\x -> x*x <= n) [0..],
y <- takeWhile (\y -> y*y <= n - x*x) [0..]
Also, I would use expressions like x*x instead of x^2.
Finally, why not compute the number of solutions with something like this:
sols a b c n =
length [ () | x <- takeWhile (\x -> a*x*x <= n) [0..]
, y <- takeWhile (\y -> a*x*x+b*y*y <= n) [0..]
, z <- takeWhile (\z -> a*x*x+b*y*y+c*z*z <= n) [0..]
]
This does not exactly compute the same answer that you want because it doesn't account for positive and negative solutions, but you could easily modify it to compute your answer. The idea is to use one list comprehension instead of iterating over various values of n and summing.
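To get back the N tuples (n, count) that the question asks for, the single-comprehension idea can be combined with a bucket array. Each non-negative solution is weighted by the number of sign variants it stands for and bucketed by its value m = a*x*x + b*y*y + c*z*z; a running sum over the buckets then gives the number of points within each ellipse. This is a sketch assuming a, b and c are positive. latticeCounts is an illustrative name, and accumArray comes from the array package that ships with GHC.

```haskell
import Data.Array (Array, accumArray, elems)

-- (n, number of integer points with a*x^2 + b*y^2 + c*z^2 <= n) for n = 1..nMax.
-- The origin (m = 0) is included in every count.
latticeCounts :: Int -> Int -> Int -> Int -> [(Int, Int)]
latticeCounts a b c nMax = drop 1 (zip [0 ..] cumulative)
  where
    -- a zero coordinate has one sign variant, a non-zero one has two
    weight :: Int -> Int
    weight 0 = 1
    weight _ = 2

    -- onSurface ! m = number of points lying exactly on the surface with value m
    onSurface :: Array Int Int
    onSurface =
      accumArray (+) 0 (0, nMax)
        [ (m, weight x * weight y * weight z)
        | x <- takeWhile (\x' -> a * x' * x' <= nMax) [0 ..]
        , y <- takeWhile (\y' -> a * x * x + b * y' * y' <= nMax) [0 ..]
        , z <- takeWhile (\z' -> a * x * x + b * y * y + c * z' * z' <= nMax) [0 ..]
        , let m = a * x * x + b * y * y + c * z * z
        ]

    cumulative = scanl1 (+) (elems onSurface)

main :: IO ()
main = print (latticeCounts 1 1 3 10)
```

The enumeration is done once for the largest n rather than once per n, which is where most of the time in the original program goes. If the points themselves are needed, the comprehension can yield them instead of their weights.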
Finally, I think using floor and sqrt to compute the integer square root is completely safe in this case. This code verifies that the integer square root computed using sqrt of (x*x) equals x for all x <= 3037000499:
testAll :: Int -> IO ()
testAll n =
print $ head [ (x,a) | x <- [n,n-1 .. 1], let a = floor $ sqrt (fromIntegral (x*x) :: Double), a /= x ]
main = testAll 3037000499
Note I am running this on a 64-bit GHC - otherwise just use Int64 instead of Int since Doubles are 64-bit in either case. Takes only a minute or so to verify.
This shows that taking the floor of sqrt y will never result in the wrong answer if y <= 3037000499^2.

Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

I took a bit of a long break from playing with Haskell, and I'm starting to get back into it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but similarly, slow code will have to be dealt with eventually, and I just can't get rid of my preconceived notions about languages that are so high-level being super slow.
So, in that vein, I started playing with test cases. One of them that I was working on was a naïve, straight-forward implementation of the classical 4th Order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y; y(0) = 1, which gives y = e^-t. I wrote a completely straight forward implementation in both Haskell and C (which I'll post in a moment). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.
I realize that it isn't 100% fair to compare the performance of 2 different languages; and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I was more cognizant of what I was doing then I could eke a bit more speed out of this particular Haskell implementation.
The Haskell version was compiled with -O2 under GHC 7.6.3 on OS X 10.8.4, the C version was compiled with Clang and I gave it no flags. The Haskell version averaged around 0.016 seconds when tracked with time, and the C version around 0.006 seconds.
These timings take in to account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead, but I did do some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p and also looking at the GC stats with +RTS -s. I didn't really understand all that much of what I saw, but it seemed to be that my GC wasn't out of control though could probably get reined in a little bit (5%, Productivity at ~93% User, ~85% total elapsed) and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it but it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty in my usage of a List, both in the constant consing and the laziness in dumping the results to stdout.
What exactly am I doing wrong? What library functions or Monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?
RK.hs
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
where k1 = y' t y
k2 = y' (t + h/2) (y + ((h/2) * k1))
k3 = y' (t + h/2) (y + ((h/2) * k2))
k4 = y' (t + h) (y + (h * k3))
iterateRK y' h t0 y0 = y0:(iterateRK y' h t1 y1)
where t1 = t0 + h
y1 = rk4 y' h t0 y0
main = do
let y' t y = -y
let h = 1e-3
let y0 = 1.0
let t0 = 0
let results = iterateRK y' h t0 y0
(putStrLn . show) (take 1000 results)
RK.c
#include<stdio.h>
#define ITERATIONS 1000
double rk4(double f(double t, double x), double h, double tn, double yn)
{
double k1, k2, k3, k4;
k1 = f(tn, yn);
k2 = f((tn + h/2), yn + (h/2 * k1));
k3 = f((tn + h/2), yn + (h/2 * k2));
k4 = f(tn + h, yn + h * k3);
return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}
double expDot(double t, double x)
{
return -x;
}
int main()
{
double t0, y0, tn, yn, h, results[ITERATIONS];
int i;
h = 1e-3;
y0 = 1.0;
t0 = 0.0;
yn = y0;
tn = t0; /* tn was otherwise read uninitialized below */
for(i = 0; i < ITERATIONS; i++)
{
results[i] = yn;
yn = rk4(expDot, h, tn, yn);
tn += h;
}
for(i = 0; i < ITERATIONS; i++)
{
printf("%.10lf", results[i]);
if(i != ITERATIONS - 1)
printf(", ");
else
printf("\n");
}
return 0;
}
Using your program with increased size gives a stack overflow:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
This is probably caused by too much laziness.
Looking at the heap profile broken down by type, you get the following:
(Note: I modified your program as leftaroundabout pointed out)
This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding on to your Double values longer than required. Making the arguments strict solves the issue:
{-# LANGUAGE BangPatterns #-}
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
where t1 = t0 + h
y1 = rk4 y' h t0 y0
With this modification, the new heap profile looks like this:
This looks much better, the memory usage is much lower. -sstderr also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:
%GC time 2.5% (2.9% elapsed)
Now, the haskell version is still about 40% slower than the C one (using user time):
$ time ./tesths; time ./testc
2.47e-321
./tesths 0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc 0,51s user 0,01s system 95% cpu 0,549 total
Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:
time ./tesths; time ./testc
2.47e-321
./tesths 18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc 16,98s user 0,14s system 98% cpu 17,458 total
This is only a difference of about 9%.
But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:
{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
where k1 = y' t y
k2 = y' (t + h/2) (y + ((h/2) * k1))
k3 = y' (t + h/2) (y + ((h/2) * k2))
k4 = y' (t + h) (y + (h * k3))
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))
main :: IO ()
main = do
let y' t y = -y
let h = 1e-3
let y0 = 1.0
let t0 = 0
let results = iterateRK y' h t0 y0
print $ S.head $ (S.drop (pred 10000000) results)
I compiled with:
$ ghc -O2 ./test.hs -o tesths -fllvm
Here are the timings:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc 16,97s user 0,18s system 97% cpu 17,538 total
Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and lose the nice abstraction. Even then, it's only as fast as Haskell:
$ time ./tesths; time ./testc
2.47e-321
./tesths 15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc 15,88s user 0,02s system 98% cpu 16,175 total
I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings which are lists of Chars and this makes output quite slow. Also Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use criterion library for that:
import Criterion.Main
-- ...
benchmarkIRK n =
let y' t y = -y
h = 1e-3
y0 = 1.0
t0 = 0
in take n (iterateRK y' h t0 y0)
benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK
main = defaultMain
[ bench "rk" $ nf benchmarkIRK 1000
, bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
]
My measurements show that the actual computation takes something around 27 us, computing and printing takes around 350 us and running the whole program (without criterion) takes around 30 ms. So the actual computation takes just 1/1000 of the whole time and printing it just 1/100.
You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.
The timings of your programs have very little to do with the languages' performance, and everything to do with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡ print) from your Haskell program, and you'll get
$ time RK-hs
1.0
real 0m0.004s
user 0m0.000s
sys 0m0.000s
... which isn't really significant, though – 1000 steps is far too little. With
main :: IO ()
main = do
let y' t y = -y
h = 1e-7
y0 = 1.0
t0 = 0
results = iterateRK y' h t0 y0
print . head $ drop 10000000 results
you get
$ time RK-hs +RTS -K100M
0.36787944117145965
real 0m0.653s
user 0m0.572s
sys 0m0.076s
while the equivalent in C has
$ time RK-c
Segmentation fault (core dumped)
oh great... ...but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have
$ time RK-c
0.3678794412
real 0m0.152s
user 0m0.148s
sys 0m0.000s
so this is indeed faster, significantly now, than the Haskell version.
When even C has memory problems storing a whole lot of intermediate results (if you put them on the stack), this is worse in Haskell: each list node has to be heap-allocated separately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.
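If only the final value is needed, the list can be dropped altogether in favour of a strict accumulating loop, which sidesteps the per-node allocation described above. A minimal sketch: rkNth is an illustrative name, and rk4 is repeated from the question so that the block stands alone.

```haskell
{-# LANGUAGE BangPatterns #-}

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
  where
    k1 = y' t y
    k2 = y' (t + h / 2) (y + (h / 2) * k1)
    k3 = y' (t + h / 2) (y + (h / 2) * k2)
    k4 = y' (t + h) (y + h * k3)

-- Value after n steps, carried in strict accumulators; no list is built.
rkNth :: (Double -> Double -> Double) -> Double -> Double -> Double -> Int -> Double
rkNth y' h = go
  where
    go !t !y n
      | n <= 0 = y
      | otherwise = go (t + h) (rk4 y' h t y) (n - 1)

main :: IO ()
main = print (rkNth (\_ y -> negate y) 1e-7 0 1.0 10000000)
```

This gives up the decoupling between producing and consuming the results that the list version offers, so it is a trade-off rather than a strict improvement.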

More efficient algorithm performs worse in Haskell

A friend of mine showed me a homework exercise from a C++ course he attends. Since I already know C++ but have just started learning Haskell, I tried to solve the exercise the "Haskell way".
These are the exercise instructions (I translated from our native language so please comment if the instructions aren't clear):
Write a program which reads non-zero coefficients (A,B,C,D) from the user and places them in the following equation:
A*x + B*y + C*z = D
The program should also read from the user N, which represents a range. The program should find all possible integral solutions for the equation in the range -N/2 to N/2.
For example:
Input: A = 2,B = -3,C = -1, D = 5, N = 4
Output: (-1,-2,-1), (0,-2, 1), (0,-1,-2), (1,-1, 0), (2,-1,2), (2,0, -1)
The most straight-forward algorithm is to try all possibilities by brute force. I implemented it in Haskell in the following way:
triSolve :: Integer -> Integer -> Integer -> Integer -> Integer -> [(Integer,Integer,Integer)]
triSolve a b c d n =
let equation x y z = (a * x + b * y + c * z) == d
minN = div (-n) 2
maxN = div n 2
in [(x,y,z) | x <- [minN..maxN], y <- [minN..maxN], z <- [minN..maxN], equation x y z]
So far so good, but the exercise instructions note that a more efficient algorithm can be implemented, so I thought how to make it better. Since the equation is linear, based on the assumption that Z is always the first to be incremented, once a solution has been found there's no point to increment Z. Instead, I should increment Y, set Z to the minimum value of the range and keep going. This way I can save redundant executions.
Since there are no loops in Haskell (to my understanding at least) I realized that such algorithm should be implemented by using a recursion. I implemented the algorithm in the following way:
solutions :: (Integer -> Integer -> Integer -> Bool) -> Integer -> Integer -> Integer -> Integer -> Integer -> [(Integer,Integer,Integer)]
solutions f maxN minN x y z
| solved = (x,y,z):nextCall x (y + 1) minN
| x >= maxN && y >= maxN && z >= maxN = []
| z >= maxN && y >= maxN = nextCall (x + 1) minN minN
| z >= maxN = nextCall x (y + 1) minN
| otherwise = nextCall x y (z + 1)
where solved = f x y z
nextCall = solutions f maxN minN
triSolve' :: Integer -> Integer -> Integer -> Integer -> Integer -> [(Integer,Integer,Integer)]
triSolve' a b c d n =
let equation x y z = (a * x + b * y + c * z) == d
minN = div (-n) 2
maxN = div n 2
in solutions equation maxN minN minN minN minN
Both yield the same results. However, trying to measure the execution time yielded the following results:
*Main> length $ triSolve' 2 (-3) (-1) 5 100
3398
(2.81 secs, 971648320 bytes)
*Main> length $ triSolve 2 (-3) (-1) 5 100
3398
(1.73 secs, 621862528 bytes)
Meaning that the dumb algorithm actually performs better than the more sophisticated one. Assuming my algorithm is correct (which I hope won't turn out to be wrong :) ), I assume that the second algorithm suffers from overhead created by the recursion, which the first algorithm doesn't, since it's implemented using a list comprehension.
Is there a way to implement in Haskell a better algorithm than the dumb one?
(Also, I'll be glad to receive general feedback about my coding style)
Of course there is. We have:
a*x + b*y + c*z = d
and as soon as we assume values for x and y, we have that
a*x + b*y = n
where n is a number we know.
Hence
c*z = d - n
z = (d - n) / c
And we keep only integral zs.
It's worth noticing that list comprehensions are given special treatment by GHC, and are generally very fast. This could explain why your triSolve (which uses a list comprehension) is faster than triSolve' (which doesn't).
For example, the solution
solve :: Integer -> Integer -> Integer -> Integer -> Integer -> [(Integer,Integer,Integer)]
-- "Buffalo buffalo buffalo buffalo Buffalo buffalo buffalo..."
solve a b c d n =
[(x,y,z) | x <- vals, y <- vals
, let p = a*x +b*y
, let z = (d - p) `div` c
, z >= minN, z <= maxN, c * z == d - p ]
where
minN = negate (n `div` 2)
maxN = (n `div` 2)
vals = [minN..maxN]
runs fast on my machine:
> length $ solve 2 (-3) (-1) 5 100
3398
(0.03 secs, 4111220 bytes)
whereas the equivalent code written using do notation:
solveM :: Integer -> Integer -> Integer -> Integer -> Integer -> [(Integer,Integer,Integer)]
solveM a b c d n = do
x <- vals
y <- vals
let p = a * x + b * y
z = (d - p) `div` c
guard $ z >= minN
guard $ z <= maxN
guard $ z * c == d - p
return (x,y,z)
where
minN = negate (n `div` 2)
maxN = (n `div` 2)
vals = [minN..maxN]
takes twice as long to run and uses twice as much memory:
> length $ solveM 2 (-3) (-1) 5 100
3398
(0.06 secs, 6639244 bytes)
Usual caveats about testing within GHCI apply -- if you really want to see the difference, you need to compile the code with -O2 and use a decent benchmarking library (like Criterion).

Performance of reservoir sampling vs. getting the length of a list and picking random elements

I have written two functions to pick a random element out of a list of unknown length. The first uses reservoir sampling (with a reservoir of size 1), and the second gets the length of the list to pick a random index and return it. For some reason, the former is much faster.
The first function uses a single traversal and picks each element with probability (1/i), where i is the index of the element in the list. This results in an equal probability of picking each element.
pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
stdgen <- newStdGen
return (pickRandom' xs x 1 stdgen)
-- Pick a random number using reservoir sampling
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
let (rand, gen') = randomR (0, n) gen in
if (rand == 0) then
pickRandom' xs x (n + 1) gen' -- Update value
else
pickRandom' xs xi (n + 1) gen' -- Keep previous value
The second version traverses the list once to get its length, and then picks an index between 0 and the length of the input list minus 1 to get one of the elements, again with equal probability. The expected number of traversals of the list is 1.5:
-- Traverses the list twice
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
gen <- newStdGen
(e, _) <- return $ randomR (0, (length xs) - 1) gen
return $ xs !! e
Here is the code I use for benchmarking these two functions:
main :: IO ()
main = do
gen <- newStdGen
let size = 2097152
inputList = getRandList gen size
defaultMain [ bench "Using length" (pickRandomWithLen inputList)
, bench "Using reservoir" (pickRandom inputList)
]
Here is a stripped output:
benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950
benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950
In other words, the first function appears to be about 200,000 times faster than the second (82.7 ns versus 17.1 ms). I expected the runtime to be influenced mainly by random number generation and the number of list traversals (1 vs. 1.5). What other factors can explain such a huge difference?
Your benchmarked actions don't actually evaluate the result,
pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
stdgen <- newStdGen
return (pickRandom' xs x 1 stdgen)
only gets a new StdGen and returns a thunk. That's pretty immediate.
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
gen <- newStdGen
(e, _) <- return $ randomR (0, (length xs) - 1) gen
return $ xs !! e
computes the length of the list and then returns a thunk, that is of course much slower.
Forcing both to evaluate the result,
return $! ...
makes the length-based version the much faster of the two,
benchmarking Using length
mean: 14.65655 ms, lb 14.14580 ms, ub 15.16942 ms, ci 0.950
std dev: 2.631668 ms, lb 2.378186 ms, ub 2.937339 ms, ci 0.950
variance introduced by outliers: 92.581%
variance is severely inflated by outliers
benchmarking Using reservoir
collecting 100 samples, 1 iterations each, in estimated 47.00930 s
mean: 451.5571 ms, lb 448.4355 ms, ub 455.7812 ms, ci 0.950
std dev: 18.50427 ms, lb 14.45557 ms, ub 24.74350 ms, ci 0.950
found 4 outliers among 100 samples (4.0%)
2 (2.0%) high mild
2 (2.0%) high severe
variance introduced by outliers: 38.511%
variance is moderately inflated by outliers
(after forcing the input list to be evaluated before by printing its sum), because that needs only one call to the PRNG, while the reservoir sampling uses length list - 1 calls.
The difference would probably be smaller if a faster PRNG than a StdGen is used.
Indeed, using System.Random.Mersenne instead of StdGen (requires that pickRandom' has IO a result type, and since it offers no generation in a specific range but only default range skews the distribution of picked elements a little, but since we're only interested in the time needed for the pseudo-random number generation, that's not important), the time for the reservoir sampling drops to
mean: 51.83185 ms, lb 51.77620 ms, ub 51.91259 ms, ci 0.950
std dev: 482.4712 us, lb 368.4433 us, ub 649.1758 us, ci 0.950
(the pickRandomWithLen time doesn't change measurably, of course, since it uses only one generation). A roughly nine-fold speedup, that shows that the pseudo-random generation is the dominant factor.
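For reference, the forcing described above can be sketched for the reservoir version as follows, assuming the random package. pickRandomStrict is an illustrative name, and pickRandom' is repeated from the question so that the block stands alone.

```haskell
import System.Random (RandomGen, mkStdGen, newStdGen, randomR)

-- Reservoir sampling with a reservoir of size 1, as in the question.
pickRandom' :: RandomGen g => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x : xs) xi n gen =
  let (rand, gen') = randomR (0, n) gen
  in if rand == 0
       then pickRandom' xs x (n + 1) gen'   -- replace the kept value
       else pickRandom' xs xi (n + 1) gen'  -- keep the previous value

-- `return $!` evaluates the pick to weak head normal form inside the IO
-- action, so a benchmark of this action includes the list traversal.
pickRandomStrict :: [a] -> IO a
pickRandomStrict [] = error "List is empty"
pickRandomStrict (x : xs) = do
  gen <- newStdGen
  return $! pickRandom' xs x 1 gen

main :: IO ()
main = do
  print (pickRandom' "bcd" 'a' 1 (mkStdGen 42)) -- fixed seed, pure call
  pickRandomStrict [1 .. 100 :: Int] >>= print
```

Note that $! only forces the outermost constructor of the result, which is enough here because choosing the element requires walking the whole list.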

Resources