Getting functional sieve of Eratosthenes fast

Getting functional sieve of Eratosthenes fast - performance

I read this other post about a F# version of this algorithm. I found it very elegant and tried to combine some ideas of the answers.
Although I optimized it to make fewer checks (check only numbers around 6) and leave out unnecessary caching, it is still painfully slow. Calculating the 10,000th prime already take more than 5 minutes. Using the imperative approach, I can test all 31-bit integers in not that much more time.
So my question is if I am missing something that makes all this so slow. For example in another post someone was speculating that LazyList may use locking. Does anyone have an idea?
As StackOverflow's rules say not to post new questions as answers, I feel I have to start a new topic for this.
Here's the code:
#r "FSharp.PowerPack.dll"
open Microsoft.FSharp.Collections
let squareLimit = System.Int32.MaxValue |> float32 |> sqrt |> int
let around6 = LazyList.unfold (fun (candidate, (plus, next)) ->
if candidate > System.Int32.MaxValue - plus then
None
else
Some(candidate, (candidate + plus, (next, plus)))
) (5, (2, 4))
let (|SeqCons|SeqNil|) s =
if Seq.isEmpty s then SeqNil
else SeqCons(Seq.head s, Seq.skip 1 s)
let rec lazyDifference l1 l2 =
if Seq.isEmpty l2 then l1 else
match l1, l2 with
| LazyList.Cons(x, xs), SeqCons(y, ys) ->
if x < y then
LazyList.consDelayed x (fun () -> lazyDifference xs l2)
elif x = y then
lazyDifference xs ys
else
lazyDifference l1 ys
| _ -> LazyList.empty
let lazyPrimes =
let rec loop = function
| LazyList.Cons(p, xs) as ll ->
if p > squareLimit then
ll
else
let increment = p <<< 1
let square = p * p
let remaining = lazyDifference xs {square..increment..System.Int32.MaxValue}
LazyList.consDelayed p (fun () -> loop remaining)
| _ -> LazyList.empty
loop (LazyList.cons 2 (LazyList.cons 3 around6))

If you are calling Seq.skip anywhere, then there's about a 99% chance that you have an O(N^2) algorithm. For nearly every elegant functional lazy Project Euler solution involving sequences, you want to use LazyList, not Seq. (See Juliet's comment link for more discussion.)

Even if you succeed in taming the strange quadratic F# sequences design issues, there is certain algorithmic improvements still ahead. You are working in (...((x-a)-b)-...) manner here. x, or around6, is getting deeper and deeper, but it's the most frequently-producing sequence. Transform it into (x-(a+b+...)) scheme -- or even use a tree structure there -- to gain an improvement in time complexity (sorry, that page is in Haskell). This gets actually very close to the complexity of imperative sieve, although still mush slower than the baseline C++ code.
Measuring local empirical orders of growth as O(n^a) <--> a = log(t_2/t_1) / log(n_2/n_1) (in n primes produced), the ideal n log(n) log(log(n)) translates into O(n^1.12) .. O(n^1.085) behaviour on n=10^5..10^7 range. A simple C++ baseline imperative code achieves O(n^1.45 .. 1.18 .. 1.14) while tree-merging code, as well as priority-queue based code, both exhibit steady O(n^1.20) behaviour, more or less. Of course C++ is ~5020..15 times faster, but that's mostly just a "constant factor". :)

Related

How to execute two functions at once in f#?

I am trying to make my quick sort program in F# work in parallel by making two tasks execute in parallel. I tried looking at Microsofts online documentation but it didn't really help me! Here is my code without parallelism:
let rec quicksort (list: int list) =
match list with
| [] -> [] // if empty list, yield nothing
// otherwise, split the list into a head and tial, and the head is the pivot value
| pivot :: tail ->
// Using List.partition to partition the list into lower and upper
let lower, upper = List.partition (fun x -> x < pivot) tail
// Recursive calls, final product will be low list + pivot + high list
quicksort lower # [pivot] # quicksort upper
I tried implementing something like
Async.Parallel [quicksort lower; # [pivot] # quicksort upper;] |> Async.RunSynchronously
But I get syntax errors referring to the type. What am I missing here?

Parallelizing compute-bound code, such as sorting, should be done rather with Array.Parallel.map instead of Async.Parallel, which is for improving throughput of IO-bound code.
You could parallelize your function as follows with Array.Parallel.map.
let rec quicksort (list: int list) =
match list with
| [] -> [] /
| pivot :: tail ->
let lower, upper = List.partition (fun x -> x < pivot) tail
let sortedArrays = Array.Parallel.map quicksort [| lower; upper |]
sortedArrays.[0] # [pivot] # sortedArrays.[1]
However, you should NOT do this, because the overhead of parallelization is much higher than the benefit of parallelization and the parallelized version is actually much slower.
If you want to speed up quicksort algorithm biggest gains can be achieved by avoiding allocating objects (lists) during the algorithm. Using an array and mutating it in place is the way to go :)

As mentioned by #hvester, adding parallelization to quicksort in this way will not help you much. The implementation is slow because it uses lists and allocations, not because of actual CPU bounds.
That said, if this was just an illustration to look at different ways of parallelizing F# code, then a good alternative to using Array.Parallel.map would be to use tasks:
open System.Threading.Tasks
let rec quicksort (list: int list) =
match list with
| [] -> []
| pivot :: tail ->
let lower, upper = List.partition (fun x -> x < pivot) tail
let lowerRes = Task.Factory.StartNew(fun _ -> quicksort lower)
let upperRes = quicksort upper
lowerRes.Result # [pivot] # upperRes
Tasks let you start work in background using StartNew and then wait for result by accessing the Result property. I think this would be more appropriate in scenarios like this. Array.Parallel.map is more intended for doing parallel processing over larger arrays.

Breaking lists at index

I have a performance question today.
I am making a (Haskell) program and, when profiling, I saw that most of the time is spent in the function you can find below. Its purpose is to take the nth element of a list and return the list without it besides the element itself. My current (slow) definition is as follows:
breakOn :: Int -> [a] -> (a,[a])
breakOn 1 (x:xs) = (x,xs)
breakOn n (x:xs) = (y,x:ys)
where
(y,ys) = breakOn (n-1) xs
The Int argument is known to be in the range 1..n where n is the length of the (never null) list (x:xs), so the function never arises an error.
However, I got a poor performance here. My first guess is that I should change lists for another structure. But, before start picking different structures and testing code (which will take me lot of time) I wanted to ask here for a third person opinion. Also, I'm pretty sure that I'm not doing it in the best way. Any pointers are welcome!
Please, note that the type a may not be an instance of Eq.
Solution
I adapted my code tu use Sequences from the Data.Sequence module. The result is here:
import qualified Data.Sequence as S
breakOn :: Int -> Seq a -> (a,Seq a)
breakOn n xs = (S.index zs 0, ys <> (S.drop 1 zs))
where
(ys,zs) = S.splitAt (n-1) xs
However, I still accept further suggestions of improvement!

Yes, this is inefficient. You can do a bit better by using splitAt (which unboxes the number during the recursive bit), a lot better by using a data structure with efficient splitting, e.g. a fingertree, and best by massaging the context to avoid needing this operation. If you post a bit more context, it may be possible to give more targeted advice.

Prelude functions are generally pretty efficient. You could rewrite your function using splitAt, as so:
breakOn :: Int -> [a] -> (a,[a])
breakOn n xs = (z,ys++zs)
where
(ys,z:zs) = splitAt (n-1) xs

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
-- pascal n returns the nth entry on the main diagonal of pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n = p ! (n,n) where
p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
f :: (Int,Int) -> Int
f (_,0) = 1
f (0,_) = 1
f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST Monad, which I'm sure could be made to run the program only slighter slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array

Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)
pascal :: Int -> Int
pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
go !i !prevRow
| i <= n = go (i+1) (scanl f 1 (tail prevRow))
| otherwise = prevRow ! n
f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.

1 Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem, the values are written, and read directly without requiring further evaluation. As it would happen with an STUArray. Does
p = runSTUArray $ do
arr <- newArray ((0,0),(n,n)) 1
forM_ [1 .. n] $ \i ->
forM_ [1 .. n] $ \j -> do
a <- readArray arr (i,j-1)
b <- readArray arr (i-1,j)
writeArray arr (i,j) $! (a+b) `rem` 1000000
return arr
really look so bad?
2 Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)
gets the Haskell versions on par with C.

The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
import qualified Data.Vector.Unboxed as V
combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000
pascal n = V.last $ go n where
go 0 = V.replicate (n+1) 1
go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
pascal' :: Integer -> Integer
pascal :: Int -> Int
pascal' n = product [n+1..n*2] `div` product [2..n]
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral

Performance comparison of two implementations of a primes filter

I have two programs to find prime numbers (just an exercise, I'm learning Haskell). "primes" is about 10X faster than "primes2", once compiled with ghc (with flag -O). However, in "primes2", I thought it would consider only prime numbers for the divisor test, which should be faster than considering odd numbers in "isPrime", right? What am I missing?
isqrt :: Integral a => a -> a
isqrt = floor . sqrt . fromIntegral
isPrime :: Integral a => a -> Bool
isPrime n = length [i | i <- [1,3..(isqrt n)], mod n i == 0] == 1
primes :: Integral a => a -> [a]
primes n = [2,3,5,7,11,13] ++ (filter (isPrime) [15,17..n])
primes2 :: Integral a => a -> [a]
primes2 n = 2 : [i | i <- [3,5..n], all ((/= 0) . mod i) (primes2 (isqrt i))]

I think what's happening here is that isPrime is a simple loop, whereas primes2 is calling itself recursively — and its recursion pattern looks exponential to me.
Searching through my old source code, I found this code:
primes :: [Integer]
primes = 2 : filter isPrime [3,5..]
isPrime :: Integer -> Bool
isPrime x = all (\n -> x `mod` n /= 0) $
takeWhile (\n -> n * n <= x) primes
This tests each possible prime x only against the primes below sqrt(x), using the already generated list of primes. So it probably doesn't test any given prime more than once.
Memoization in Haskell:
Memoization in Haskell is generally explicit, not implicit. The compiler won't "do the right thing" but it will only do what you tell it to. When you call primes2,
*Main> primes2 5
[2,3,5]
*Main> primes2 10
[2,3,5,7]
Each time you call the function it calculates all of its results all over again. It has to. Why? Because 1) You didn't make it save its results, and 2) the answer is different each time you call it.
In the sample code I gave above, primes is a constant (i.e. it has arity zero) so there's only one copy of it in memory, and its parts only get evaluated once.
If you want memoization, you need to have a value with arity zero somewhere in your code.

I like what Dietrich has done with the memoization, but I think theres a data structure issue here too. Lists are just not the ideal data structure for this. They are, by necessity, lisp style cons cells with no random access. Set seems better suited to me.
import qualified Data.Set as S
sieve :: (Integral a) => a -> S.Set a
sieve top = let l = S.fromList (2:3:([5,11..top]++[7,13..top]))
iter s c
| cur > (div (S.findMax s) 2) = s
| otherwise = iter (s S.\\ (S.fromList [2*cur,3*cur..top])) (S.deleteMin c)
where cur = S.findMin c
in iter l (l S.\\ (S.fromList [2,3]))
I know its kind of ugly, and not too declarative, but it runs rather quickly. Im looking into a way to make this nicer looking using Set.fold and Set.union over the composites. Any other ideas for neatening this up would be appreciated.
PS - see how (2:3:([5,11..top]++[7,13..top])) avoids unnecessary multiples of 3 such as the 15 in your primes. Unfortunately, this ruins your ordering if you work with lists and you sign up for a sorting, but for sets thats not an issue.

What's the way to determine if an Int is a perfect square in Haskell?

I need a simple function
is_square :: Int -> Bool
which determines if an Int N a perfect square (is there an integer x such that x*x = N).
Of course I can just write something like
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
but it looks terrible! Maybe there is a common simple way to implement such a predicate?

Think of it this way, if you have a positive int n, then you're basically doing a binary search on the range of numbers from 1 .. n to find the first number n' where n' * n' = n.
I don't know Haskell, but this F# should be easy to convert:
let is_perfect_square n =
let rec binary_search low high =
let mid = (high + low) / 2
let midSquare = mid * mid
if low > high then false
elif n = midSquare then true
else if n < midSquare then binary_search low (mid - 1)
else binary_search (mid + 1) high
binary_search 1 n
Guaranteed to be O(log n). Easy to modify perfect cubes and higher powers.

There is a wonderful library for most number theory related problems in Haskell included in the arithmoi package.
Use the Math.NumberTheory.Powers.Squares library.
Specifically the isSquare' function.
is_square :: Int -> Bool
is_square = isSquare' . fromIntegral
The library is optimized and well vetted by people much more dedicated to efficiency then you or I. While it currently doesn't have this kind of shenanigans going on under the hood, it could in the future as the library evolves and gets more optimized. View the source code to understand how it works!
Don't reinvent the wheel, always use a library when available.

I think the code you provided is the fastest that you are going to get:
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
The complexity of this code is: one sqrt, one double multiplication, one cast (dbl->int), and one comparison. You could try to use other computation methods to replace the sqrt and the multiplication with just integer arithmetic and shifts, but chances are it is not going to be faster than one sqrt and one multiplication.
The only place where it might be worth using another method is if the CPU on which you are running does not support floating point arithmetic. In this case the compiler will probably have to generate sqrt and double multiplication in software, and you could get advantage in optimizing for your specific application.
As pointed out by other answer, there is still a limitation of big integers, but unless you are going to run into those numbers, it is probably better to take advantage of the floating point hardware support than writing your own algorithm.

In a comment on another answer to this question, you discussed memoization. Keep in mind that this technique helps when your probe patterns exhibit good density. In this case, that would mean testing the same integers over and over. How likely is your code to repeat the same work and thus benefit from caching answers?
You didn't give us an idea of the distribution of your inputs, so consider a quick benchmark that uses the excellent criterion package:
module Main
where
import Criterion.Main
import Random
is_square n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n::Double)
is_square_mem =
let check n = sq * sq == n
where sq = floor $ sqrt $ (fromIntegral n :: Double)
in (map check [0..] !!)
main = do
g <- newStdGen
let rs = take 10000 $ randomRs (0,1000::Int) g
direct = map is_square
memo = map is_square_mem
defaultMain [ bench "direct" $ whnf direct rs
, bench "memo" $ whnf memo rs
]
This workload may or may not be a fair representative of what you're doing, but as written, the cache miss rate appears too high:

Wikipedia's article on Integer Square Roots has algorithms can be adapted to suit your needs. Newton's method is nice because it converges quadratically, i.e., you get twice as many correct digits each step.
I would advise you to stay away from Double if the input might be bigger than 2^53, after which not all integers can be exactly represented as Double.

Oh, today I needed to determine if a number is perfect cube, and similar solution was VERY slow.
So, I came up with a pretty clever alternative
cubes = map (\x -> x*x*x) [1..]
is_cube n = n == (head $ dropWhile (<n) cubes)
Very simple. I think, I need to use a tree for faster lookups, but now I'll try this solution, maybe it will be fast enough for my task. If not, I'll edit the answer with proper datastructure

Sometimes you shouldn't divide problems into too small parts (like checks is_square):
intersectSorted [] _ = []
intersectSorted _ [] = []
intersectSorted xs (y:ys) | head xs > y = intersectSorted xs ys
intersectSorted (x:xs) ys | head ys > x = intersectSorted xs ys
intersectSorted (x:xs) (y:ys) | x == y = x : intersectSorted xs ys
squares = [x*x | x <- [ 1..]]
weird = [2*x+1 | x <- [ 1..]]
perfectSquareWeird = intersectSorted squares weird

There's a very simple way to test for a perfect square - quite literally, you check if the square root of the number has anything other than zero in the fractional part of it.
I'm assuming a square root function that returns a floating point, in which case you can do (Psuedocode):
func IsSquare(N)
sq = sqrt(N)
return (sq modulus 1.0) equals 0.0

It's not particularly pretty or fast, but here's a cast-free, FPA-free version based on Newton's method that works (slowly) for arbitrarily large integers:
import Control.Applicative ((<*>))
import Control.Monad (join)
import Data.Ratio ((%))
isSquare = (==) =<< (^2) . floor . (join g <*> join f) . (%1)
where
f n x = (x + n / x) / 2
g n x y | abs (x - y) > 1 = g n y $ f n y
| otherwise = y
It could probably be sped up with some additional number theory trickery.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Getting functional sieve of Eratosthenes fast - performance

If you are calling Seq.skip anywhere, then there's about a 99% chance that you have an O(N^2) algorithm. For nearly every elegant functional lazy Project Euler solution involving sequences, you want to use LazyList, not Seq. (See Juliet's comment link for more discussion.)

Related

How to execute two functions at once in f#?

Breaking lists at index

How does one write efficient Dynamic Programming algorithms in Haskell?

Performance comparison of two implementations of a primes filter

What's the way to determine if an Int is a perfect square in Haskell?

Categories

Resources