Haskell PINNED or STACK memory for performance - performance

I'm trying to benefit from an optimisation that sometimes occurs in GHC (8.4.3), where a "build" of a large volume of data is put in PINNED memory. (I may not have all the terminology correct here). Here's a simple example:
Pinned1.hs:
main = print $ sum $ tail ([1..100000000] :: [Int])
then:
ghc -O2 Pinned1.hs -prof -rtsopts
Pinned1 +RTS -hc -p -xt
hp2ps -e8in -c Pinned1.hp
shows ~40K PINNED and virtually no STACK usage, and Pinned1 +RTS -hd -p -xt shows the ~40K is ARR_WORDS.
The Pinned1.prof shows:
total time = 2.14 secs (2137 ticks # 1000 us, 1 processor)
total alloc = 8,000,046,088 bytes (excludes profiling overheads)
Having looked at the -sdump-simpl, I can see the kind of code that leads to this. Here's a slightly more complex example, back-translated from Core into Haskell code, where the same thing happens:
Pinned2.hs:
main = print $ sum $ snd $ wgoC 1 0
wgoC :: Int -> Int -> (Int, [Int])
wgoC n finalState =
let (nxt, ys') = case n of 100000000 -> (finalState, [])
_ -> wgoC (n+1) finalState
in (n, n + nxt * 9: ys')
wgoC passes the next n back, which is used in the calculation of the values in the list. It reports ~40K PINNED/ARR_WORDS memory, and almost no STACK, and this profile output:
total time = 5.50 secs (5500 ticks # 1000 us, 1 processor)
total alloc = 16,800,046,112 bytes (excludes profiling overheads)
However, this:
Pinned3.hs:
main = print $ sum $ snd $ wgoD 1 0
wgoD :: Int -> Int -> (Int, [Int])
wgoD n finalState =
let (ttl', ys') = case n of 100000000 -> (finalState, [])
_ -> wgoD (n+1) finalState
in (ttl' + n, n + (ttl' + n) * 9 : ys')
doesn't complete after 2 mins. It does complete with a value of only 1000000, and I see no PINNED memory and STACK usage (~100M) instead. (I think it is the STACK usage that is making it run much more slowly, somehow).
The main difference I see between Pinned2 and Pinned3 is that Pinned3 includes information from the recursive call in the returned state (the fst of the returned pair: the cumulative sum of subsequent values), but Pinned2 only includes the parameter to wgoC.
So my questions are:
Q1) Where (in the compiler pipeline) does the decision to use PINNED memory happen? -ddump-simpl shows no obvious difference, nor -ddump-cmm (though it's a little complicated so maybe I'm missing something).
Q2) What is the PINNED/STACK decision based on? (The only references I can find to PINNED, such as this, say it's useful for FFI calls, but it seems it's also been adopted for this "optimisation" as well).
Q3) Is there some way to modify Pinned3 so that it does use PINNED?
Q4) (as a last resort) Is there some other tweak to Pinned3 so that there is enough STACK space, and it runs in a reasonable time? (Naïvely, I would expect similar performance to Pinned2).
[Note that I'm only trying to understand the PINNED/STACK mechanism here. I'm sure there are other ways to write Pinned3 so it fuses nicely and needs hardly any memory, but this question is not about that.]
Thanks!

Pinned memory doesn't play a role here.
This program:
main = print $ sum $ tail ([1..100000000] :: [Int])
does not use any pinned memory directly. The pinned memory you're seeing is from the initialization of the runtime system itself. Pinned memory is allocated by GHC's byte array primitives; in user code, you're most likely to see pinned memory usage when using Data.Text or Data.ByteString both of which use byte arrays for their internal implementation. For this program, I'm going to guess that the I/O buffers for stdin and stdout are pinned, but maybe it's something else. Anyway, lists of Ints won't pin anything.
Like (almost) all Haskell programs, Pinned1.hs uses tons of heap and tons of stack (gigabytes of each) but, critically, frees it as quickly as it allocates it (or if you prefer when talking about the stack, "pops" it as quickly as it "pushes" it). The same is true of Pinned2.hs. These programs are functioning correctly.
The problem with Pinned3.hs is not that it uses stack instead of pinned memory, but rather that it uses even more stack than Pinned1 and Pinned2 and fails to pop it as quickly as it pushes it, so stack accumulates.
So, why does Pinned3 accumulate stack?
Generally speaking, stack accumulates in a recursive call if some part of the result of the recursive call is the target of a function application when it returns AND evaluating that part of the result itself requires another recursive call. Consider the program:
eatStack 100000000 = 1
eatStack n = 1 + eatStack (n + 1)
main = print $ eatStack 1
which, compiled and run with:
stack ghc -- -O2 -prof -rtsopts EatStack.hs
./EatStack +RTS -hd -p -xt
stack exec hp2ps -- -e8in -c EatStack.hp
produces the usual sort of pyramid-shaped stack accumulation (with a peak of 1.4G or so). The problem here is that the return value of the recursive eatStack (n+1) call is subject to the function application \x -> 1 + x when it returns, and calculating that result itself requires further recursion. That is, calculating eatStack 0 requires pushing \x -> 1 + x onto the stack before calling eatStack 1 which can only return its result after pushing \x -> 1 + x onto the stack before calling eatStack 2, and so on. The result is stack accumulation.
Notably, constructor applications are handled differently. The following program:
noStack 100000000 = []
noStack n = 1 : noStack (n + 1)
main = print $ last (noStack 1)
which applies the partially applied constructor (:) 1 to the recursive result noStack (n+1) uses no stack. (It appears to use 40k of pinned, but again that's really the runtime system. EatStack uses 40k of pinned, too.) In some cases (not here), constructor application like this can cause heap accumulation, but it doesn't generally accumulate stack.
For your Pinned2 and Pinned3 examples, something similar is going on, though it's obviously a little more complicated. Let's look at Pinned2 first, and consider evaluating wgoC 1 0. Matching the case and substituting in the arguments, the evaluation is equivalent to:
wgoC 1 0 =
let (nxt, ys') = wgoC 2 0
in (1, 1 + nxt * 9 : ys')
When sum . snd demands the first element of the list, namely the thunk 1 + nxt * 9, this forces nxt to be evaluated via the recursive call. Because this return value is subject to a function application (namely \x -> 1 + x * 9), this uses a bit of stack, but the evaluation of the recursive call:
wgoC 2 0 =
let (nxt, ys') = wgoC 3 0
in (2, 2 + nxt * 9 : ys')
immediately yields the required value for the locally bound nxt in the wgoC 1 0 call, namely the first element of the returned tuple fst (wgoC 2 0) = 2, without requiring further recursion. So, we take that value 2, pop off the continuation \x -> 1 + x * 9 and pass the value to the continuation to yield 1 + 2 * 9 = 19. That gives the list's first element, with no net stack usage. The rest of the list, namely the locally bound ys' in the wgoC 1 0 call, is still in a thunk 2 + nxt * 9 : ys' as closed by the wgoC 2 0 call.
When the next element is demanded, we'll need some stack to apply the continuation \x -> 2 + x * 9 to the result nxt in the recursive (nxt, ys') = wgoC 3 0, but this will be evaluated the same way, immediately returning nxt = 3 and a thunk for ys', so the continuation \x -> 2 + x * 9 will be popped off the stack and applied to nxt = 3 without further recursion, yielding 2 + 3 * 9 = 29 and a thunk 3 + nxt * 9 : ys' as closed by the wgoC 3 0 call.
Each element can be forced with no net stack use. We push a continuation and then immediately pop it and apply it to a part of the return value from the recursive call that doesn't itself require further recursion. The result is no net stack accumulation.
Now, consider Pinned3, and the call wgoD 1 0:
wgoD 1 0 =
let (ttl', ys') = wgoD 2 0
in (ttl' + 1, 1 + (ttl' + 1) * 9 : ys')
When sum . snd demands the first element of the list, namely the thunk 1 + (ttl' + 1) * 9, this forces ttl' to be evaluated via the recursive call. Because there's a pending function application \x -> 1 + (ttl' + 1) * 9, this will use a bit of stack. The recursive call:
wgoD 2 0 =
let (ttl', ys') = wgoD 3 0
in (ttl' + 2, 2 + (ttl' + 2) * 9 : ys')
can only provide the required value for the locally bound ttl' in the wgoC 1 0 call by evaluating the first component of the return tuple ttl' + 2, but this requires forcing ttl' via the recursive wgoD 3 0 call. Because ttl' is subject to a function application \x -> x + 2 when it returns, we push a little more stack and proceed to evaluate:
wgoD 3 0 =
let (ttl', ys') = wgoD 4 0
in (ttl' + 3, 3 + (ttl' + 3) * 9 : ys')
To get the required ttl' as locally bound in the wgoD 2 0 call, we need to evaluate the first component of return tuple from wgoD 3 0, namely ttl' + 3. This is a function application \x -> x + 3 which we push on the stack, applied to ttl' returned from the recursive call wgoD 4 0.
So, Pinned3 pushes a sequence of continuations \x -> x + 2, \x -> x + 3, \x -> x + 4', etc. onto the stack, all in an effort to evaluate the first component of the tuple returned by wgoD 2 0, never getting an opportunity to pop anything until it gets to wgoD 100000000 0, and then it finally gets a number finalState = 0 as the first tuple component, if there's enough stack to get that far. Then all that stack will be popped off as the continuations are applied, and we'll have the first element of the list!
Once it gets through that, things aren't so bad. All of the expressions ttl' + n have been evaluated at this point, and they can be reused in calculating the expressions n + (ttl' + n) * 9, so the remaining elements can be generated relatively quickly, though -- since their values must be kept somewhere -- you'll also get accumulating heap usage at roughly the same rate as stack usage.
You can swap out the 100000000 for something like 10000000 (seven zeros), and that runs in a reasonable amount of time and shows accumulation of both stack and heap in a pyramid shape; it peaks at 1.4 gigs or so before dropping back down to zero.
I don't see any really straightforward way of "fixing" this while still keeping the algorithmic structure of Pinned3 intact.

Related

Increase efficiency for following algorithm

Problem:-
input = n
output :-
1 2 3.......n [first row]
2n+1 2n+2 2n+3....3n [second row]
3n+1 3n+2 3n+3...4n [second last row]
n+1 n+2 n+3....2n [last row]
In the problem we have to print a square such that we have 'n' numbers of rows in our square and in every row we have 'n' numbers. We prepare rows from numbers from 1 to square(n) in such way we fill numbers for first row, then last row, second row, second last row and so on.....
for e.g. if n = 4
We start from 1 print upto 4 then print a newline, so our first row is:-
1 2 3 4
Then our last row comes in continuation
5 6 7 8
then our second row will be
9 10 11 12
few examples:
input = 1
output = 1
input = 2
output = 1 2
3 4
input = 3
output = 1 2 3
7 8 9
4 5 6
My Code:
n = int(input().strip())
lines = [i for i in range (1, n + 1)]
line_order1 = []
line_order2 = []
#Reordering lines so we know the staring element of our method
for i in lines:
if(i % 2 == 1):
line_order1.append(i)
else:
line_order2.append(i)
print(line_order1)
print(line_order2)
// Getting the desired order of lines
line_order2.reverse()
line_order1.extend(line_order2)
print(line_order1)
// Now printing the desired square
for l in line_order1:
for i in range (1, n+1):
k = n * (l - 1)
print(k + i, end = " ")
print("\n")
Is there a better way to do this in terms of execution time?
While I see a few minor places you can improve your code, the performance is unlikely to be much better (my suggestions below might not make any performance difference at all). Your code will take O(n**2) time, which is the best you can do, since you need to print out that many numbers to form your square. Even if you combine some of your longer, more verbose steps into more compact versions, they'll can only possibly be better by a constant factor.
My first suggestion is to number the lines from 0 to n-1 instead of from 1 to n. This will save you some effort when you have to calculate how what multiple of n to include in the values for the row. Currently you've got an awkward l - 1 in your calculation that you could skip if you just used zero-indexed numbers for the rows. (Also l is a terrible variable name, since it looks like the digit 1 (one) in some fonts.)
My next suggestion is to simplify your code that builds the order. You don't need three lists, you can do the whole thing with one list that you feed two range objects, each counting up or down by two.
line_order = list(range(0, n, 2)) # count up by twos
line_order.extend(range(n - 1 - n%2, 0, -2)) # count down starting at either n-1 or n-2
Or, if you're willing to use a standard library module, you could import itertools and then use:
line_order = itertools.chain(range(0, n, 2), range(n - 1 - n%2, 0, -2))
The itertools.chain function returns an iterator that yields values from each of its iterable arguments as if they were concatenated together, without making any copies of the data or using significant extra memory. The difference is not likely to be a much here (since the maximum n you can usefully print out is fairly small), but if you were doing something different with the result of this algorithm and n was in the billions it would be very nice to avoid filling a list with that many values.
My last suggestion is to use a range again to generate all the numbers in each row directly, rather than explicitly looping from 1 to n and adding k each time.
for row_num in line_order:
print(*range(n * row_num + 1, n * (rownum + 1) + 1))
You can compute the start and end points with the multiples of n already included, rather than needing to do that in a separate step for each one. You certainly didn't need to be recomputing k as often as you were before. You can pass all the values from the range to print in one go using iterable unpacking syntax (*args).
Note though that unpacking the range that way is sort of the reverse of the previous suggestion regarding itertools.chain. If n is large, using a loop over the range would be more memory efficient, since you won't need all n values to exist in memory at a the same time. Here's what that would look like:
for line_num in line_order:
for value in range(n * row_num + 1, n * (rownum + 1) + 1):
print(value, end=" ")
print()

Iterated Logarithm in Prolog

log2(I,E):-
(
number(I)
-> E is log(I)/log(2);
number(E)
-> I is 2**E
).
lgstar(N,A):-
(N>1
->
(
log2(N,Nprev),
lgstar(Nprev,Aprev),
Aprev is A-1
);
A is 0
).
Log * is the number of times a log must be applied to a value until it is less than or equal to 1.
For Example:
log(log(log(log(3000)))) = 0.86364760439
so the log * (3000) = 4
From my understanding of the way recursion in prolog works when I get to the base case of N<1 the A should be returned and on the next level of the stack the Aprev should be inferred to be A +1 or Aprev is 1 and so on until it reaches the top of the stack where A is returned.
Query:
lgstar(3000,A)
--> Should be 4
When I reach the case that N<1 then I try to return 0 to the previous layer on the stack instead I get an arguments are not sufficiently instantiated error.

finding primes very slow in F#

I have answered Project Euler Question 7 very easily using Sieve of Eratosthenes in C and I had no problem with it.
I am still quite new to F# so I tried implementing the same technique
let prime_at pos =
let rec loop f l =
match f with
| x::xs -> loop xs (l |> List.filter(fun i -> i % x <> 0 || i = x))
| _ -> l
List.nth (loop [2..pos] [2..pos*pos]) (pos-1)
which works well when pos < 1000, but will crash at 10000 with out of memory exception
I then tried changing the algorithm to
let isPrime n = n > 1 && seq { for f in [2..n/2] do yield f } |> Seq.forall(fun i -> n % i <> 0)
seq {for i in 2..(10000 * 10000) do if isPrime i then yield i} |> Seq.nth 10000 |> Dump
which runs successfully but still takes a few minutes.
If I understand correctly the first algorithm is tail optimized so why does it crash? And how can I write an algorithm that runs under 1 minute (I have a fast computer)?
Looking at your first attempt
let prime_at pos =
let rec loop f l =
match f with
| x::xs -> loop xs (l |> List.filter(fun i -> i % x <> 0 || i = x))
| _ -> l
List.nth (loop [2..pos] [2..pos*pos]) (pos-1)
At each loop iteration, you are iterating over and creating a new list. This is very slow as list creation is very slow and you don't see any benefits from the cache. Several obvious optimisations such as the factor list skipping the even numbers, are skipped. When pos=10 000 you are trying to create a list which will occupy 10 000 * 10 000 * 4 = 400MB of just integers and a further 800MB of pointers (F# lists are linked lists). Futhermore, as each list element takes up a very small amount of memory there will probably be significant overhead for things like GC overhead. In the function you then create a new list of smiliar size. As a result, I am not surprised that this causes OutOfMemoryException.
Looking at the second example,
let isPrime n =
n > 1 &&
seq { for f in [2..n/2] do yield f }
|> Seq.forall(fun i -> n % i <> 0)
Here, the problem is pretty similar as you are generating giant lists for each element you are testing.
I have written a quite fast F# sieve here https://stackoverflow.com/a/12014908/124259 which shows how to do this faster.
As already mentioned by John, your implementation is slow because it generates some temporary data structures.
In the first case, you are building a list, which needs to be fully created in memory and that introduces significant overhead.
In the second case, you are building a lazy sequence, which does not consume memory (because it is build while it is being iterated), but it still introduces indirection that slows the algorithm down.
In most cases in F#, people tend to prefer readability and so using sequences is a nice way to write the code, but here you probably care more about performance, so I'd avoid sequences. If you want to keep the same structure of your code, you can rewrite isPrime like this:
let isPrime n =
let rec nonDivisible by =
if by = 1 then true // Return 'true' if we reached the end
elif n%by = 0 then false // Return 'false' if there is a divisor
else nonDivisible (by - 1) // Otherwise continue looping
n > 1 && nonDivisible (n/2)
This just replaces the sequence and forall with a recursive function nonDivisible that returns true when the number n is not divisible by any number between 2 and n/2. The function first checks the two termination cases and otherwise performs a recursive call..
With the original implementation, I'm able to find 1000th prime in 1.5sec and with the new one, it takes 22ms. Finding 10000th prime with the new implementation takes 3.2sec on my machine.

What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

In Haskell, I've found three simple implementations of the Sieve of Eratosthenes on the Rosetta Code page.
Now my question is, which one should be used in which situations?
Correcting my initial reasoning would be helpful too:
I'm assuming the List one is the most idiomatic and easy to read for a Haskeller. Is it correct, though? I'm wondering if it suffers from the same problems as another list-based sieve that I then learned was not actually implementing the algorithm:
(edit: shown here is the list-based sieve I know has problems, not the one from RosettaCode, which I pasted at the bottom)
primes = sieve [2..] where
sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
In terms of performance, the immutable Array seems to be the winner. With an upper bound m of 2000000, the times were about:
1.3s for List
0.6s for Array
1.8s for Mutable Array
So I'd pick Array for performance.
And of course, the Mutable Array one is also easy to reason about since I have a more imperative language background. I'm not sure why I would pick this one if I'm coding in Haskell, though, since it's both slower than the others and non-idiomatic.
Code copied here for reference:
List:
primesTo m = 2 : eratos [3,5..m] where
eratos (p : xs) | p*p>m = p : xs
| True = p : eratos (xs `minus` [p*p, p*p+2*p..])
minus a#(x:xs) b#(y:ys) = case compare x y of
LT -> x : minus xs b
EQ -> minus xs ys
GT -> minus a ys
minus a b = a
Immutable Array:
import Data.Array.Unboxed
primesToA m = sieve 3 (array (3,m) [(i,odd i) | i<-[3..m]] :: UArray Int Bool)
where
sieve p a
| p*p > m = 2 : [i | (i,True) <- assocs a]
| a!p = sieve (p+2) $ a//[(i,False) | i <- [p*p, p*p+2*p..m]]
| otherwise = sieve (p+2) a
Mutable Array:
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
primeSieve :: Integer -> UArray Integer Bool
primeSieve top = runSTUArray $ do
a <- newArray (2,top) True -- :: ST s (STUArray s Integer Bool)
let r = ceiling . sqrt $ fromInteger top
forM_ [2..r] $ \i -> do
ai <- readArray a i
when ai $ do
forM_ [i*i,i*i+i..top] $ \j -> do
writeArray a j False
return a
-- Return primes from sieve as list:
primesTo :: Integer -> [Integer]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]
EDIT
I showed Turner's Sieve at the top but that's not the list-based example I'm comparing with the other two. I just wanted to know if the list-based example suffers from the same "not the correct Sieve of Eratosthenes" problems as Turner's.
It appears the performance comparison is unfair because the Mutable Array example goes through evens as well and uses Integer rather than Int...
This
primes = sieve [2..] where
sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
is not a sieve. It's very inefficient trial division. Don't use that!
I'm curious about how you got your times, there is no way that the Turner "sieve" could produce the primes not exceeding 2,000,000 in mere seconds. Letting it find the primes to 200,000 took
MUT time 6.38s ( 6.39s elapsed)
GC time 9.19s ( 9.20s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 15.57s ( 15.59s elapsed)
on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N), almost quadratic. Letting it proceed to 2,000,000 would take about twenty minutes.
Your times for the array versions are suspicious too, though in the other direction. Did you measure interpreted code?
Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.
Now, that still has the mutable array about three times slower than the immutable array.
But, it's an unfair comparison. For the immutable array, you used Ints, and for the mutable array Integers. Changing the mutable array code to use Ints - as it should, since under the hood, arrays are Int-indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - made the mutable array code run in 0.15 seconds. Close to the mutable array code, but not quite there. However, you let the mutable array do more work, since in the immutable array code you only eliminate odd multiples of the odd primes, but in the mutable array code, you mark all multiples of all primes. Changing the mutable array code to treat 2 specially, and only eliminate odd multiples of odd primes brings that down to 0.12 seconds too.
But, you're using range-checked array indexing, which is slow, and, since the validity of the indices is checked in the code itself, unnecessary. Changing that to using unsafeRead and unsafeWrite brings down the time for the immutable array to 0.09 seconds.
Then you have the problem that using
forM_ [x, y .. z]
uses boxed Ints (fortunately, GHC eliminates the list). Writing a loop yourself, so that only unboxed Int#s are used, the time goes down to 0.02 seconds.
{-# LANGUAGE MonoLocalBinds #-}
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primeSieve :: Int -> UArray Int Bool
primeSieve top = runSTUArray $ do
a <- newArray (0,top) True
unsafeWrite a 0 False
unsafeWrite a 1 False
let r = ceiling . sqrt $ fromIntegral top
mark step idx
| top < idx = return ()
| otherwise = do
unsafeWrite a idx False
mark step (idx+step)
sift p
| r < p = return a
| otherwise = do
prim <- unsafeRead a p
when prim $ mark (2*p) (p*p)
sift (p+2)
mark 2 4
sift 3
-- Return primes from sieve as list:
primesTo :: Int -> [Int]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]
main :: IO ()
main = print .last $ primesTo 2000000
So, wrapping up, for a Sieve of Eratosthenes, you should use an array - not surprising, its efficiency depends on being able to step from one multiple to the next in short constant time.
You get very simple and straightforward code with immutable arrays, and that code performs decently for not too high limits (it gets relatively worse for higher limits, since the arrays are still copied and garbage-collected, but that's not too bad).
When you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial, one has to know how the compiler translates the various constructs to choose the right one, and some would consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation rather than writing it yourself.
Mutable array will always be the winner in terms of performance (and you really should've copied the version that works on odds only as a minimum; it should be the fastest of the three - also because it uses Int and not Integer).
For lists, tree-shaped merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it conveys the true nature of the sieve most clearly:
import Data.List (unfoldr)
primes :: [Int]
primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..]))
_Y g = g (_Y g) -- recursion
_U ((x:xs):t) = (x :) . union xs . _U -- ~= nub . sort . concat
. unfoldr (\(a:b:c) -> Just (union a b, c)) $ t
gaps k s#(x:xs) | k < x = k : gaps (k+2) s -- ~= [k,k+2..]\\s, when
| otherwise = gaps (k+2) xs -- k<=x && null(s\\[k,k+2..])
union a#(x:xs) b#(y:ys) = case compare x y of -- ~= nub . sort .: (++)
LT -> x : union xs b
EQ -> x : union xs ys
GT -> y : union a ys
_U reimplements Data.List.Ordered.unionAll, and gaps 5 is (minus [5,7..]), fused for efficiency, with minus and union from the same package.
Of course nothing beats the brevity of Data.List.nubBy (((>1).).gcd) [2..] (but it's very slow).
To your 1st new question: not. It does find the multiples by counting up, as any true sieve should (although "minus" on lists is of course under-performant; the above improves on that by re-arranging a linear subtraction chain ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+...))).
As has been said, using mutable arrays has the best performance. The following code is derived from this "TemplateHaskell" version converted back to something more in line with the straight mutable Array solution as the "TemplateHaskell" doesn't seem to make any difference, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to the further optimizations and especially due to the use of the "unsafeRead" and "unsafeWrite" functions that avoid range checking of the arrays, probably also internally using pointers for array access:
{-# OPTIONS -O2 -optc-O3 #-}
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primesToUA :: Word32-> [Word32]
primesToUA top = do
let sieveUA top = runSTUArray $ do
let m = ((fromIntegral top) - 3) `div` 2 :: Int
buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool)
let cullUA i = do
let p = i + i + 3
strt = p * (i + 1) + i
let cull j = do
if j > m then cullUA (i + 1)
else do
unsafeWrite buf j False
cull (j + p)
if strt > m then return ()
else do
e <- unsafeRead buf i
if e then cull strt else cullUA (i + 1)
cullUA 0; return buf
if top > 1 then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top]
else []
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
print (last (primesToUA x)) -- 0.01 0.02 0.09 1.26 seconds
EDIT: The above code has been corrected and the times below edited to reflect the correction and as well the comments comparing the paged to the non-paged version have been edited.
The times to run this to the indicated top ranges are as shown in the comment table at the bottom of the above source code as measured by ideone.com and are about exactly five times faster than the answer posted by #WillNess as also measured at ideone.com. This code takes a trivial amount of time to cull the primes to two million and only 1.26 seconds to cull to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz) at 0.44 seconds to a hundred million, and takes 6.81 seconds to run to one billion. The memory use is just over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit packed) array. This array also explains the non-linear performance in that as the array size exceeds the CPU cache sizes the average memory access times get worse per composite number representation cull.
EDIT_ADD: A page segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 CPU caches, and as well has the advantage that it is unbounded so that the upper range does not have to be specified in advance and a much smaller memory footprint being just the base primes less than the square root of the range used plus the page buffer memory. The following code has been written as a page segmented implementation and is somewhat faster than the non-paged version; it also offers the advantage that one can change the output range specification at the top of the code to 'Word64' from 'Word32' so it is then not limited to the 32-bit number range, at only a slight cost in processing time (for 32-bit compiled code) for any range that is in common. The code is as follows:
-- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array
{-# OPTIONS -O2 -optc-O3 #-}
import Data.Word
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primesUA :: () -> [Word32]
primesUA () = do
let pgSZBTS = 262144 * 8
let sieveUA low bps = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- newArray (0,pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool)
let cullUAbase i = do
let p = i + i + 3
strt = p * (i + 1) + i
when (strt < pgSZBTS) $ do
e <- unsafeRead buf i
if e then do
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + p)
else cullUAbase (i + 1)
loop strt
else cullUAbase (i + 1)
let cullUA ~(p:t) = do
let bp = (fromIntegral p)
i = (bp - 3) `div` 2
s = bp * (i + 1) + i
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + (fromIntegral p))
else cullUA t
loop strt
if low <= 0 then cullUAbase 0 else cullUA bps
return buf
let sieveList low bps = do
[2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps]
let sieve low bps = do
(sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps
let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32]
2 : sieve 0 primes'
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
-- 0.02 0.03 0.13 1.13 seconds
print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))
The above code has quite a few more lines of code than the non-paged case due to the need to cull the composite number representations from the first page array differently than succeeding pages. The code also has the fixes so that there aren't memory leaks due to the base primes list and the output list now not being the same list (thus avoiding holding onto the whole list in memory).
Note that this code takes close to linear time (over the range sieved) as the range gets larger due to the culling buffer being of a constant size less than the L2 CPU cache. Memory use is a fraction of that used by the non-paged version at just under 600 kilobytes for a hundred million and just over 600 kilobytes for one billion, which slight increase is just the extra space required for the base primes less than the square root of the range list.
On ideone.com this code produces the number of primes to a hundred million in about 1.13 seconds and about 12 seconds to one billion (32-bit setting). Probably wheel factorization and definitely multi-core processing would make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to one billion, with the roughly linear performance with increasing range as expected. It seems that there is some sort of non-linear overhead in the version of GHC run on ideone.com that has some performance penalty for larger ranges that is not present for the i7 and that is perhaps related to different garbage collection, as the page buffers are being created new for each new page. END_EDIT_ADD
EDIT_ADD2: It seems that much of the processing time for the above page segmented code is used in (lazy) list processing, so the code is accordingly reformulated with several improvements as follows:
Implemented a prime counting function that does not use list processing and uses "popCount" table look ups to count the number of 'one' bits in a 32 bit word at a time. In this way, the time to find the results is insignificant compared to the actual sieve culling time.
Stored the base primes as a list of bit packed page segments, which is much more space efficient than storing a list of primes, and the the time to convert the page segments to primes as required is not much of a computational overhead.
Tuned the prime segment make function so that for the initial zero page segment, it uses its own bit pattern as a source page, thus making the composite number cull code shorter and simpler.
The code then becomes as follows:
{-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM
import Data.Word
import Data.Bits
import Control.Monad
import Control.Monad.ST
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed
import Data.Array.Base
pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache
type PrimeType = Word32
type Chunk = UArray PrimeType Bool
-- makes a new page chunk and culls it
-- if the base primes list provided is empty then
-- it uses the current array as source (for zero page base primes)
mkChnk :: Word32 -> [Chunk] -> Chunk
mkChnk low bschnks = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
let cull ~(p:ps) =
let bp = (fromIntegral p)
i = (bp - 3) `shiftR` 1
s = bp * (i + 1) + i in
let cullp j = do
if j >= pgSZBTS then cull ps
else do
unsafeWrite buf j False
cullp (j + (fromIntegral p)) in
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
cullp strt
case bschnks of
[] -> do bsbf <- unsafeFreezeSTUArray buf
cull (listChnkPrms [bsbf])
_ -> cull $ listChnkPrms bschnks
return buf
-- creates a page chunk list starting at the lw value
chnksList :: Word32 -> [Chunk]
chnksList lw =
mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)
-- converts a page chunk list to a list of primes
listChnkPrms :: [Chunk] -> [PrimeType]
listChnkPrms [] = []
listChnkPrms ~(hdchnk#(UArray lw _ rng _):tlchnks) =
let nxtp i =
if i >= rng then [] else
if unsafeAt hdchnk i then
(case ((lw + fromIntegral i) `shiftL` 1) + 3 of
np -> np) : nxtp (i + 1)
else nxtp (i + 1) in
(hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks
-- the base page chunk list used to cull the higher order pages,
-- note that it has special treatment for the zero page.
-- It is more space efficient to store this as chunks rather than
-- as a list of primes or even a list of deltas (gaps), with the
-- extra processing to convert as needed not too much.
basePrmChnks :: [Chunk]
basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)
-- the full list of primes could be accessed with the following function.
primes :: () -> [PrimeType]
primes () = 2 : (listChnkPrms $ chnksList 0)
-- a quite fast prime counting up to the given limit using
-- chunk processing to avoid innermost list processing loops.
countPrimesTo :: PrimeType -> Int
countPrimesTo limit =
let lmtb = (limit - 3) `div` 2 in
let sumChnks acc chnks#(chnk#(UArray lo hi rng _):chnks') =
let cnt :: UArray PrimeType Word32 -> Int
cnt bfw =
case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
crng -> case crng `shiftR` 5 of
rngw ->
let cnt' i ac =
ac `seq` if i >= rngw then
if (i `shiftL` 5) >= rng then ac else
case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
msk -> msk `seq`
case (unsafeAt bfw rngw) .&.
(complement msk) of
bts -> bts `seq` case popCount bts of
c -> c `seq` case ac + c of nac -> nac
else case ac + (popCount $ unsafeAt bfw i) of
nacc -> nacc `seq` cnt' (i + 1) (nacc)
in cnt' 0 0 in
acc `seq` case runST $ do -- make UArray _ Bool into a UArray _ Word32
stbuf <- unsafeThawSTUArray chnk
stbufw <- castSTUArray stbuf
bufw <- unsafeFreezeSTUArray stbufw
return $ cnt bufw of
c -> c `seq` case acc + c of
nacc -> nacc `seq` if hi >= lmtb then nacc
else sumChnks nacc chnks' in
if limit < 2 then 0 else if limit < 3 then 1 else
lmtb `seq` sumChnks 1 (chnksList 0)
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln 1000mln
-- 0.02 0.03 0.06 0.45 4.60 seconds
-- 7328 7328 8352 8352 9424 Kilobytes
-- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
-- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
-- The above ratio of about 1.6 is much better than usual due to
-- the extremely low memory use of the page segmented algorithm.
-- It seems thaat the Windows Native Code Generator (NCG) from GHC
-- is particularly poor, as the Linux 32-bit version takes
-- less than two thirds of the time for exactly the same program...
print $ countPrimesTo x
-- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this
The times and memory requirements given in the code are as observed when run on ideone.com, with 0.02, 0.03, 0.05, 0.30, 3.0, and 9.1 seconds required to run on my i7-2700K (3.5 GHz) for one, two, ten, a hundred, a thousand (one billion), and three thousand (three billion) million range, respectively, with a pretty much constant memory footprint slowly increasing with the number of base primes less than the square root of the range as required. When compiled with the LLVM compiler back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35, and 4.15 seconds, respectively, due to its more efficient use of registers and machine instructions; this last is quite close to the same speed as if compiled with a 64-bit compiler rather than the 32-bit compiler used as the efficient use of registers means that availability of extra registers doesn't make much difference.
As commented in the code, the ratio between performance on my real machine and the ideone.com servers becomes much less than for much more memory wasteful algorithms due to not being throttled by memory access bottlenecks so the limit to speed is mostly just the ratio of CPU clock speeds as well as CPU processing efficiency per clock cycle. However, as commented there, there is some strange inefficiency with the GHC Native Code Generator (NCG) when run under Windows (32-bit compiler) in that the run times are over 50% slower than if run under Linux (as the ideone.com server uses). AFAIK they both have a common code base for the same Haskell GHC version and the only divergence is in the linker used (which is also used with the LLVM backend, which is not affected) as GHC NCG does not use GCC but only the mingw32 assembler, which should also be the same.
Note that this code when compiled with the LLVM compiler back end is about the same speed as the same algorithm written for highly optimized 'C/C++' implementations indicating that Haskell really has the ability to develop very tight loop coding. It might be said that the Haskell code is quite a bit more readable and secure than equivalent 'C/C++' code once one gets used to the Haskell paradigms of monadic and non-strict code. The further refinements in execution speed for the Sieve of Eratosthenes are purely a function of the tuning of the implementations used and not the choice of language between Haskell and 'C/C++'.
Summary: Of course, this isn't yet the ultimate in speed for a Haskell version of the Sieve of Eratosthenes in that we still haven't tuned the memory access to more efficiently use the fast CPU L1 cache, nor have we significantly reduced the total number of composite culling operations necessary using extreme wheel factorization other than to eliminate the odds processing. However, this is enough to answer the question in showing that mutable arrays are the most efficient way of addressing such tight loop type of problems, with potential speed gains of about 100 times over the use of lists or immutable arrays. END_EDIT_ADD2

Slow tail recursion in F#

I have an F# function that returns a list of numbers starting from 0 in the pattern of skip n, choose n, skip n, choose n... up to a limit. For example, this function for input 2 will return [2, 3, 6, 7, 10, 11...].
Initially I implemented this as a non-tail-recursive function as below:
let rec indicesForStep start blockSize maxSize =
match start with
| i when i > maxSize -> []
| _ -> [for j in start .. ((min (start + blockSize) maxSize) - 1) -> j] # indicesForStep (start + 2 * blockSize) blockSize maxSize
Thinking that tail recursion is desirable, I reimplemented it using an accumulator list as follows:
let indicesForStepTail start blockSize maxSize =
let rec indicesForStepInternal istart accumList =
match istart with
| i when i > maxSize -> accumList
| _ -> indicesForStepInternal (istart + 2 * blockSize) (accumList # [for j in istart .. ((min (istart + blockSize) maxSize) - 1) -> j])
indicesForStepInternal start []
However, when I run this in fsi under Mono with the parameters 1, 1 and 20,000 (i.e. should return [1, 3, 5, 7...] up to 20,000), the tail-recursive version is significantly slower than the first version (12 seconds compared to sub-second).
Why is the tail-recursive version slower? Is it because of the list concatenation? Is it a compiler optimisation? Have I actually implemented it tail-recursively?
I also feel as if I should be using higher-order functions to do this, but I'm not sure exactly how to go about doing it.
As dave points out, the problem is that you're using the # operator to append lists. This is more significant performance issue than tail-recursion. In fact, tail-recursion doesn't really speed-up the program too much (but it makes it work on large inputs where the stack would overflow).
The reason why you'r second version is slower is that you're appending shorter list (the one generated using [...]) to a longer list (accumList). This is slower than appending longer list to a shorter list (because the operation needs to copy the first list).
You can fix it by collecting the elements in the accumulator in a reversed order and then reversing it before returning the result:
let indicesForStepTail start blockSize maxSize =
let rec indicesForStepInternal istart accumList =
match istart with
| i when i > maxSize -> accumList |> List.rev
| _ ->
let acc =
[for j in ((min (istart + blockSize) maxSize) - 1) .. -1 .. istart -> j]
# accumList
indicesForStepInternal (istart + 2 * blockSize) acc
indicesForStepInternal start []
As you can see, this has the shorter list (generated using [...]) as the first argument to # and on my machine, it has similar performance to the non-tail-recursive version. Note that the [ ... ] comprehension generates elements in the reversed order - so that they can be reversed back at the end.
You can also write the whole thing more nicely using the F# seq { .. } syntax. You can avoid using the # operator completely, because it allows you to yield individual elemetns using yield and perform tail-recursive calls using yield!:
let rec indicesForStepSeq start blockSize maxSize = seq {
match start with
| i when i > maxSize -> ()
| _ ->
for j in start .. ((min (start + blockSize) maxSize) - 1) do
yield j
yield! indicesForStepSeq (start + 2 * blockSize) blockSize maxSize }
This is how I'd write it. When calling it, you just need to add Seq.toList to evaluate the whole lazy sequence. The performance of this version is similar to the first one.
EDIT With the correction from Daniel, the Seq version is actually slightly faster!
In F# the list type is implemented as a singly linked list. Because of this you get different performance for x # y and y # x if x and y are of different length. That's why your seeing a difference in performance. (x # y) has running time of X.length.
// e.g.
let x = [1;2;3;4]
let y = [5]
If you did x # y then x (4 elements) would be copied into a new list and its internal next pointer would be set to the existing y list. If you did y # x then y (1 element) would be copied into a new list and its next pointer would be set to the existing list x.
I wouldn't use a higher order function to do this. I'd use list comprehension instead.
let indicesForStepTail start blockSize maxSize =
[
for block in start .. (blockSize * 2) .. (maxSize - 1) do
for i in block .. (block + blockSize - 1) do
yield i
]
This looks like the list append is the problem. Append is basically an O(N) operation on the size of the first argument. By accumulating on the left, this operation takes O(N^2) time.
The way this is typically done in functional code seems to be to accumulate the list in reverse order (by accumulating on the right), then at the end, return the reverse of the list.
The first version you have avoids the append problem, but as you point out, is not tail recursive.
In F#, probably the easiest way to solve this problem is with sequences. It is not very functional looking, but you can easily create an infinite sequence following your pattern, and use Seq.take to get the items you are interested in.

Resources