Haskell explicit recursion vs `iterate` - performance

While writing a function using iterate in Haskell, I found that an equivalent version with explicit recursion seemed noticeably faster - even though I believed that explicit recursion ought to be frowned upon in Haskell.
Similarly, I expected GHC to be able to inline/optimise list combinators appropriately so that the resulting machine code is at least similarly performing to the explicit recursion.
Here's a (different) example, which also displays the slowdown I observed.
steps m n and its variant steps' compute the number of Collatz steps n takes to reach 1, giving up after m attempts.
steps uses explicit recursion while steps' uses list functions.
import Data.List (elemIndex)
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
collatz :: Int -> Int
collatz n
| even n = n `quot` 2
| otherwise = 3 * n + 1
steps :: Int -> Int -> Maybe Int
steps m = go 0
where go k n
| n == 1 = Just k
| k == m = Nothing
| otherwise = go (k+1) (collatz n)
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps 800) $ [1..10^7]
I tested these by evaluating for all values up to 10^7, each giving up after 800 steps. On my machine (compiled with ghc -O2), explicit recursion took just under 4 seconds (3.899s) but list combinators took about 5 times longer (19.922s).
Why is explicit recursion so much better in this case, and is there a way of writing this without explicit recursion while preserving performance?

Updated: I submitted Trac 15426 for this bug.
The problem disappears if you copy the definitions of elemIndex and findIndex into your module:
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
import Data.Maybe (listToMaybe)
import Data.List (findIndices)
elemIndex :: Eq a => a -> [a] -> Maybe Int
elemIndex x = findIndex (x==)
findIndex :: (a -> Bool) -> [a] -> Maybe Int
findIndex p = listToMaybe . findIndices p
collatz :: Int -> Int
collatz n
| even n = n `quot` 2
| otherwise = 3 * n + 1
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps' 800) $ [1..10^7]
The problem seems to be that these must be inlinable for GHC to get the fusion right. Unfortunately, neither of them is marked inlinable in Data.OldList.
The change to allow findIndex to participate in fusion is relatively recent (see Trac 14387) where listToMaybe was reimplemented as a foldr. So, it probably hasn't seen a lot of testing yet.

Related

How to optimize this Haskell code summing up the primes in sublinear time?

Problem 10 from Project Euler is to find the sum of all the primes below given n.
I solved it simply by summing up the primes generated by the sieve of Eratosthenes. Then I came across much more efficient solution by Lucy_Hedgehog (sub-linear!).
For n = 2⋅10^9:
Python code (from the quote above) runs in 1.2 seconds in Python 2.7.3.
C++ code (mine) runs in about 0.3 seconds (compiled with g++ 4.8.4).
I re-implemented the same algorithm in Haskell, since I'm learning it:
import Data.List
import Data.Map (Map, (!))
import qualified Data.Map as Map
problem10 :: Integer -> Integer
problem10 n = (sieve (Map.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
where vs = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
r = floor (sqrt (fromIntegral n))
sieve :: Map Integer Integer -> Integer -> Integer -> [Integer] -> Map Integer Integer
sieve m p r vs | p > r = m
| otherwise = sieve (if m ! p > m ! (p - 1) then update m vs p else m) (p + 1) r vs
update :: Map Integer Integer -> [Integer] -> Integer -> Map Integer Integer
update m vs p = foldl' decrease m (map (\v -> (v, sumOfSieved m v p)) (takeWhile (>= p*p) vs))
decrease :: Map Integer Integer -> (Integer, Integer) -> Map Integer Integer
decrease m (k, v) = Map.insertWith (flip (-)) k v m
sumOfSieved :: Map Integer Integer -> Integer -> Integer -> Integer
sumOfSieved m v p = p * (m ! (v `div` p) - m ! (p - 1))
main = print $ problem10 $ 2*10^9
I compiled it with ghc -O2 10.hs and run with time ./10.
It gives the correct answer, but takes about 7 seconds.
I compiled it with ghc -prof -fprof-auto -rtsopts 10 and run with ./10 +RTS -p -h.
10.prof shows that decrease takes 52.2% time and 67.5% allocations.
After running hp2ps 10.hp I got such heap profile:
Again looks like decrease takes most of the heap. GHC version 7.6.3.
How would you optimize run time of this Haskell code?
Update 13.06.17:
I tried replacing immutable Data.Map with mutable Data.HashTable.IO.BasicHashTable from the hashtables package, but I'm probably doing something bad, since for tiny n = 30 it already takes too long, about 10 seconds. What's wrong?
Update 18.06.17:
Curious about the HashTable performance issues is a good read. I took Sherh's code using mutable Data.HashTable.ST.Linear, but dropped Data.Judy in instead. It runs in 1.1 seconds, still relatively slow.
I've done some small improvements so it runs in 3.4-3.5 seconds on my machine.
Using IntMap.Strict helped a lot. Other than that I just manually performed some ghc optimizations just to be sure. And make Haskell code more close to Python code from your link. As a next step you could try to use some mutable HashMap. But I'm not sure... IntMap can't be much faster than some mutable container because it's an immutable one. Though I'm still surprised about it's efficiency. I hope this can be implemented faster.
Here is the code:
import Data.List (foldl')
import Data.IntMap.Strict (IntMap, (!))
import qualified Data.IntMap.Strict as IntMap
p :: Int -> Int
p n = (sieve (IntMap.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
where vs = [n `div` i | i <- [1..r]] ++ [n', n' - 1 .. 1]
r = floor (sqrt (fromIntegral n) :: Double)
n' = n `div` r - 1
sieve :: IntMap Int -> Int -> Int -> [Int] -> IntMap Int
sieve m' p' r vs = go m' p'
where
go m p | p > r = m
| m ! p > m ! (p - 1) = go (update m vs p) (p + 1)
| otherwise = go m (p + 1)
update :: IntMap Int -> [Int] -> Int -> IntMap Int
update s vs p = foldl' decrease s (takeWhile (>= p2) vs)
where
sp = s ! (p - 1)
p2 = p * p
sumOfSieved v = p * (s ! (v `div` p) - sp)
decrease m v = IntMap.adjust (subtract $ sumOfSieved v) v m
main :: IO ()
main = print $ p $ 2*10^(9 :: Int)
UPDATE:
Using mutable hashtables I've managed to make performance up to ~5.5sec on Haskell with this implementation.
Also, I used unboxed vectors instead of lists in several places. Linear hashing seems to be the fastest. I think this can be done even faster. I noticed sse42 option in hasthables package. Not sure I've managed to set it correctly but even without it runs that fast.
UPDATE 2 (19.06.2017)
I've managed to make it 3x faster then best solution from #Krom (using my code + his map) by dropping judy hashmap at all. Instead just plain arrays are used. You can come up with the same idea if you notice that keys for S hashmap are either sequence from 1 to n' or n div i for i from 1 to r. So we can represent such HashMap as two arrays making lookups in array depending on searching key.
My code + Judy HashMap
$ time ./judy
95673602693282040
real 0m0.590s
user 0m0.588s
sys 0m0.000s
My code + my sparse map
$ time ./sparse
95673602693282040
real 0m0.203s
user 0m0.196s
sys 0m0.004s
This can be done even faster if instead of IOUArray already generated vectors and Vector library is used and readArray is replaced by unsafeRead. But I don't think this should be done if only you're not really interested in optimizing this as much as possible.
Comparison with this solution is cheating and is not fair. I expect same ideas implemented in Python and C++ will be even faster. But #Krom solution with closed hashmap is already cheating because it uses custom data structure instead of standard one. At least you can see that standard and most popular hash maps in Haskell are not that fast. Using better algorithms and better ad-hoc data structures can be better for such problems.
Here's resulting code.
First as a baseline, the timings of the existing approaches
on my machine:
Original program posted in the question:
time stack exec primorig
95673602693282040
real 0m4.601s
user 0m4.387s
sys 0m0.251s
Second the version using Data.IntMap.Strict from
here
time stack exec primIntMapStrict
95673602693282040
real 0m2.775s
user 0m2.753s
sys 0m0.052s
Shershs code with Data.Judy dropped in here
time stack exec prim-hash2
95673602693282040
real 0m0.945s
user 0m0.955s
sys 0m0.028s
Your python solution.
I compiled it with
python -O -m py_compile problem10.py
and the timing:
time python __pycache__/problem10.cpython-36.opt-1.pyc
95673602693282040
real 0m1.163s
user 0m1.160s
sys 0m0.003s
Your C++ version:
$ g++ -O2 --std=c++11 p10.cpp -o p10
$ time ./p10
sum(2000000000) = 95673602693282040
real 0m0.314s
user 0m0.310s
sys 0m0.003s
I didn't bother to provide a baseline for slow.hs, as I didn't
want to wait for it to complete when run with an argument of
2*10^9.
Subsecond performance
The following program runs in under a second on my machine.
It uses a hand rolled hashmap, which uses closed hashing with
linear probing and uses some variant of knuths hashfunction,
see here.
Certainly it is somewhat tailored to the case, as the lookup
function for example expects the searched keys to be present.
Timings:
time stack exec prim
95673602693282040
real 0m0.725s
user 0m0.714s
sys 0m0.047s
First I implemented my hand rolled hashmap simply to hash
the keys with
key `mod` size
and selected a size multiple times higher than the expected
input, but the program took 22s or more to complete.
Finally it was a matter of choosing a hash function which was
good for the workload.
Here is the program:
import Data.Maybe
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead)
type Number = Int
data Map = Map { keys :: IOUArray Int Number
, values :: IOUArray Int Number
, size :: !Int
, factor :: !Int
}
newMap :: Int -> Int -> IO Map
newMap s f = do
k <- newArray (0, s-1) 0
v <- newArray (0, s-1) 0
return $ Map k v s f
storeKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
storeKey arr s f key = go ((key * f) `mod` s)
where
go :: Int -> IO Int
go ind = do
v <- readArray arr ind
go2 v ind
go2 v ind
| v == 0 = do { writeArray arr ind key; return ind; }
| v == key = return ind
| otherwise = go ((ind + 1) `mod` s)
loadKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
loadKey arr s f key = s `seq` key `seq` go ((key *f) `mod` s)
where
go :: Int -> IO Int
go ix = do
v <- unsafeRead arr ix
if v == key then return ix else go ((ix + 1) `mod` s)
insertIntoMap :: Map -> (Number, Number) -> IO Map
insertIntoMap m#(Map ks vs s f) (k, v) = do
ix <- storeKey ks s f k
writeArray vs ix v
return m
fromList :: Int -> Int -> [(Number, Number)] -> IO Map
fromList s f xs = do
m <- newMap s f
foldM insertIntoMap m xs
(!) :: Map -> Number -> IO Number
(!) (Map ks vs s f) k = do
ix <- loadKey ks s f k
readArray vs ix
mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate (Map ks vs s fac) i f = do
ix <- loadKey ks s fac i
old <- readArray vs ix
let x' = f old
x' `seq` writeArray vs ix x'
r' :: Number -> Number
r' = floor . sqrt . fromIntegral
vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
vss' n r = r + n `div` r -1
list' :: Int -> Int -> [Number] -> IO Map
list' s f vs = fromList s f [(i, i * (i + 1) `div` 2 - 1) | i <- vs]
problem10 :: Number -> IO Number
problem10 n = do
m <- list' (19*vss) (19*vss+7) vs
nm <- sieve m 2 r vs
nm ! n
where vs = vs' n r
vss = vss' n r
r = r' n
sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r = return m
| otherwise = do
v1 <- m ! p
v2 <- m ! (p - 1)
nm <- if v1 > v2 then update m vs p else return m
sieve nm (p + 1) r vs
update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs
decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
v <- sumOfSieved m k p
mupdate m k (subtract v)
return m
sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
v1 <- m ! (v `div` p)
v2 <- m ! (p - 1)
return $ p * (v1 - v2)
main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9
I am not a professional with hashing and that sort of stuff, so
this can certainly be improved a lot. Maybe we Haskellers should
improve the of the shelf hash maps or provide some simpler ones.
My hashmap, Shershs code
If I plug my hashmap in Shershs (see answer below) code, see here
we are even down to
time stack exec prim-hash2
95673602693282040
real 0m0.601s
user 0m0.604s
sys 0m0.034s
Why is slow.hs slow?
If you read through the source
for the function insert in Data.HashTable.ST.Basic, you
will see that it deletes the old key value pair and inserts
a new one. It doesn't look up the "place" for the value and
mutate it, as one might imagine, if one reads that it is
a "mutable" hashtable. Here the hashtable itself is mutable,
so you don't need to copy the whole hashtable for insertion
of a new key value pair, but the value places for the pairs
are not. I don't know if that is the whole story of slow.hs
being slow, but my guess is, it is a pretty big part of it.
A few minor improvements
So that's the idea I followed while trying to improve
your program the first time.
See, you don't need a mutable mapping from keys to values.
Your key set is fixed. You want a mapping from keys to mutable
places. (Which is, by the way, what you get from C++ by default.)
And so I tried to come up with that. I used IntMap IORef from
Data.IntMap.Strict and Data.IORef first and got a timing
of
tack exec prim
95673602693282040
real 0m2.134s
user 0m2.141s
sys 0m0.028s
I thought maybe it would help to work with unboxed values
and to get that, I used IOUArray Int Int with 1 element
each instead of IORef and got those timings:
time stack exec prim
95673602693282040
real 0m2.015s
user 0m2.018s
sys 0m0.038s
Not much of a difference and so I tried to get rid of bounds
checking in the 1 element arrays by using unsafeRead and
unsafeWrite and got a timing of
time stack exec prim
95673602693282040
real 0m1.845s
user 0m1.850s
sys 0m0.030s
which was the best I got using Data.IntMap.Strict.
Of course I ran each program multiple times to see if
the times are stable and the differences in run time aren't
just noise.
It looks like these are all just micro-optimizations.
And here is the program that ran fastest for me without using a hand rolled data structure:
import qualified Data.IntMap.Strict as M
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead, unsafeWrite)
type Number = Int
type Place = IOUArray Number Number
type Map = M.IntMap Place
tupleToRef :: (Number, Number) -> IO (Number, Place)
tupleToRef = traverse (newArray (0,0))
insertRefs :: [(Number, Number)] -> IO [(Number, Place)]
insertRefs = traverse tupleToRef
fromList :: [(Number, Number)] -> IO Map
fromList xs = M.fromList <$> insertRefs xs
(!) :: Map -> Number -> IO Number
(!) m i = unsafeRead (m M.! i) 0
mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate m i f = do
let place = m M.! i
old <- unsafeRead place 0
let x' = f old
-- make the application of f strict
x' `seq` unsafeWrite place 0 x'
r' :: Number -> Number
r' = floor . sqrt . fromIntegral
vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
list' :: [Number] -> IO Map
list' vs = fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]
problem10 :: Number -> IO Number
problem10 n = do
m <- list' vs
nm <- sieve m 2 r vs
nm ! n
where vs = vs' n r
r = r' n
sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r = return m
| otherwise = do
v1 <- m ! p
v2 <- m ! (p - 1)
nm <- if v1 > v2 then update m vs p else return m
sieve nm (p + 1) r vs
update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs
decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
v <- sumOfSieved m k p
mupdate m k (subtract v)
return m
sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
v1 <- m ! (v `div` p)
v2 <- m ! (p - 1)
return $ p * (v1 - v2)
main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9
If you profile that, you see that it spends most of the time in the custom lookup function (!),
don't know how to improve that further. Trying to inline (!) with {-# INLINE (!) #-}
didn't yield better results; maybe ghc already did this.
This code of mine evaluates the sum to 2⋅10^9 in 0.3 seconds and the sum to 10^12 (18435588552550705911377) in 19.6 seconds (if given sufficient RAM).
import Control.DeepSeq
import qualified Control.Monad as ControlMonad
import qualified Data.Array as Array
import qualified Data.Array.ST as ArrayST
import qualified Data.Array.Base as ArrayBase
primeLucy :: (Integer -> Integer) -> (Integer -> Integer) -> Integer -> (Integer->Integer)
primeLucy f sf n = g
where
r = fromIntegral $ integerSquareRoot n
ni = fromIntegral n
loop from to c = let go i = ControlMonad.when (to<=i) (c i >> go (i-1)) in go from
k = ArrayST.runSTArray $ do
k <- ArrayST.newListArray (-r,r) $ force $
[sf (div n (toInteger i)) - sf 1|i<-[r,r-1..1]] ++
[0] ++
[sf (toInteger i) - sf 1|i<-[1..r]]
ControlMonad.forM_ (takeWhile (<=r) primes) $ \p -> do
l <- ArrayST.readArray k (p-1)
let q = force $ f (toInteger p)
let adjust = \i j -> do { v <- ArrayBase.unsafeRead k (i+r); w <- ArrayBase.unsafeRead k (j+r); ArrayBase.unsafeWrite k (i+r) $!! v+q*(l-w) }
loop (-1) (-div r p) $ \i -> adjust i (i*p)
loop (-div r p-1) (-min r (div ni (p*p))) $ \i -> adjust i (div (-ni) (i*p))
loop r (p*p) $ \i -> adjust i (div i p)
return k
g :: Integer -> Integer
g m
| m >= 1 && m <= integerSquareRoot n = k Array.! (fromIntegral m)
| m >= integerSquareRoot n && m <= n && div n (div n m)==m = k Array.! (fromIntegral (negate (div n m)))
| otherwise = error $ "Function not precalculated for value " ++ show m
primeSum :: Integer -> Integer
primeSum n = (primeLucy id (\m -> div (m*m+m) 2) n) n
If your integerSquareRoot function is buggy (as reportedly some are), you can replace it here with floor . sqrt . fromIntegral.
Explanation:
As the name suggests it is based upon a generalization of the famous method by "Lucy Hedgehog" eventually discovered by the original poster.
It allows you to calculate many sums of the form (with p prime) without enumerating all the primes up to N and in time O(N^0.75).
Its inputs are the function f (i.e., id if you want the prime sum), its summatory function over all the integers (i.e., in that case the sum of the first m integers or div (m*m+m) 2), and N.
PrimeLucy returns a lookup function (with p prime) restricted to certain values of n: .
Try this and let me know how fast it is:
-- sum of primes
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
sieve :: Int -> UArray Int Bool
sieve n = runSTUArray $ do
let m = (n-1) `div` 2
r = floor . sqrt $ fromIntegral n
bits <- newArray (0, m-1) True
forM_ [0 .. r `div` 2 - 1] $ \i -> do
isPrime <- readArray bits i
when isPrime $ do
let a = 2*i*i + 6*i + 3
b = 2*i*i + 8*i + 6
forM_ [a, b .. (m-1)] $ \j -> do
writeArray bits j False
return bits
primes :: Int -> [Int]
primes n = 2 : [2*i+3 | (i, True) <- assocs $ sieve n]
main = do
print $ sum $ primes 1000000
You can run it on ideone. My algorithm is the Sieve of Eratosthenes, and it should be quite fast for small n. For n = 2,000,000,000, the array size may be a problem, in which case you will need to use a segmented sieve. See my blog for more information about the Sieve of Eratosthenes. See this answer for information about a segmented sieve (but not in Haskell, unfortunately).

Batching actions for caching and performance while avoiding the dirty work

Say I have two pure but unsafe functions, that do the same, but one of them is working on batches, and is asymptotically faster:
f :: Int -> Result -- takes O(1) time
f = unsafePerformIO ...
g :: [Int] -> [Result] -- takes O(log n) time
g = unsafePerformIO ...
A naive implementation:
getUntil :: Int -> [Result]
getUntil 0 = f 0
getUntil n = f n : getUntil n-1
switch is the n value where g gets cheaper than f.
getUntil will in practice be called with ever increasing n, but it might not start at 0. So since the Haskell runtime can memoize getUntil, performance will be optimal if getUntil is called with an interval lower than switch. But once the interval gets larger, this implementation is slow.
In an imperative program, I guess I would make a TreeMap (which could quickly be checked for gaps) for caching all calls. On cache misses, it would get filled with the results of g, if the gap was greater than switch in length, and f otherwise, respectively.
How can this be optimized in Haskell?
I think I am just looking for:
an ordered map filled on-demand using a fill function that would fill all values up to the requested index using one function if the missing range is small, another if it is large
a get operation on the map which returns a list of all lower values up to the requested index. This would result in a function similar to getUntil above.
I'll elaborate in my proposal for using map, after some tests I just ran.
import System.IO
import System.IO.Unsafe
import Control.Concurrent
import Control.Monad
switch :: Int
switch = 1000
f :: Int -> Int
f x = unsafePerformIO $ do
threadDelay $ 500 * x
putStrLn $ "Calculated from scratch: f(" ++ show x ++ ")"
return $ 500*x
g :: Int -> Int
g x = unsafePerformIO $ do
threadDelay $ x*x `div` 2
putStrLn $ "Calculated from scratch: g(" ++ show x ++ ")"
return $ x*x `div` 2
cachedFG :: [Int]
cachedFG = map g [0 .. switch] ++ map f [switch+1 ..]
main :: IO ()
main = forever $ getLine >>= print . (cachedFG !!) . read
… where f, g and switch have the same meaning indicated in the question.
The above program can be compiled as is using GHC. When executed, positive integers can be entered, followed by a newline, and the application will print some value based on the number entered by the user plus some extra indication on what values are being calculated from scratch.
A short session with this program is:
User: 10000
Program: Calculated from scratch: f(10000)
Program: 5000000
User: 10001
Program: Calculated from scratch: f(10001)
Program: 5000500
User: 10000
Program: 5000000
^C
The program has to be killed/terminated manually.
Notice that the last value entered doesn't show a "calculated from scratch" message. This indicates that the program has the value cached/memoized somewhere. You can try executing this program yourself; but have into account that threadDelay's lag is proportional to the value entered.
The getUntil function then could be implemented using:
getUntil :: Int -> [Int]
getUntil n = take n cachedFG
or:
getUntil :: Int -> [Int]
getUntil = flip take cachedFG
If you don't know the value for switch, you can try evaluating f and g in parallel and use the fastest result, but that's another show.

sorting integers fast in haskell

Is there any function in haskell libraries that sorts integers in O(n) time?? [By, O(n) I mean faster than comparison sort and specific for integers]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq but what I am trying for does, but the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1 : memory is a bigger issue than time for me here as for the task at hand I am getting out of memory errors (memory usage becomes 3GB! after 30 minutes of run-time)
I am assuming faster algorithms will provide bettor memory print too, hence looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution Used : IntroSort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause exceeding memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that inplace (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
| lo < hi = do
let lscan p h i
| i < h = do
v <- unsafeRead a i
if p < v then return i else lscan p h (i+1)
| otherwise = return i
rscan p l i
| l < i = do
v <- unsafeRead a i
if v < p then return i else rscan p l (i-1)
| otherwise = return i
swap i j = do
v <- unsafeRead a i
unsafeRead a j >>= unsafeWrite a i
unsafeWrite a j v
sloop p l h
| l < h = do
l1 <- lscan p h l
h1 <- rscan p l1 h
if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
| otherwise = return l
piv <- unsafeRead a hi
i <- sloop piv lo hi
swap i hi
myqsort a lo (i-1)
myqsort a (i+1) hi
| otherwise = return ()
genlist gen = runST $ do
arr <- newListArray (0,2^22-1) $ take (2^22) (randoms gen)
myqsort arr 0 (2^22-1)
let collect acc 0 = do
v <- unsafeRead arr 0
return (v:acc)
collect acc i = do
v <- unsafeRead arr i
collect (v:acc) (i-1)
collect [] (2^22-1)
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list, 222 Ints take 32MB storage raw (with 64-bit Ints), with the list overhead of iirc five words per element, that adds up to ~200MB, but less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array(Array,accumArray,assocs)
sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
where count :: Array Int Int
count = accumArray (+) 0 range (zip xs (repeat 1))
range = (0, maximum xs)
It works by creating an Array indexed by integers where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating them the same number of times they occurred in the original list according to the counts.
You should note that it is linear with the maximum value in the list, not the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted linearly.

Performance comparison of two implementations of a primes filter

I have two programs to find prime numbers (just an exercise, I'm learning Haskell). "primes" is about 10X faster than "primes2", once compiled with ghc (with flag -O). However, in "primes2", I thought it would consider only prime numbers for the divisor test, which should be faster than considering odd numbers in "isPrime", right? What am I missing?
isqrt :: Integral a => a -> a
isqrt = floor . sqrt . fromIntegral
isPrime :: Integral a => a -> Bool
isPrime n = length [i | i <- [1,3..(isqrt n)], mod n i == 0] == 1
primes :: Integral a => a -> [a]
primes n = [2,3,5,7,11,13] ++ (filter (isPrime) [15,17..n])
primes2 :: Integral a => a -> [a]
primes2 n = 2 : [i | i <- [3,5..n], all ((/= 0) . mod i) (primes2 (isqrt i))]
I think what's happening here is that isPrime is a simple loop, whereas primes2 is calling itself recursively — and its recursion pattern looks exponential to me.
Searching through my old source code, I found this code:
primes :: [Integer]
primes = 2 : filter isPrime [3,5..]
isPrime :: Integer -> Bool
isPrime x = all (\n -> x `mod` n /= 0) $
takeWhile (\n -> n * n <= x) primes
This tests each possible prime x only against the primes below sqrt(x), using the already generated list of primes. So it probably doesn't test any given prime more than once.
Memoization in Haskell:
Memoization in Haskell is generally explicit, not implicit. The compiler won't "do the right thing" but it will only do what you tell it to. When you call primes2,
*Main> primes2 5
[2,3,5]
*Main> primes2 10
[2,3,5,7]
Each time you call the function it calculates all of its results all over again. It has to. Why? Because 1) You didn't make it save its results, and 2) the answer is different each time you call it.
In the sample code I gave above, primes is a constant (i.e. it has arity zero) so there's only one copy of it in memory, and its parts only get evaluated once.
If you want memoization, you need to have a value with arity zero somewhere in your code.
I like what Dietrich has done with the memoization, but I think theres a data structure issue here too. Lists are just not the ideal data structure for this. They are, by necessity, lisp style cons cells with no random access. Set seems better suited to me.
import qualified Data.Set as S
sieve :: (Integral a) => a -> S.Set a
sieve top = let l = S.fromList (2:3:([5,11..top]++[7,13..top]))
iter s c
| cur > (div (S.findMax s) 2) = s
| otherwise = iter (s S.\\ (S.fromList [2*cur,3*cur..top])) (S.deleteMin c)
where cur = S.findMin c
in iter l (l S.\\ (S.fromList [2,3]))
I know its kind of ugly, and not too declarative, but it runs rather quickly. Im looking into a way to make this nicer looking using Set.fold and Set.union over the composites. Any other ideas for neatening this up would be appreciated.
PS - see how (2:3:([5,11..top]++[7,13..top])) avoids unnecessary multiples of 3 such as the 15 in your primes. Unfortunately, this ruins your ordering if you work with lists and you sign up for a sorting, but for sets thats not an issue.

Any way to create the unmemo-monad?

Suppose someone makes a program to play chess, or solve sudoku. In this kind of program it makes sense to have a tree structure representing game states.
This tree would be very large, "practically infinite". Which isn't by itself a problem as Haskell supports infinite data structures.
An familiar example of an infinite data structure:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Nodes are only allocated when first used, so the list takes finite memory. One may also iterate over an infinite list if they don't keep references to its head, allowing the garbage collector to collect its parts which are not needed anymore.
Back to the tree example - suppose one does some iteration over the tree, the tree nodes iterated over may not be freed if the root of the tree is still needed (for example in an iterative deepening search, the tree would be iterated over several times and so the root needs to be kept).
One possible solution for this problem that I thought of is using an "unmemo-monad".
I'll try to demonstrate what this monad is supposed to do using monadic lists:
import Control.Monad.ListT (ListT) -- cabal install List
import Data.Copointed -- cabal install pointed
import Data.List.Class
import Prelude hiding (enumFromTo)
nums :: ListT Unmemo Int -- What is Unmemo?
nums = enumFromTo 0 1000000
main = print $ div (copoint (foldlL (+) 0 nums)) (copoint (lengthL nums))
Using nums :: [Int], the program would take a lot of memory as a reference to nums is needed by lengthL nums while it is being iterated over foldlL (+) 0 nums.
The purpose of Unmemo is to make the runtime not keep the nodes iterated over.
I attempted using ((->) ()) as Unmemo, but it yields the same results as nums :: [Int] does - the program uses a lot of memory, as evident by running it with +RTS -s.
Is there anyway to implement Unmemo that does what I want?
Same trick as with a stream -- don't capture the remainder directly, but instead capture a value and a function which yields a remainder. You can add memoization on top of this as necessary.
data UTree a = Leaf a | Branch a (a -> [UTree a])
I'm not in the mood to figure it out precisely at the moment, but this structure arises, I'm sure, naturally as the cofree comonad over a fairly straightforward functor.
Edit
Found it: http://hackage.haskell.org/packages/archive/comonad-transformers/1.6.3/doc/html/Control-Comonad-Trans-Stream.html
Or this is perhaps simpler to understand: http://hackage.haskell.org/packages/archive/streams/0.7.2/doc/html/Data-Stream-Branching.html
In either case, the trick is that your f can be chosen to be something like data N s a = N (s -> (s,[a])) for an appropriate s (s being the type of your state parameter of the stream -- the seed of your unfold, if you will). That might not be exactly correct, but something close should do...
But of course for real work, you can scrap all this and just write the datatype directly as above.
Edit 2
The below code illustrates how this can prevent sharing. Note that even in the version without sharing, there are humps in the profile indicating that the sum and length calls aren't running in constant space. I'd imagine that we'd need an explicit strict accumulation to knock those down.
{-# LANGUAGE DeriveFunctor #-}
import Data.Stream.Branching(Stream(..))
import qualified Data.Stream.Branching as S
import Control.Arrow
import Control.Applicative
import Data.List
data UM s a = UM (s -> Maybe a) deriving Functor
type UStream s a = Stream (UM s) a
runUM s (UM f) = f s
liftUM x = UM $ const (Just x)
nullUM = UM $ const Nothing
buildUStream :: Int -> Int -> Stream (UM ()) Int
buildUStream start end = S.unfold (\x -> (x, go x)) start
where go x
| x < end = liftUM (x + 1)
| otherwise = nullUM
sumUS :: Stream (UM ()) Int -> Int
sumUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + x) x
lengthUS :: Stream (UM ()) Int -> Int
lengthUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + 1) x
sumUS' :: Stream (UM ()) Int -> Int
sumUS' x = last $ usToList $ liftUM $ S.scanl (+) 0 x
lengthUS' :: Stream (UM ()) Int -> Int
lengthUS' x = last $ usToList $ liftUM $ S.scanl (\acc _ -> acc + 1) 0 x
usToList x = unfoldr (\um -> (S.head &&& S.tail) <$> runUM () um) x
maxNum = 1000000
nums = buildUStream 0 maxNum
numsL :: [Int]
numsL = [0..maxNum]
-- All these need to be run with increased stack to avoid an overflow.
-- This generates an hp file with two humps (i.e. the list is not shared)
main = print $ div (fromIntegral $ sumUS' nums) (fromIntegral $ lengthUS' nums)
-- This generates an hp file as above, and uses somewhat less memory, at the cost of
-- an increased number of GCs. -H helps a lot with that.
-- main = print $ div (fromIntegral $ sumUS nums) (fromIntegral $ lengthUS nums)
-- This generates an hp file with one hump (i.e. the list is shared)
-- main = print $ div (fromIntegral $ sum $ numsL) (fromIntegral $ length $ numsL)

Resources