Why is `filterM + mapM_` so much slower than `mapM_ + when`, with large lists? - performance

I don't know very much about how Haskell optimization works internally but I've been using filters quite much hoping that they are optimized into something equivalent to a simple if in C++. For example
mapM_ print $ filter (\n -> n `mod` 2 == 0) [0..10]
will compile into equivalent of
for (int i = 0; i < 10; i++)
if (i%2 == 0)
printf("%d\n", i);
With long lists (10 000 000 elements) it seems to be true for a basic filter but there is a huge difference if I use the monadic filterM. I wrote a piece of code for this speed testing and it's obvious that the usage of filterM lasts much longer (250x) than a more imperative approach using when.
import Data.Array.IO
import Control.Monad
import System.CPUTime
main :: IO ()
main = do
start <- getCPUTime
arr <- newArray (0, 100) 0 :: IO (IOUArray Int Int)
let
okSimple i =
i < 100
ok i = do
return $ i < 100
-- -- of course we don't need IO for a simple i < 100
-- -- but my goal is to ask for the contents of the array, e.g.
-- ok i = do
-- current <- readArray arr (i `mod` 101)
-- return$ i `mod` 37 > current `mod` 37
write :: Int -> IO ()
write i =
writeArray arr (i `mod` 101) i
writeIfOkSimple :: Int -> IO ()
writeIfOkSimple i =
when (okSimple i) $ write i
writeIfOk :: Int -> IO ()
writeIfOk i =
ok i >>= (\isOk -> when isOk $ write i)
-------------------------------------------------------------------
---- these four methods have approximately same execution time ----
---- (but the last one is executed on 250 times shorter list) ----
-------------------------------------------------------------------
-- mapM_ write$ filter okSimple [0..10000000*250] -- t = 20.694
-- mapM_ writeIfOkSimple [0..10000000*250] -- t = 20.698
-- mapM_ writeIfOk [0..10000000*250] -- t = 20.669
filterM ok [0..10000000] >>= mapM_ write -- t = 17.200
-- evaluate array
elems <- getElems arr
print $ sum elems
end <- getCPUTime
print $ fromIntegral (end - start) / (10^12)
My question is: shouldn't both approaches (using writeIfOk / using filterM ok and write) compile into the same code (iterate list, ask for condition, write data)? If not, can I do something (rewrite code, add compilation flags, use inline pragma or something) to make them computationally equivalent or should I always use when when performance is critical?

Boiling this question down to its essence, your asking about the difference between
f (filter g xs)
and
f =<< filterM (pure . g) xs
This basically comes down to laziness. filter g xs produces its result incrementally as it's demanded, only walking xs far enough to find the next element of the result. filterM is defined something like this:
filterM _p [] = pure []
filterM p (x : xs)
= liftA2 (\q r -> if q then x : r else r)
(p x)
(filterM p xs)
Since IO is a "strict" applicative, this will not produce anything at all until it's walked the whole list, accumulating the p x results in memory.

Related

Why does my Haskell code not appear to run in Parallel

I am trying to solve a 2-sum algorithm problem for Standford university online course on coursera. I need to find all distinct pairs x+y in a list that sum to a value t in a range [-10000 .. 10000]. I know there more efficient implementations but I thought it would be a good time to try and do some Haskell parallel programming.
I have tried to implement parellelisation just by looping through half of the range in two different threads (which I think are called sparks). My code is the following:
module Main where
import Data.List
import qualified Data.Map as M
import Debug.Trace
import Control.Parallel (par,pseq)
main :: IO ()
main = interact run
range :: [Int]
range = [negate 10000..10000]
emptyMap :: M.Map Int Bool
emptyMap = M.fromList $ zip [] []
run :: String -> String
run xs = let parsedInput = map (read :: String -> Int) $ words xs
hashMap = M.fromList $ zip parsedInput (repeat True)
pcalc r = map (\t -> trace (show t) (countVals hashMap parsedInput t)) r
bot = pcalc (take (div (length range) 2) range)
top = pcalc (drop (div (length range) 2) range)
out = top `par` bot `pseq` (sum bot + sum top)
in show out
countVals :: M.Map Int Bool -> [Int] -> Int -> Int
countVals m ks t = foldl' go 0 ks
where go acum x = if M.lookup y m == Just True
&& y /= x
then 1
else acum
where y = t - x
You can see I have two variables top and bot which I am trying to calculate in parallel via
out = top `par` bot `pseq` (sum bot + sum top)
which is what I thought other stack overflow answers are recommending. However when I compile and run I only seem to see the trace from the bot variable.
% stack ghc --package parallel -- -threaded Main.hs
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...
% ./Main +RTS -N8 < input.txt
-10000
-9999
-9998
-9997
-9996
...
Whereas I was expecting something like:
% ./Main +RTS -N8 < input.txt
-10000
0
-9999
1
-9998
2
-9997
-9996
...
Can someone help point out what exactly I am doing wrong? Thanks
Let's focus on this part:
bot = pcalc (take (div (length range) 2) range)
top = pcalc (drop (div (length range) 2) range)
out = top `par` bot `pseq` (sum bot + sum top)
Here, bot and top are lists. When we seq, pseq or par a value we cause it to be evaluated; since Haskell is lazy, evaluation stops when the "weak head normal form" is reached, i.e. until the first constructor appears in the result. For list values, this means that they are reduced to either [] or unevaluatedHead : unevaluatedTail.
Because of this, top `par` bot `pseq` ... only parallelizes the evaluation of the first cell of the lists, and not their full contents. The whole lists will only get evaluated after pseq when we sum them, but that is run on only one core.
To force the code to be parallel, we can parallelize the sums instead:
sumBot = sum bot
sumTop = sum top
out = sumBot `par` sumTop `pseq` sumBot + sumTop
Since evaluating the sums to WHNF requires evaluating the whole list, this should properly parallelize the computation.

Is runInBoundThread the best tool for parallelism?

Say, I want to fold monoids in parallel. My computer has 8 cores. I have this function to split a list into equal-sized smaller lists (with bounded modulo-bias):
import Data.List
parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
(us,vs) = splitAt (quot (length xs) n) xs
in us : parallelize (n-1) vs
The first version of parallel fold I made was:
import Control.Concurrent
import Control.Concurrent.QSemN
import Data.Foldable
import Data.IORef
foldP :: Monoid m => [m] -> IO m
foldP xs = do
result <- newIORef mempty
sem <- newQSemN 0
n <- getNumCapabilities
let yss = parallelize n xs
for_ yss (\ys -> forkIO (modifyIORef result (fold ys <>) >> signalQSemN sem 1))
waitQSemN sem n
readIORef result
But usage of IORefs and semaphores seemed ugly to me. So I made another version:
import Data.Traversable
foldP :: Monoid m => [m] -> IO m
foldP xs = do
n <- getNumCapabilities
let yss = parallelize n xs
rs <- for yss (\ys -> runInUnboundThread (return (fold ys)))
return (fold rs)
The test code I used is:
import Data.Monoid
import System.CPUTime
main :: IO ()
main = do
start <- getCPUTime
Product result <- foldP (fmap Product [1 .. 100])
end <- getCPUTime
putStrLn ("Time took: " ++ show (end - start) ++ "ps.")
putStrLn ("Result: " ++ show result)
The second version of foldP outperformed the first version. When I used runInBoundThread instead of runInUnboundThread, it became even faster.
By what are these performance differences made?
TLDR; Use fold function from massiv package and you will likely get the most efficient solution in Haskell.
I would like to start by saying that the first thing that people forget when trying to implement concurrent patterns like this is exception handling. In the solution from the question the exception handling is non-existent thus it is totally wrong. Therefore I'd recommend to use existing implementations for common concurrency patterns. async is the goto library for concurrency, but for such use case it will not be the most efficient solution.
This particular example can easily be solved with scheduler package, in fact it is exactly the kind of stuff it was designed for. Here is how you can use it to achieve folding of monoids:
import Control.Scheduler
import Control.Monad.IO.Unlift
foldP :: (MonadUnliftIO m, Monoid n) => Comp -> [n] -> m n
foldP comp xs = do
rs <-
withScheduler comp $ \scheduler ->
mapM_ (scheduleWork scheduler . pure . fold) (parallelize (numWorkers scheduler) xs)
pure $ fold rs
See the Comp type for explanation on best parallelization strategies. From what I found in practice Par will usually work best, because it will use pinned threads created with forkOn
Note that the parallelize function is implemented inefficiently and dangerously as well, it is better to write it this way:
parallelize :: Int -> [a] -> [[a]]
parallelize n' xs' = go 0 id xs'
where
n = max 1 n'
-- at least two elements make sense to get benefit of parallel fold
k = max 2 $ quot (length xs') n
go i acc xs
| null xs = acc []
| i < n =
case splitAt k xs of
(ls, rs) -> go (i + 1) (acc . (ls :)) rs
| otherwise = acc . (xs:) $ []
One more bit of advise is that list is far from ideal data structure for parallelization and efficiency in general. In order to split the lists into chunks before parallelizing computation you already have to go through the data structure with parallelize, which can be avoided if you were to use an array. What I am getting at is use an array instead, as suggested in the beginning of this answer.

How to optimize this Haskell code summing up the primes in sublinear time?

Problem 10 from Project Euler is to find the sum of all the primes below given n.
I solved it simply by summing up the primes generated by the sieve of Eratosthenes. Then I came across much more efficient solution by Lucy_Hedgehog (sub-linear!).
For n = 2⋅10^9:
Python code (from the quote above) runs in 1.2 seconds in Python 2.7.3.
C++ code (mine) runs in about 0.3 seconds (compiled with g++ 4.8.4).
I re-implemented the same algorithm in Haskell, since I'm learning it:
import Data.List
import Data.Map (Map, (!))
import qualified Data.Map as Map
problem10 :: Integer -> Integer
problem10 n = (sieve (Map.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
where vs = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
r = floor (sqrt (fromIntegral n))
sieve :: Map Integer Integer -> Integer -> Integer -> [Integer] -> Map Integer Integer
sieve m p r vs | p > r = m
| otherwise = sieve (if m ! p > m ! (p - 1) then update m vs p else m) (p + 1) r vs
update :: Map Integer Integer -> [Integer] -> Integer -> Map Integer Integer
update m vs p = foldl' decrease m (map (\v -> (v, sumOfSieved m v p)) (takeWhile (>= p*p) vs))
decrease :: Map Integer Integer -> (Integer, Integer) -> Map Integer Integer
decrease m (k, v) = Map.insertWith (flip (-)) k v m
sumOfSieved :: Map Integer Integer -> Integer -> Integer -> Integer
sumOfSieved m v p = p * (m ! (v `div` p) - m ! (p - 1))
main = print $ problem10 $ 2*10^9
I compiled it with ghc -O2 10.hs and run with time ./10.
It gives the correct answer, but takes about 7 seconds.
I compiled it with ghc -prof -fprof-auto -rtsopts 10 and run with ./10 +RTS -p -h.
10.prof shows that decrease takes 52.2% time and 67.5% allocations.
After running hp2ps 10.hp I got such heap profile:
Again looks like decrease takes most of the heap. GHC version 7.6.3.
How would you optimize run time of this Haskell code?
Update 13.06.17:
I tried replacing immutable Data.Map with mutable Data.HashTable.IO.BasicHashTable from the hashtables package, but I'm probably doing something bad, since for tiny n = 30 it already takes too long, about 10 seconds. What's wrong?
Update 18.06.17:
Curious about the HashTable performance issues is a good read. I took Sherh's code using mutable Data.HashTable.ST.Linear, but dropped Data.Judy in instead. It runs in 1.1 seconds, still relatively slow.
I've done some small improvements so it runs in 3.4-3.5 seconds on my machine.
Using IntMap.Strict helped a lot. Other than that I just manually performed some ghc optimizations just to be sure. And make Haskell code more close to Python code from your link. As a next step you could try to use some mutable HashMap. But I'm not sure... IntMap can't be much faster than some mutable container because it's an immutable one. Though I'm still surprised about it's efficiency. I hope this can be implemented faster.
Here is the code:
import Data.List (foldl')
import Data.IntMap.Strict (IntMap, (!))
import qualified Data.IntMap.Strict as IntMap
p :: Int -> Int
p n = (sieve (IntMap.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
where vs = [n `div` i | i <- [1..r]] ++ [n', n' - 1 .. 1]
r = floor (sqrt (fromIntegral n) :: Double)
n' = n `div` r - 1
sieve :: IntMap Int -> Int -> Int -> [Int] -> IntMap Int
sieve m' p' r vs = go m' p'
where
go m p | p > r = m
| m ! p > m ! (p - 1) = go (update m vs p) (p + 1)
| otherwise = go m (p + 1)
update :: IntMap Int -> [Int] -> Int -> IntMap Int
update s vs p = foldl' decrease s (takeWhile (>= p2) vs)
where
sp = s ! (p - 1)
p2 = p * p
sumOfSieved v = p * (s ! (v `div` p) - sp)
decrease m v = IntMap.adjust (subtract $ sumOfSieved v) v m
main :: IO ()
main = print $ p $ 2*10^(9 :: Int)
UPDATE:
Using mutable hashtables I've managed to make performance up to ~5.5sec on Haskell with this implementation.
Also, I used unboxed vectors instead of lists in several places. Linear hashing seems to be the fastest. I think this can be done even faster. I noticed sse42 option in hasthables package. Not sure I've managed to set it correctly but even without it runs that fast.
UPDATE 2 (19.06.2017)
I've managed to make it 3x faster then best solution from #Krom (using my code + his map) by dropping judy hashmap at all. Instead just plain arrays are used. You can come up with the same idea if you notice that keys for S hashmap are either sequence from 1 to n' or n div i for i from 1 to r. So we can represent such HashMap as two arrays making lookups in array depending on searching key.
My code + Judy HashMap
$ time ./judy
95673602693282040
real 0m0.590s
user 0m0.588s
sys 0m0.000s
My code + my sparse map
$ time ./sparse
95673602693282040
real 0m0.203s
user 0m0.196s
sys 0m0.004s
This can be done even faster if instead of IOUArray already generated vectors and Vector library is used and readArray is replaced by unsafeRead. But I don't think this should be done if only you're not really interested in optimizing this as much as possible.
Comparison with this solution is cheating and is not fair. I expect same ideas implemented in Python and C++ will be even faster. But #Krom solution with closed hashmap is already cheating because it uses custom data structure instead of standard one. At least you can see that standard and most popular hash maps in Haskell are not that fast. Using better algorithms and better ad-hoc data structures can be better for such problems.
Here's resulting code.
First as a baseline, the timings of the existing approaches
on my machine:
Original program posted in the question:
time stack exec primorig
95673602693282040
real 0m4.601s
user 0m4.387s
sys 0m0.251s
Second the version using Data.IntMap.Strict from
here
time stack exec primIntMapStrict
95673602693282040
real 0m2.775s
user 0m2.753s
sys 0m0.052s
Shershs code with Data.Judy dropped in here
time stack exec prim-hash2
95673602693282040
real 0m0.945s
user 0m0.955s
sys 0m0.028s
Your python solution.
I compiled it with
python -O -m py_compile problem10.py
and the timing:
time python __pycache__/problem10.cpython-36.opt-1.pyc
95673602693282040
real 0m1.163s
user 0m1.160s
sys 0m0.003s
Your C++ version:
$ g++ -O2 --std=c++11 p10.cpp -o p10
$ time ./p10
sum(2000000000) = 95673602693282040
real 0m0.314s
user 0m0.310s
sys 0m0.003s
I didn't bother to provide a baseline for slow.hs, as I didn't
want to wait for it to complete when run with an argument of
2*10^9.
Subsecond performance
The following program runs in under a second on my machine.
It uses a hand rolled hashmap, which uses closed hashing with
linear probing and uses some variant of knuths hashfunction,
see here.
Certainly it is somewhat tailored to the case, as the lookup
function for example expects the searched keys to be present.
Timings:
time stack exec prim
95673602693282040
real 0m0.725s
user 0m0.714s
sys 0m0.047s
First I implemented my hand rolled hashmap simply to hash
the keys with
key `mod` size
and selected a size multiple times higher than the expected
input, but the program took 22s or more to complete.
Finally it was a matter of choosing a hash function which was
good for the workload.
Here is the program:
import Data.Maybe
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead)
type Number = Int
data Map = Map { keys :: IOUArray Int Number
, values :: IOUArray Int Number
, size :: !Int
, factor :: !Int
}
newMap :: Int -> Int -> IO Map
newMap s f = do
k <- newArray (0, s-1) 0
v <- newArray (0, s-1) 0
return $ Map k v s f
storeKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
storeKey arr s f key = go ((key * f) `mod` s)
where
go :: Int -> IO Int
go ind = do
v <- readArray arr ind
go2 v ind
go2 v ind
| v == 0 = do { writeArray arr ind key; return ind; }
| v == key = return ind
| otherwise = go ((ind + 1) `mod` s)
loadKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
loadKey arr s f key = s `seq` key `seq` go ((key *f) `mod` s)
where
go :: Int -> IO Int
go ix = do
v <- unsafeRead arr ix
if v == key then return ix else go ((ix + 1) `mod` s)
insertIntoMap :: Map -> (Number, Number) -> IO Map
insertIntoMap m#(Map ks vs s f) (k, v) = do
ix <- storeKey ks s f k
writeArray vs ix v
return m
fromList :: Int -> Int -> [(Number, Number)] -> IO Map
fromList s f xs = do
m <- newMap s f
foldM insertIntoMap m xs
(!) :: Map -> Number -> IO Number
(!) (Map ks vs s f) k = do
ix <- loadKey ks s f k
readArray vs ix
mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate (Map ks vs s fac) i f = do
ix <- loadKey ks s fac i
old <- readArray vs ix
let x' = f old
x' `seq` writeArray vs ix x'
r' :: Number -> Number
r' = floor . sqrt . fromIntegral
vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
vss' n r = r + n `div` r -1
list' :: Int -> Int -> [Number] -> IO Map
list' s f vs = fromList s f [(i, i * (i + 1) `div` 2 - 1) | i <- vs]
problem10 :: Number -> IO Number
problem10 n = do
m <- list' (19*vss) (19*vss+7) vs
nm <- sieve m 2 r vs
nm ! n
where vs = vs' n r
vss = vss' n r
r = r' n
sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r = return m
| otherwise = do
v1 <- m ! p
v2 <- m ! (p - 1)
nm <- if v1 > v2 then update m vs p else return m
sieve nm (p + 1) r vs
update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs
decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
v <- sumOfSieved m k p
mupdate m k (subtract v)
return m
sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
v1 <- m ! (v `div` p)
v2 <- m ! (p - 1)
return $ p * (v1 - v2)
main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9
I am not a professional with hashing and that sort of stuff, so
this can certainly be improved a lot. Maybe we Haskellers should
improve the of the shelf hash maps or provide some simpler ones.
My hashmap, Shershs code
If I plug my hashmap in Shershs (see answer below) code, see here
we are even down to
time stack exec prim-hash2
95673602693282040
real 0m0.601s
user 0m0.604s
sys 0m0.034s
Why is slow.hs slow?
If you read through the source
for the function insert in Data.HashTable.ST.Basic, you
will see that it deletes the old key value pair and inserts
a new one. It doesn't look up the "place" for the value and
mutate it, as one might imagine, if one reads that it is
a "mutable" hashtable. Here the hashtable itself is mutable,
so you don't need to copy the whole hashtable for insertion
of a new key value pair, but the value places for the pairs
are not. I don't know if that is the whole story of slow.hs
being slow, but my guess is, it is a pretty big part of it.
A few minor improvements
So that's the idea I followed while trying to improve
your program the first time.
See, you don't need a mutable mapping from keys to values.
Your key set is fixed. You want a mapping from keys to mutable
places. (Which is, by the way, what you get from C++ by default.)
And so I tried to come up with that. I used IntMap IORef from
Data.IntMap.Strict and Data.IORef first and got a timing
of
tack exec prim
95673602693282040
real 0m2.134s
user 0m2.141s
sys 0m0.028s
I thought maybe it would help to work with unboxed values
and to get that, I used IOUArray Int Int with 1 element
each instead of IORef and got those timings:
time stack exec prim
95673602693282040
real 0m2.015s
user 0m2.018s
sys 0m0.038s
Not much of a difference and so I tried to get rid of bounds
checking in the 1 element arrays by using unsafeRead and
unsafeWrite and got a timing of
time stack exec prim
95673602693282040
real 0m1.845s
user 0m1.850s
sys 0m0.030s
which was the best I got using Data.IntMap.Strict.
Of course I ran each program multiple times to see if
the times are stable and the differences in run time aren't
just noise.
It looks like these are all just micro-optimizations.
And here is the program that ran fastest for me without using a hand rolled data structure:
import qualified Data.IntMap.Strict as M
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead, unsafeWrite)
type Number = Int
type Place = IOUArray Number Number
type Map = M.IntMap Place
tupleToRef :: (Number, Number) -> IO (Number, Place)
tupleToRef = traverse (newArray (0,0))
insertRefs :: [(Number, Number)] -> IO [(Number, Place)]
insertRefs = traverse tupleToRef
fromList :: [(Number, Number)] -> IO Map
fromList xs = M.fromList <$> insertRefs xs
(!) :: Map -> Number -> IO Number
(!) m i = unsafeRead (m M.! i) 0
mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate m i f = do
let place = m M.! i
old <- unsafeRead place 0
let x' = f old
-- make the application of f strict
x' `seq` unsafeWrite place 0 x'
r' :: Number -> Number
r' = floor . sqrt . fromIntegral
vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
list' :: [Number] -> IO Map
list' vs = fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]
problem10 :: Number -> IO Number
problem10 n = do
m <- list' vs
nm <- sieve m 2 r vs
nm ! n
where vs = vs' n r
r = r' n
sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs | p > r = return m
| otherwise = do
v1 <- m ! p
v2 <- m ! (p - 1)
nm <- if v1 > v2 then update m vs p else return m
sieve nm (p + 1) r vs
update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs
decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
v <- sumOfSieved m k p
mupdate m k (subtract v)
return m
sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
v1 <- m ! (v `div` p)
v2 <- m ! (p - 1)
return $ p * (v1 - v2)
main = do { n <- problem10 (2*10^9) ; print n; } -- 2*10^9
If you profile that, you see that it spends most of the time in the custom lookup function (!),
don't know how to improve that further. Trying to inline (!) with {-# INLINE (!) #-}
didn't yield better results; maybe ghc already did this.
This code of mine evaluates the sum to 2⋅10^9 in 0.3 seconds and the sum to 10^12 (18435588552550705911377) in 19.6 seconds (if given sufficient RAM).
import Control.DeepSeq
import qualified Control.Monad as ControlMonad
import qualified Data.Array as Array
import qualified Data.Array.ST as ArrayST
import qualified Data.Array.Base as ArrayBase
primeLucy :: (Integer -> Integer) -> (Integer -> Integer) -> Integer -> (Integer->Integer)
primeLucy f sf n = g
where
r = fromIntegral $ integerSquareRoot n
ni = fromIntegral n
loop from to c = let go i = ControlMonad.when (to<=i) (c i >> go (i-1)) in go from
k = ArrayST.runSTArray $ do
k <- ArrayST.newListArray (-r,r) $ force $
[sf (div n (toInteger i)) - sf 1|i<-[r,r-1..1]] ++
[0] ++
[sf (toInteger i) - sf 1|i<-[1..r]]
ControlMonad.forM_ (takeWhile (<=r) primes) $ \p -> do
l <- ArrayST.readArray k (p-1)
let q = force $ f (toInteger p)
let adjust = \i j -> do { v <- ArrayBase.unsafeRead k (i+r); w <- ArrayBase.unsafeRead k (j+r); ArrayBase.unsafeWrite k (i+r) $!! v+q*(l-w) }
loop (-1) (-div r p) $ \i -> adjust i (i*p)
loop (-div r p-1) (-min r (div ni (p*p))) $ \i -> adjust i (div (-ni) (i*p))
loop r (p*p) $ \i -> adjust i (div i p)
return k
g :: Integer -> Integer
g m
| m >= 1 && m <= integerSquareRoot n = k Array.! (fromIntegral m)
| m >= integerSquareRoot n && m <= n && div n (div n m)==m = k Array.! (fromIntegral (negate (div n m)))
| otherwise = error $ "Function not precalculated for value " ++ show m
primeSum :: Integer -> Integer
primeSum n = (primeLucy id (\m -> div (m*m+m) 2) n) n
If your integerSquareRoot function is buggy (as reportedly some are), you can replace it here with floor . sqrt . fromIntegral.
Explanation:
As the name suggests it is based upon a generalization of the famous method by "Lucy Hedgehog" eventually discovered by the original poster.
It allows you to calculate many sums of the form (with p prime) without enumerating all the primes up to N and in time O(N^0.75).
Its inputs are the function f (i.e., id if you want the prime sum), its summatory function over all the integers (i.e., in that case the sum of the first m integers or div (m*m+m) 2), and N.
PrimeLucy returns a lookup function (with p prime) restricted to certain values of n: .
Try this and let me know how fast it is:
-- sum of primes
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
sieve :: Int -> UArray Int Bool
sieve n = runSTUArray $ do
let m = (n-1) `div` 2
r = floor . sqrt $ fromIntegral n
bits <- newArray (0, m-1) True
forM_ [0 .. r `div` 2 - 1] $ \i -> do
isPrime <- readArray bits i
when isPrime $ do
let a = 2*i*i + 6*i + 3
b = 2*i*i + 8*i + 6
forM_ [a, b .. (m-1)] $ \j -> do
writeArray bits j False
return bits
primes :: Int -> [Int]
primes n = 2 : [2*i+3 | (i, True) <- assocs $ sieve n]
main = do
print $ sum $ primes 1000000
You can run it on ideone. My algorithm is the Sieve of Eratosthenes, and it should be quite fast for small n. For n = 2,000,000,000, the array size may be a problem, in which case you will need to use a segmented sieve. See my blog for more information about the Sieve of Eratosthenes. See this answer for information about a segmented sieve (but not in Haskell, unfortunately).

Speeding up a stream like data type

I've made a type which is supposed to emulate a "stream". This is basically a list without memory.
data Stream a = forall s. Stream (s -> Maybe (a, s)) s
Basically a stream has two elements. A state s, and a function that takes the state, and returns an element of type a and the new state.
I want to be able to perform operations on streams, so I've imported Data.Foldable and defined streams on it as such:
import Data.Foldable
instance Foldable Stream where
foldr k z (Stream sf s) = go (sf s)
where
go Nothing = z
go (Just (e, ns)) = e `k` go (sf ns)
To test the speed of my stream, I've defined the following function:
mysum = foldl' (+) 0
And now we can compare the speed of ordinary lists and my stream type:
x1 = [1..n]
x2 = Stream (\s -> if (s == n + 1) then Nothing else Just (s, s + 1)) 1
--main = print $ mysum x1
--main = print $ mysum x2
My streams are about half the speed of lists (full code here).
Furthermore, here's a best case situation, without a list or a stream:
bestcase :: Int
bestcase = go 1 0 where
go i c = if i == n then c + i else go (i+1) (c+i)
This is a lot faster than both the list and stream versions.
So I've got two questions:
How to I get my stream version to be at least as fast as a list.
How to I get my stream version to be close to the speed of bestcase.
As it stands the foldl' you are getting from Foldable is defined in terms of the foldr you gave it. The default implementation is the brilliant and surprisingly good
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
where f' x k z = k $! f z x
But foldl' is the specialty of your type; fortunately the Foldable class includes foldl' as a method, so you can just add this to your instance.
foldl' op acc0 (Stream sf s0) = loop s0 acc0
where
loop !s !acc = case sf s of
Nothing -> acc
Just (a,s') -> loop s' (op acc a)
For me this seems to give about the same time as bestcase
Note that this is a standard case where we need a strictness annotation on the accumulator. You might look in the vector package's treatment of a similar type https://hackage.haskell.org/package/vector-0.10.12.2/docs/src/Data-Vector-Fusion-Stream.html for some ideas; or in the hidden 'fusion' modules of the text library https://github.com/bos/text/blob/master/Data/Text/Internal/Fusion .

Any way to create the unmemo-monad?

Suppose someone makes a program to play chess, or solve sudoku. In this kind of program it makes sense to have a tree structure representing game states.
This tree would be very large, "practically infinite". Which isn't by itself a problem as Haskell supports infinite data structures.
An familiar example of an infinite data structure:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Nodes are only allocated when first used, so the list takes finite memory. One may also iterate over an infinite list if they don't keep references to its head, allowing the garbage collector to collect its parts which are not needed anymore.
Back to the tree example - suppose one does some iteration over the tree, the tree nodes iterated over may not be freed if the root of the tree is still needed (for example in an iterative deepening search, the tree would be iterated over several times and so the root needs to be kept).
One possible solution for this problem that I thought of is using an "unmemo-monad".
I'll try to demonstrate what this monad is supposed to do using monadic lists:
import Control.Monad.ListT (ListT) -- cabal install List
import Data.Copointed -- cabal install pointed
import Data.List.Class
import Prelude hiding (enumFromTo)
nums :: ListT Unmemo Int -- What is Unmemo?
nums = enumFromTo 0 1000000
main = print $ div (copoint (foldlL (+) 0 nums)) (copoint (lengthL nums))
Using nums :: [Int], the program would take a lot of memory as a reference to nums is needed by lengthL nums while it is being iterated over foldlL (+) 0 nums.
The purpose of Unmemo is to make the runtime not keep the nodes iterated over.
I attempted using ((->) ()) as Unmemo, but it yields the same results as nums :: [Int] does - the program uses a lot of memory, as evident by running it with +RTS -s.
Is there anyway to implement Unmemo that does what I want?
Same trick as with a stream -- don't capture the remainder directly, but instead capture a value and a function which yields a remainder. You can add memoization on top of this as necessary.
data UTree a = Leaf a | Branch a (a -> [UTree a])
I'm not in the mood to figure it out precisely at the moment, but this structure arises, I'm sure, naturally as the cofree comonad over a fairly straightforward functor.
Edit
Found it: http://hackage.haskell.org/packages/archive/comonad-transformers/1.6.3/doc/html/Control-Comonad-Trans-Stream.html
Or this is perhaps simpler to understand: http://hackage.haskell.org/packages/archive/streams/0.7.2/doc/html/Data-Stream-Branching.html
In either case, the trick is that your f can be chosen to be something like data N s a = N (s -> (s,[a])) for an appropriate s (s being the type of your state parameter of the stream -- the seed of your unfold, if you will). That might not be exactly correct, but something close should do...
But of course for real work, you can scrap all this and just write the datatype directly as above.
Edit 2
The below code illustrates how this can prevent sharing. Note that even in the version without sharing, there are humps in the profile indicating that the sum and length calls aren't running in constant space. I'd imagine that we'd need an explicit strict accumulation to knock those down.
{-# LANGUAGE DeriveFunctor #-}
import Data.Stream.Branching(Stream(..))
import qualified Data.Stream.Branching as S
import Control.Arrow
import Control.Applicative
import Data.List
data UM s a = UM (s -> Maybe a) deriving Functor
type UStream s a = Stream (UM s) a
runUM s (UM f) = f s
liftUM x = UM $ const (Just x)
nullUM = UM $ const Nothing
buildUStream :: Int -> Int -> Stream (UM ()) Int
buildUStream start end = S.unfold (\x -> (x, go x)) start
where go x
| x < end = liftUM (x + 1)
| otherwise = nullUM
sumUS :: Stream (UM ()) Int -> Int
sumUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + x) x
lengthUS :: Stream (UM ()) Int -> Int
lengthUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + 1) x
sumUS' :: Stream (UM ()) Int -> Int
sumUS' x = last $ usToList $ liftUM $ S.scanl (+) 0 x
lengthUS' :: Stream (UM ()) Int -> Int
lengthUS' x = last $ usToList $ liftUM $ S.scanl (\acc _ -> acc + 1) 0 x
usToList x = unfoldr (\um -> (S.head &&& S.tail) <$> runUM () um) x
maxNum = 1000000
nums = buildUStream 0 maxNum
numsL :: [Int]
numsL = [0..maxNum]
-- All these need to be run with increased stack to avoid an overflow.
-- This generates an hp file with two humps (i.e. the list is not shared)
main = print $ div (fromIntegral $ sumUS' nums) (fromIntegral $ lengthUS' nums)
-- This generates an hp file as above, and uses somewhat less memory, at the cost of
-- an increased number of GCs. -H helps a lot with that.
-- main = print $ div (fromIntegral $ sumUS nums) (fromIntegral $ lengthUS nums)
-- This generates an hp file with one hump (i.e. the list is shared)
-- main = print $ div (fromIntegral $ sum $ numsL) (fromIntegral $ length $ numsL)

Resources