Using vectors for performance improvement in Haskell

I'm very new to Haskell, and I have a question about what performance improvements can be had by using impure (mutable) data structures. I'm trying to piece together a few different things I've heard, so please bear with me if my terminology is not entirely correct, or if there are some small errors.
To make this concrete, consider the quicksort algorithm (taken from the Haskell wiki).
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    lesser  = filter (< p) xs
    greater = filter (>= p) xs
This is not a "true" quicksort: a true quicksort sorts in place, and this version does not, which makes it very memory-inefficient.
On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort. An example is given in this Stack Overflow answer.
How much faster is the second algorithm than the first? Big O notation doesn't help here, because the performance improvement is going to come from using memory more efficiently, not from having a better algorithm (right?). I tried to construct some test cases on my own, but I had difficulty getting things running.
An ideal answer would give some idea of what makes the in-place Haskell algorithm faster theoretically, and an example comparison of running times on some test data set.

On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort.
How much faster is the second algorithm than the first?
That depends on the implementation, of course. As can be seen below, for not too short lists, a decent in-place sort on a mutable vector or array is much faster than sorting lists, even if the time for the transformation from and to lists is included (and that conversion makes up the bulk of the time).
However, the list algorithms produce incremental output, while the array/vector algorithms don't produce any result before they have completed, therefore sorting lists can still be preferable.
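To make that concrete, here is a minimal sketch using the naive quicksort from the question: thanks to laziness, taking a small prefix of the result does only the work needed for those elements, while an in-place vector sort must run to completion before anything can be read back.
quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = quicksort [x | x <- xs, x < p] ++ [p] ++ quicksort [x | x <- xs, x >= p]

main :: IO ()
main = print (take 3 (quicksort [1000, 999 .. 1 :: Int]))
-- only the work needed to produce the three smallest elements is performed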
I don't know exactly what the linked mutable array/vector algorithms did wrong. But they did something quite wrong.
For the mutable vector code, it seems that it used boxed vectors, and it was polymorphic, both of which can have a significant performance impact, though the polymorphism shouldn't matter if the functions are {-# INLINABLE #-}.
For the IOUArray code, well, it looks fun, but slow. It uses an IORef, readArray and writeArray and has no obvious strictness. The abysmal times it takes aren't too surprising, then.
Using a more direct translation of the (monomorphic) C code using an STUArray, with a wrapper to make it work on lists¹,
{-# LANGUAGE BangPatterns #-}
module STUQuickSort (stuquick) where

import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST

stuquick :: [Int] -> [Int]
stuquick [] = []
stuquick xs = runST (do
    let !len = length xs
    arr <- newListArray (0, len-1) xs
    myqsort arr 0 (len-1)
    -- Can't use getElems for large arrays, that overflows the stack, wth?
    let pick acc i
            | i < 0     = return acc
            | otherwise = do
                !v <- unsafeRead arr i
                pick (v:acc) (i-1)
    pick [] (len-1))
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
    | lo < hi = do
        let lscan p h i
                | i < h = do
                    v <- unsafeRead a i
                    if p < v then return i else lscan p h (i+1)
                | otherwise = return i
            rscan p l i
                | l < i = do
                    v <- unsafeRead a i
                    if v < p then return i else rscan p l (i-1)
                | otherwise = return i
            swap i j = do
                v <- unsafeRead a i
                unsafeRead a j >>= unsafeWrite a i
                unsafeWrite a j v
            sloop p l h
                | l < h = do
                    l1 <- lscan p h l
                    h1 <- rscan p l1 h
                    if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
                | otherwise = return l
        piv <- unsafeRead a hi
        i <- sloop piv lo hi
        swap i hi
        myqsort a lo (i-1)
        myqsort a (i+1) hi
    | otherwise = return ()
and a wrapper around a good sort (Introsort, not quicksort) on unboxed vectors,
module VSort where

import Data.Vector.Algorithms.Intro
import qualified Data.Vector.Unboxed as U
import Control.Monad.ST

vsort :: [Int] -> [Int]
vsort xs = runST (do
    v <- U.unsafeThaw $ U.fromList xs
    sort v
    s <- U.unsafeFreeze v
    return $ U.toList s)
I get times more in line with my expectations. (Note: for these timings, the random list was deepseqed before calling the sorting algorithm. Without that, the conversion to an STUArray would be much slower, since it would first evaluate a long list of thunks to determine the length; the fromList conversion of the vector package doesn't suffer from this problem. Moving the deepseq into the conversion to STUArray instead, the other sorting [and, in the vector case, conversion] algorithms take a little less time, so the difference between vector-algorithms' introsort and the STUArray quicksort becomes a little larger.):
list size: 200000           ghc         -O2         -fllvm      -fllvm -O2
──────────────────────      ────────    ────────    ────────    ──────────
Data.List.sort              0.663501s   0.665482s   0.652461s   0.792005s
Naive.quicksort             0.587091s   0.577796s   0.585754s   0.667573s
STUArray.quicksort          1.58023s    0.142626s   1.597479s   0.156411s
VSort.vsort                 0.820639s   0.139967s   0.888566s   0.143918s
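For concreteness, the list preparation described in the note might look roughly like this; this is a sketch, not the original benchmark harness, and it assumes the stuquick wrapper from above:
import Control.DeepSeq (deepseq)
import Control.Exception (evaluate)
import System.Random (newStdGen, randoms)

import STUQuickSort (stuquick)

main :: IO ()
main = do
    gen <- newStdGen
    let xs = take 200000 (randoms gen) :: [Int]
    xs `deepseq` return ()             -- force the whole list before the timed section
    _ <- evaluate (sum (stuquick xs))  -- force the sort (timing code omitted)
    return ()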
The times without optimisation are expectedly bad for the STUArray. unsafeRead and unsafeWrite must be inlined to be fast. If not inlined, you get a dictionary lookup for each call. Thus for the large dataset, I omit the unoptimised ways:
list size: 3000000          -O2          -fllvm -O2
──────────────────────      ──────────   ──────────
Data.List.sort              16.728576s   16.442377s
Naive.quicksort             14.297534s   12.253071s
STUArray.quicksort          2.307203s    2.200807s
VSort.vsort                 2.069749s    1.921943s
You can see that an inplace sort on a mutable unboxed array is much faster than a list-based sort if done correctly. Whether the difference between the STUArray sort and the sort on the unboxed mutable vector is due to the different algorithm or whether vectors are indeed faster here, I don't know. Since I've never observed vectors to be faster² than STUArrays, I tend to believe the former.
The difference between the STUArray quicksort and the introsort is in part due to the better conversion from and to lists that the vector package offers, in part due to the different algorithms.
At Louis Wasserman's suggestion, I have run a quick benchmark using the other sorting algorithms from the vector-algorithms package, using a not-too-large dataset. The results aren't surprising: the good general-purpose algorithms (heapsort, introsort and mergesort) all do well, with times near the quicksort on the unboxed mutable array (but of course, the quicksort would degrade to quadratic behaviour on almost-sorted input, while these have a guaranteed O(n*log n) worst case). The special-purpose sorting algorithms AmericanFlag and radix sort do badly, since the input doesn't fit their purpose well (radix sort would do better on larger inputs with a larger range; as is, it does many more passes than the data needs). Insertion sort is by far the worst, due to its quadratic behaviour.
AmericanFlag:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.083845s   1.084699s
Naive.quicksort             0.981276s   1.05532s
STUArray.quicksort          0.218407s   0.215564s
VSort.vsort                 2.566838s   2.618817s

Heap:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.084252s   1.07894s
Naive.quicksort             0.915984s   0.887354s
STUArray.quicksort          0.219786s   0.225748s
VSort.vsort                 0.213507s   0.20152s

Insertion:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.168837s   1.066058s
Naive.quicksort             1.081806s   0.879439s
STUArray.quicksort          0.241958s   0.209631s
VSort.vsort                 36.21295s   27.564993s

Intro:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.09189s    1.112415s
Naive.quicksort             0.891161s   0.989799s
STUArray.quicksort          0.236596s   0.227348s
VSort.vsort                 0.221742s   0.20815s

Merge:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.087929s   1.074926s
Naive.quicksort             0.875477s   1.019984s
STUArray.quicksort          0.215551s   0.221301s
VSort.vsort                 0.236661s   0.230287s

Radix:

list size: 300000           -O2         -fllvm -O2
──────────────────────      ────────    ──────────
Data.List.sort              1.085658s   1.085726s
Naive.quicksort             1.002067s   0.900985s
STUArray.quicksort          0.217371s   0.228973s
VSort.vsort                 1.958216s   1.970619s
Conclusion: Unless you have a specific reason not to, using one of the good general-purpose sorting algorithms from vector-algorithms, with a wrapper to convert from and to lists if necessary, is the recommended way to sort large lists. (These algorithms also work well with boxed vectors, in my measurements approximately 50% slower than unboxed.) For short lists, the overhead of the conversion would be so large that it doesn't pay.
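Since the sorts in vector-algorithms are polymorphic over the mutable vector type, the boxed-vector variant mentioned in the conclusion needs only a different wrapper. A minimal sketch (my code, not from the original measurements):
import qualified Data.Vector as V
import Data.Vector.Algorithms.Intro (sort)
import Control.Monad.ST (runST)

-- same wrapper as vsort, but on boxed vectors, so it works for any Ord type
vsortBoxed :: Ord a => [a] -> [a]
vsortBoxed xs = runST $ do
    v <- V.unsafeThaw (V.fromList xs)
    sort v
    V.toList <$> V.unsafeFreeze v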
Now, at @applicative's suggestion, a look at the sorting times for vector-algorithms' introsort, a quicksort on unboxed vectors and an improved (shamelessly stealing the implementation of unstablePartition) quicksort on STUArrays.
The improved STUArray quicksort:
{-# LANGUAGE BangPatterns #-}
module NQuick (stuqsort) where

import Data.Array.Base (unsafeRead, unsafeWrite, getNumElements)
import Data.Array.ST
import Control.Monad.ST
import Control.Monad (when)

stuqsort :: STUArray s Int Int -> ST s ()
stuqsort arr = do
    n <- getNumElements arr
    when (n > 1) (myqsort arr 0 (n-1))

myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi = do
    p <- unsafeRead a hi
    j <- unstablePartition (< p) lo hi a
    h <- unsafeRead a j
    unsafeWrite a j p
    unsafeWrite a hi h
    when (j > lo+1) (myqsort a lo (j-1))
    when (j+1 < hi) (myqsort a (j+1) hi)

unstablePartition :: (Int -> Bool) -> Int -> Int -> STUArray s Int Int -> ST s Int
{-# INLINE unstablePartition #-}
unstablePartition f !lf !rg !v = from_left lf rg
  where
    from_left i j
        | i == j = return i
        | otherwise = do
            x <- unsafeRead v i
            if f x
                then from_left (i+1) j
                else from_right i (j-1)
    from_right i j
        | i == j = return i
        | otherwise = do
            x <- unsafeRead v j
            if f x
                then do
                    y <- unsafeRead v i
                    unsafeWrite v i x
                    unsafeWrite v j y
                    from_left (i+1) j
                else from_right i (j-1)
The vector quicksort:
module VectorQuick (vquicksort) where

import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Data.Vector.Generic.Mutable as GM
import Control.Monad.ST
import Control.Monad (when)

vquicksort :: UM.STVector s Int -> ST s ()
vquicksort uv = do
    let li = UM.length uv - 1
        ui = UM.unsafeSlice 0 li uv
    p <- UM.unsafeRead uv li
    j <- GM.unstablePartition (< p) ui
    h <- UM.unsafeRead uv j
    UM.unsafeWrite uv j p
    UM.unsafeWrite uv li h
    when (j > 1) (vquicksort (UM.unsafeSlice 0 j uv))
    when (j + 1 < li) (vquicksort (UM.unsafeSlice (j+1) (li-j) uv))
The timing code:
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import System.CPUTime
import System.Random
import Text.Printf
import Data.Array.Unboxed
import Data.Array.ST hiding (unsafeThaw)
import Data.Array.Unsafe (unsafeThaw)
import Data.Array.Base (unsafeAt, unsafeNewArray_, unsafeWrite)
import Control.Monad.ST
import Control.Monad
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM

import NQuick
import VectorQuick
import qualified Data.Vector.Algorithms.Intro as I

nextR :: StdGen -> (Int, StdGen)
nextR = randomR (minBound, maxBound)

buildArray :: StdGen -> Int -> UArray Int Int
buildArray sg size = runSTUArray (do
    arr <- unsafeNewArray_ (0, size-1)
    let fill i g
            | i < size = do
                let (r, g') = nextR g
                unsafeWrite arr i r
                fill (i+1) g'
            | otherwise = return arr
    fill 0 sg)

buildVector :: StdGen -> Int -> U.Vector Int
buildVector sg size = U.fromList $ take size (randoms sg)

time :: IO a -> IO ()
time action = do
    t0 <- getCPUTime
    action
    t1 <- getCPUTime
    let tm :: Double
        tm = fromInteger (t1 - t0) * 1e-9
    printf "%.3f ms\n" tm

stu :: UArray Int Int -> Int -> IO ()
stu ua sz = do
    let !sa = runSTUArray (do
            st <- unsafeThaw ua
            stuqsort st
            return st)
    forM_ [0, sz `quot` 2, sz-1] (print . (sa `unsafeAt`))

intro :: U.Vector Int -> Int -> IO ()
intro uv sz = do
    let !sv = runST (do
            st <- U.unsafeThaw uv
            I.sort st
            U.unsafeFreeze st)
    forM_ [0, sz `quot` 2, sz-1] (print . U.unsafeIndex sv)

vquick :: U.Vector Int -> Int -> IO ()
vquick uv sz = do
    let !sv = runST (do
            st <- U.unsafeThaw uv
            vquicksort st
            U.unsafeFreeze st)
    forM_ [0, sz `quot` 2, sz-1] (print . U.unsafeIndex sv)

main :: IO ()
main = do
    args <- getArgs
    let !num = case args of
                 (a:_) -> read a
                 _     -> 1000000
    !sg <- getStdGen
    let !ar = buildArray sg num
        !vc = buildVector sg num
        !v2 = buildVector sg (foo num)
        algos = [ ("Intro", intro v2), ("STUArray", stu ar), ("Vquick", vquick vc) ]
    printf "Created data to be sorted, last elements %d %d %d\n" (ar ! (num-1)) (vc U.! (num-1)) (v2 U.! (num-1))
    forM_ algos $ \(name, act) -> do
        putStrLn name
        time (act num)

-- For the prevention of sharing
foo :: Int -> Int
foo n
    | n < 0 = -n
    | n > 0 = n
    | otherwise = 3
The results (times only):
$ ./timeSorts 3000000
Intro
587.911 ms
STUArray
402.939 ms
Vquick
414.936 ms
$ ./timeSorts 1000000
Intro
193.970 ms
STUArray
131.980 ms
Vquick
134.979 ms
The practically identical quicksorts on the STUArray and the unboxed vector take practically the same time, as expected. (The old quicksort implementation was about 15% slower than the introsort. Compared to the times above, about 70-75% of the time there was spent converting from/to lists.)
On the random input, the quicksorts perform significantly better than the introsort, but on almost-sorted input, their performance would degrade while introsort wouldn't.
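For illustration, the guard that lets introsort avoid this degradation is a depth limit; a list-based sketch of the idea (my code, with Data.List.sort standing in for the heapsort fallback that a real introsort uses):
import qualified Data.List

-- depth-limited quicksort: once the depth budget is exhausted, fall back to a
-- sort with a guaranteed O(n log n) worst case
introish :: Ord a => Int -> [a] -> [a]
introish _ []     = []
introish 0 xs     = Data.List.sort xs
introish d (p:xs) = introish (d-1) [x | x <- xs, x < p]
                    ++ [p] ++
                    introish (d-1) [x | x <- xs, x >= p]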
¹ Making the code polymorphic with STUArrays is a pain at best, doing it with IOUArrays and having both the sorting and the wrapper {-# INLINABLE #-} produces the same performance with optimisations - without, the polymorphic code is significantly slower.
² Using the same algorithms, both were always equally fast within the precision of measurement when I compared (not very often).

There's nothing better than a test, right? And the results are not surprising: for lists of random integers in the range [0 .. 1000000],
list size: 200000             ghc         -O2         -fllvm      -fllvm -O2
────────────────────────      ─────────   ─────────   ─────────   ──────────
Data.List.sort                0.878969s   0.883219s   0.878106s   0.888758s
Naïve.quicksort               0.711305s   0.870647s   0.845508s   0.919925s
UArray_IO.quicksort           9.317783s   1.919583s   9.390687s   1.945072s
Vector_Mutable.quicksort      1.48142s    0.823004s   1.526661s   0.806837s
Here, Data.List.sort is just what it is, Naïve.quicksort is the algorithm you quoted, and UArray_IO.quicksort and Vector_Mutable.quicksort are taken from the question you linked to: klapaucius' and Dan Burton's answers, which turn out to be very suboptimal performance-wise (see how much better Daniel Fischer could do it). Both are wrapped so as to accept lists (not sure if I got this quite right):
quicksort :: [Int] -> [Int]
quicksort l = unsafePerformIO $ do
    let bounds = (0, length l)
    arr <- newListArray bounds l :: IO (IOUArray Int Int)
    uncurry (qsort arr) bounds
    getElems arr
and
quicksort :: Ord a => [a] -> [a]
quicksort = toList . iqsort . fromList
respectively.
As you can see, the naïve algorithm is not far behind the mutable solution with Data.Vector in terms of speed for sorting a list of random-generated integers, and the IOUArray is actually much worse. Test was carried out on an Intel i5 laptop running Ubuntu 11.10 x86-64.
The following doesn't really make much sense considering that good mutable implementations are, after all, still well ahead of all those compared here.
Note that this does not mean that a nice list-based program can always keep up with its mutably-implemented equivalents, but GHC sure does a great job at bringing the performance close. Also, it depends of course on the data: these are the times when the randomly generated lists to sort contain values between 0 and 1000 rather than 0 and 1000000 as above, i.e. with many duplicates:
list size: 200000             ghc          -O2         -fllvm       -fllvm -O2
────────────────────────      ──────────   ─────────   ──────────   ──────────
Data.List.sort                0.864176s    0.882574s   0.850807s    0.857957s
Naïve.quicksort               1.475362s    1.526076s   1.475557s    1.456759s
UArray_IO.quicksort           24.405938s   5.255001s   23.561911s   5.207535s
Vector_Mutable.quicksort      3.449168s    1.125788s   3.202925s    1.117741s
Not to speak of pre-sorted arrays.
What's quite interesting (it becomes apparent only with really large sizes, which require rtsopts to increase the stack capacity) is how both mutable implementations become significantly slower with -fllvm -O2:
list size: 3⋅10⁶              -O1          -fllvm -O1   -O2          -fllvm -O2
────────────────────────      ──────────   ──────────   ──────────   ──────────
Data.List.sort                23.897897s   24.138117s   23.708218s   23.631968s
Naïve.quicksort               17.068644s   19.547817s   17.640389s   18.113622s
UArray_IO.quicksort           35.634132s   38.348955s   37.177606s   49.190503s
Vector_Mutable.quicksort      17.286982s   17.251068s   17.361247s   36.840698s
It seems kind of logical to me that the immutable implementations fare better on llvm (doesn't it do everything immutably on some level?), though I don't understand why this only becomes apparent as a slowdown to the mutable versions at high optimisation and large data sizes.
Testing program:
$ cat QSortPerform.hs
module Main where

import qualified Data.List(sort)
import qualified Naïve
import qualified UArray_IO
import qualified Vector_Mutable
import Control.Monad
import System.Random
import System.Environment

sortAlgos :: [ (String, [Int] -> [Int]) ]
sortAlgos = [ ("Data.List.sort",           Data.List.sort          )
            , ("Naïve.quicksort",          Naïve.quicksort         )
            , ("UArray_IO.quicksort",      UArray_IO.quicksort     )
            , ("Vector_Mutable.quicksort", Vector_Mutable.quicksort) ]

main = do
    args <- getArgs
    when (length args /= 2) $ error "Need 2 arguments"
    let simSize = read $ args!!1
    randArray <- fmap (take simSize . randomRs (0,1000000)) getStdGen
    let sorted = case filter ((== args!!0) . fst) sortAlgos of
                   [(_, algo)] -> algo randArray
                   _           -> error $ "Argument must be one of "
                                            ++ show (map fst sortAlgos)
    putStr "First element: ";  print $ sorted!!0
    putStr "Middle element: "; print $ sorted!!(simSize`div`2)
    putStr "Last element: ";   print $ sorted!!(simSize-1)
which takes the algorithm name and array size on command-line. Runtime comparison was done with this program:
$ cat PerformCompare.hs
module Main where

import System.Process
import System.Exit
import System.Environment
import Data.Time.Clock
import Data.List
import Control.Monad
import Text.PrettyPrint.Boxes

compiler    = "ghc"
testProgram = "./QSortPerform"
flagOpts    = [[], ["-O2"], ["-fllvm"], ["-fllvm","-O2"]]
algos       = [ "Data.List.sort", "Naïve.quicksort"
              , "UArray_IO.quicksort", "Vector_Mutable.quicksort" ]

main = do
    args <- getArgs
    let testSize = case args of
                     [numb] -> read numb
                     _      -> 200000
    results <- forM flagOpts $ \flags -> do
        compilerExitC <- verboseSystem compiler
                           $ testProgram : "-fforce-recomp" : flags
        when (compilerExitC /= ExitSuccess) .
            error $ "Compiler error \"" ++ show compilerExitC ++ "\""
        algoCompare <- forM algos $ \algo -> do
            startTime <- getCurrentTime
            exitC <- verboseSystem testProgram [algo, show testSize]
            endTime <- getCurrentTime
            when (exitC /= ExitSuccess) .
                error $ "Program error \"" ++ show exitC ++ "\""
            return . text . show $ diffUTCTime endTime startTime
        return . vcat right $ text (concat flags)
                            : text "────────"
                            : algoCompare
    let table = hsep 2 bottom
                  $ vcat left (map text $ ("list size: " ++ show testSize)
                                        : "────────"
                                        : algos )
                  : results
    printBox table

verboseSystem :: String -> [String] -> IO ExitCode
verboseSystem cmd args = do
    putStrLn . unwords $ cmd : args
    rawSystem cmd args

Related

Haskell explicit recursion vs `iterate`

While writing a function using iterate in Haskell, I found that an equivalent version with explicit recursion seemed noticeably faster - even though I believed that explicit recursion ought to be frowned upon in Haskell.
Similarly, I expected GHC to be able to inline/optimise list combinators appropriately so that the resulting machine code performs at least similarly to the explicit recursion.
Here's a (different) example, which also displays the slowdown I observed.
steps m n and its variant steps' compute the number of Collatz steps n takes to reach 1, giving up after m attempts.
steps uses explicit recursion while steps' uses list functions.
import Data.List (elemIndex)
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)

collatz :: Int -> Int
collatz n
  | even n    = n `quot` 2
  | otherwise = 3 * n + 1

steps :: Int -> Int -> Maybe Int
steps m = go 0
  where go k n
          | n == 1    = Just k
          | k == m    = Nothing
          | otherwise = go (k+1) (collatz n)

steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz

main :: IO ()
main = evaluate $ rnf $ map (steps 800) $ [1..10^7]
I tested these by evaluating for all values up to 10^7, each giving up after 800 steps. On my machine (compiled with ghc -O2), explicit recursion took just under 4 seconds (3.899s) but list combinators took about 5 times longer (19.922s).
Why is explicit recursion so much better in this case, and is there a way of writing this without explicit recursion while preserving performance?
Updated: I submitted Trac 15426 for this bug.
The problem disappears if you copy the definitions of elemIndex and findIndex into your module:
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
import Data.Maybe (listToMaybe)
import Data.List (findIndices)
elemIndex :: Eq a => a -> [a] -> Maybe Int
elemIndex x = findIndex (x==)
findIndex :: (a -> Bool) -> [a] -> Maybe Int
findIndex p = listToMaybe . findIndices p
collatz :: Int -> Int
collatz n
| even n = n `quot` 2
| otherwise = 3 * n + 1
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps' 800) $ [1..10^7]
The problem seems to be that these must be inlinable for GHC to get the fusion right. Unfortunately, neither of them is marked inlinable in Data.OldList.
The change to allow findIndex to participate in fusion is relatively recent (see Trac 14387) where listToMaybe was reimplemented as a foldr. So, it probably hasn't seen a lot of testing yet.
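Until the library is fixed, another workaround (my sketch, not from the ticket) is to write the index search as a foldr yourself, so the whole take/iterate pipeline can fuse without relying on Data.OldList's inlining:
{-# LANGUAGE BangPatterns #-}
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)

collatz :: Int -> Int
collatz n
  | even n    = n `quot` 2
  | otherwise = 3 * n + 1

-- the index search written directly as a foldr with an accumulating
-- parameter, making the pipeline a single good consumer
steps'' :: Int -> Int -> Maybe Int
steps'' m n = foldr check (\_ -> Nothing) (take m (iterate collatz n)) 0
  where
    check x rest !k
      | x == 1    = Just k
      | otherwise = rest (k + 1)

main :: IO ()
main = evaluate $ rnf $ map (steps'' 800) [1 .. 10^7]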

Haskell performance—topological sort is not fast enough

I'm a Haskell beginner and have chosen it to solve a programming task for my class, however my solution is too slow and doesn't get accepted. I'm trying to profile it and was hoping that I could get some pointers from more advanced Haskellers here.
The only other solution in my class that got accepted so far was written in Rust. I'm sure that I should be able to achieve similar performance in Haskell and I wrote horrible imperative code in the hope of improving performance, alas to no avail.
My first suspicion relates to work, where I am using forever to go over the in-degree array until I get an out-of-bounds exception. I was hoping for this to be tail-recursive and to compile to a while (true) style loop.
My second suspicion is that I/O is perhaps slowing things down.
EDIT: The problem likely has to do with my algorithm, because I am not keeping a queue of nodes with indegree 0. Thank you @luqui.
EDIT2: It seems that the real bottleneck was I/O; I fixed that thanks to @Davislor.
The task is based on this: http://www.spoj.com/UKCPLAD/problems/TOPOSORT/ and I am constrained to use only the libraries in the Haskell Platform.
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE LambdaCase #-}
{-# OPTIONS_GHC -O3 #-}

import Control.Monad
import Data.Array.IO
import Data.IORef
import Data.Int
import Control.Exception

type List = []
type Node = Int32
type Edge = (Node, Node)
type Indegree = Int32

main = do
    (numNodes, _) <- readPair <$> getLine
    edges <- map readPair . lines <$> getContents
    topo numNodes edges

-- lower bound
{-# INLINE lb #-}
lb = 1

topo :: Node -> List Edge -> IO ()
topo numNodes edges = do
    result     <- newIORef []
    count      <- newIORef 0
    indegrees  <- newArray (lb,numNodes) 0  :: IO (IOUArray Node Indegree)
    neighbours <- newArray (lb,numNodes) [] :: IO (IOArray Node (List Node))
    forM_ edges $ \(from,to) -> do
        update indegrees to (+1)
        update neighbours from (to:)
    let work = forever $ do
            z <- getNext indegrees
            modifyIORef' result (z:)
            modifyIORef' count (+1)
            ns <- readArray neighbours z
            forM_ ns $ \n -> update indegrees n pred
    work `catch`
        \(_ :: SomeException) -> do
            count <- readIORef count
            if numNodes == count
                then (mapM_ (\n -> putStr (show n ++ " ")) . reverse) =<< readIORef result
                else putStrLn "Sandro fails."

{-# INLINE update #-}
update a i f = do
    x <- readArray a i
    writeArray a i (f x)

{-# INLINE getNext #-}
getNext indegrees = getNext' indegrees =<< getBounds indegrees

{-# INLINE getNext' #-}
getNext' indegrees (lb,ub) = readArray indegrees lb >>= \case
    0 -> writeArray indegrees lb (-1) >> return lb
    _ -> getNext' indegrees (lb+1,ub)

readPair :: String -> (Node,Node)
{-# INLINE readPair #-}
readPair = toPair . map read . words
  where toPair [x,y] = (x,y)
        toPair _     = error "Only two entries per line allowed"
Example output
$ ./topo
8 9
1 4
1 2
4 2
4 3
3 2
5 2
3 5
8 2
8 6
^D
1 4 3 5 7 8 2 6
If you haven’t already, profile your program by compiling with -prof -fprof-auto and then executing with the command-line options +RTS -p. This will generate a profile *.prof that will tell you which functions the program is spending all its time in. However, I can see immediately where the biggest time-waster is. Your instincts were right: it’s the I/O.
Having done that a lot, I can guarantee you that you’ll find that it’s spending the vast majority of its time doing I/O. The first thing you should always do to speed up your program is rewrite it to use fast I/O. Haskell is a fast language, when you use the right data structures. The default I/O library in the Prelude uses singly-linked lists with lazily-evaluated thunks where each node holds a single Unicode character. That would be slow in C, too!
I’ve gotten the best results with Data.ByteString.Lazy.Char8 when the input is ASCII, and Data.ByteString.Builder to generate the output. (An alternative is Data.Text.) That gets you a lazily-evaluated list of strict character buffers on input (so interactive input and output still works), and fills a single buffer on output.
After you’ve written the skeleton of the program with fast I/O, the next step is to look at your algorithm, and especially your data structures. Use profiling to see where all the time goes. But I’d recommend you use a functional algorithm rather than trying to write imperative programs in Haskell with do.
I almost always approach problems like this in Haskell with a more functional style: in particular, my main function is almost always something similar to:
import qualified Data.ByteString.Lazy.Char8 as B8

main :: IO ()
main = B8.interact ( output . compute . input )
This makes everything except the call to interact a pure function, and isolates the parsing code and the formatting code so the compute part in the middle can be independent of that.
Since this is an assignment and you want to solve the problem yourself, I’ll refrain from refactoring the program for you, but here’s an example I wrote in response to a question on another forum to perform a counting sort. It should be suitable as a skeleton for other kinds of problems.
import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))

main :: IO ()
main = B8.interact ( output . compute . input ) where
    input :: B8.ByteString -> [Int]
    input = map perLine . tail . B8.lines where
        perLine = decode . B8.readInt
        decode (Just (x, _)) = x
        decode Nothing       = error "Invalid input: expected integer."

    compute :: [Int] -> [Int]
    compute = concatMap expand . assocs . countingSort . map encode where
        encode i = (i, 1)
        countingSort :: [(Int, Int)] -> UArray Int Int
        countingSort = accumArray (+) 0 (lower, upper)
        lower = 0
        upper = 1000000
        expand (i,c) = replicate c i

    output :: [Int] -> B8.ByteString
    output = toLazyByteString . foldMap perCase where
        perCase :: Int -> Builder
        perCase x = intDec x <> char7 '\n'
At the time of writing, this version ran in less than half the time of anyone else's Haskell solution for the same problem; the same holds true for the actual contest problems I've used it for, and the approach generalizes.
So I suggest changing the I/O to be similar to that, first, then profiling, and coming back with the profiling output if that doesn’t make enough of a difference. This might also be a good Code Review question.
Thanks to @Davislor's suggestions I managed to make it much faster, and I also refactored the code for the better; now I actually have an m log(n) algorithm. Surprisingly, this doesn't make that much of a difference: the I/O far outweighed the suboptimal complexity of the algorithm.
EDIT: got rid of unsafePerformIO and it actually runs a teeny-weeny bit faster. Plus adding -XStrict shaves off even more time.
{-# LANGUAGE Strict #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -O2 #-}

import Control.Monad
import Data.Array.IO
import Data.Int
import Data.Set (Set)
import qualified Data.Set as Set
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))

type List = []
type Node = Int
type Edge = (Node, Node)
type Indegree = Int

main = B8.putStrLn =<< topo . map readPair . B8.lines =<< B8.getContents

readPair :: B8.ByteString -> (Node,Node)
readPair str = (x,y)
  where
    (Just (x, str')) = B8.readInt str
    (Just (y, _   )) = B8.readInt (B8.tail str')

topo :: List Edge -> IO B8.ByteString
topo inp = do
    let (numNodes, _) = head inp
        edges         = tail inp
    indegrees  <- newArray (1,numNodes) 0  :: IO (IOUArray Node Indegree)
    neighbours <- newArray (1,numNodes) [] :: IO (IOArray Node (List Node))
    -- setup
    forM_ edges $ \(from,to) -> do
        update indegrees to (+1)
        update neighbours from (to:)
    zeroes <- collectIndegreeZero [] indegrees =<< getBounds indegrees
    processQueue (Set.fromList zeroes) [] numNodes indegrees neighbours
  where
    collectIndegreeZero acc indegrees (lb,ub)
        | lb > ub = return acc
        | otherwise = do
            indegr <- readArray indegrees lb
            let acc' = if indegr == 0 then (lb:acc) else acc
            collectIndegreeZero acc' indegrees (lb+1,ub)

    processQueue queue result numNodes indegrees neighbours = do
        if null queue
            then if numNodes == 0
                then return . toLazyByteString . foldMap whitespace . reverse $ result
                else return "Sandro fails."
            else do
                (node,queue) <- return $ Set.deleteFindMin queue
                ns <- readArray neighbours node
                queue <- foldM decrIndegrees queue ns
                processQueue queue (node:result) (numNodes-1) indegrees neighbours
      where
        decrIndegrees :: Set Node -> Node -> IO (Set Node)
        decrIndegrees q n = do
            i <- readArray indegrees n
            writeArray indegrees n (i-1)
            return $ if i == 1 then Set.insert n q else q

        whitespace x = intDec x <> char7 ' '

{-# INLINE update #-}
update a i f = do
    x <- readArray a i
    writeArray a i (f x)

How can I improve performance of this minmax implementation?

Here is my current code. It's a naive implementation.
import System.Environment (getArgs)

minmax [] = Nothing
minmax [x] = Just (x, x)
minmax (a:b:xs) = Just $ minmax' xs $ sort a b
  where minmax' [] lohi = lohi
        minmax' [x] lohi@(lo, hi)
          | x < lo    = (x, hi)
          | x > hi    = (lo, x)
          | otherwise = lohi
        minmax' (x:y:xs) (lo, hi) = minmax' xs $ newlo `seq` newhi `seq` (newlo, newhi)
          where (lo', hi') = sort x y
                newlo = if lo' < lo then lo' else lo
                newhi = if hi' > hi then hi' else hi
        sort x y = if x < y then (x, y) else (y, x)

main = do
    args <- getArgs
    let [from, till] = map read args :: [Int]
    print $ minmax [from..till]
I want to see how far I can go with Haskell compared to C++ and Rust, using three approaches in all languages: naive, naive parallel, and matrix-based (repa for Haskell and Eigen for C++).
From this profiler report I see that Haskell allocates a lot of memory.
minmax +RTS -p -RTS 1 10000000

total time  =        0.13 secs  (132 ticks @ 1000 us, 1 processor)
total alloc = 800,062,584 bytes (excludes profiling overheads)
I've compiled with ghc -O2 -fllvm (GHC 8.2.1). From my point of view, the naive implementation shouldn't allocate much memory, because it should just iterate over the list in pairs. How do I achieve that, and is there anything else I can do to improve performance?
I wanted to try the stream-fusion package, but it doesn't support base 4.10. Is there a workaround for this? I couldn't find anything about this problem.
Finding the minimum and maximum can be done in linear time and constant space as follows:
sort :: (Int, Int) -> Int -> (Int, Int)
sort (min, max) val
  | val < min = (val, max)
  | val > max = (min, val)
  | otherwise = (min, max)

minmax :: [Int] -> Maybe (Int, Int)
minmax [] = Nothing
minmax (x:xs) = Just $ foldl sort (x, x) xs
Haskell's strictness analyzer should figure out that this code must be strict and optimize it.
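If you would rather not rely on the strictness analyser, the same fold can be pinned down explicitly with foldl' and bang patterns; a minimal sketch of that variant:
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

minmax :: [Int] -> Maybe (Int, Int)
minmax []     = Nothing
minmax (x:xs) = Just (foldl' step (x, x) xs)
  where
    -- the bangs keep both bounds evaluated at every step, independent of
    -- what the strictness analyser decides
    step (!lo, !hi) v = (min lo v, max hi v)

main :: IO ()
main = print (minmax [1 .. 10000000])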

How to optimize this Haskell code summing up the primes in sublinear time?

Problem 10 from Project Euler is to find the sum of all the primes below given n.
I solved it simply by summing up the primes generated by the sieve of Eratosthenes. Then I came across a much more efficient solution by Lucy_Hedgehog (sub-linear!).
For n = 2⋅10^9:
Python code (from the quote above) runs in 1.2 seconds in Python 2.7.3.
C++ code (mine) runs in about 0.3 seconds (compiled with g++ 4.8.4).
I re-implemented the same algorithm in Haskell, since I'm learning it:
import Data.List
import Data.Map (Map, (!))
import qualified Data.Map as Map

problem10 :: Integer -> Integer
problem10 n = (sieve (Map.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
  where vs = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]
        r  = floor (sqrt (fromIntegral n))

sieve :: Map Integer Integer -> Integer -> Integer -> [Integer] -> Map Integer Integer
sieve m p r vs | p > r     = m
               | otherwise = sieve (if m ! p > m ! (p - 1) then update m vs p else m) (p + 1) r vs

update :: Map Integer Integer -> [Integer] -> Integer -> Map Integer Integer
update m vs p = foldl' decrease m (map (\v -> (v, sumOfSieved m v p)) (takeWhile (>= p*p) vs))

decrease :: Map Integer Integer -> (Integer, Integer) -> Map Integer Integer
decrease m (k, v) = Map.insertWith (flip (-)) k v m

sumOfSieved :: Map Integer Integer -> Integer -> Integer -> Integer
sumOfSieved m v p = p * (m ! (v `div` p) - m ! (p - 1))

main = print $ problem10 $ 2*10^9
I compiled it with ghc -O2 10.hs and run with time ./10.
It gives the correct answer, but takes about 7 seconds.
I compiled it with ghc -prof -fprof-auto -rtsopts 10 and run with ./10 +RTS -p -h.
10.prof shows that decrease takes 52.2% time and 67.5% allocations.
After running hp2ps 10.hp, the heap profile again shows decrease taking most of the heap. GHC version 7.6.3.
How would you optimize run time of this Haskell code?
Update 13.06.17:
I tried replacing immutable Data.Map with mutable Data.HashTable.IO.BasicHashTable from the hashtables package, but I'm probably doing something bad, since for tiny n = 30 it already takes too long, about 10 seconds. What's wrong?
Update 18.06.17:
For those curious about the HashTable performance issues, this is a good read. I took Shersh's code using mutable Data.HashTable.ST.Linear, but dropped Data.Judy in instead. It runs in 1.1 seconds, still relatively slow.
I've done some small improvements so it runs in 3.4-3.5 seconds on my machine.
Using IntMap.Strict helped a lot. Other than that I just manually performed some GHC optimizations, just to be sure, and made the Haskell code closer to the Python code from your link. As a next step you could try to use some mutable HashMap. But I'm not sure... IntMap can't be much faster than some mutable container, because it's an immutable one. Though I'm still surprised about its efficiency. I hope this can be implemented faster.
Here is the code:
import Data.List (foldl')
import Data.IntMap.Strict (IntMap, (!))
import qualified Data.IntMap.Strict as IntMap

p :: Int -> Int
p n = (sieve (IntMap.fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]) 2 r vs) ! n
  where vs = [n `div` i | i <- [1..r]] ++ [n', n' - 1 .. 1]
        r  = floor (sqrt (fromIntegral n) :: Double)
        n' = n `div` r - 1

sieve :: IntMap Int -> Int -> Int -> [Int] -> IntMap Int
sieve m' p' r vs = go m' p'
  where
    go m p | p > r               = m
           | m ! p > m ! (p - 1) = go (update m vs p) (p + 1)
           | otherwise           = go m (p + 1)

update :: IntMap Int -> [Int] -> Int -> IntMap Int
update s vs p = foldl' decrease s (takeWhile (>= p2) vs)
  where
    sp = s ! (p - 1)
    p2 = p * p
    sumOfSieved v = p * (s ! (v `div` p) - sp)
    decrease m v = IntMap.adjust (subtract $ sumOfSieved v) v m

main :: IO ()
main = print $ p $ 2*10^(9 :: Int)
UPDATE:
Using mutable hashtables I've managed to get performance to ~5.5 sec in Haskell with this implementation.
Also, I used unboxed vectors instead of lists in several places. Linear hashing seems to be the fastest. I think this can be done even faster. I noticed an sse42 option in the hashtables package; I'm not sure I've set it correctly, but even without it, it runs that fast.
UPDATE 2 (19.06.2017)
I've managed to make it 3x faster than the best solution from @Krom (using my code + his map) by dropping the Judy hashmap altogether. Instead, plain arrays are used. You can come up with the same idea if you notice that the keys for the S hashmap are either a sequence from 1 to n', or n `div` i for i from 1 to r. So we can represent such a hash map as two arrays, choosing which array to look in based on the key.
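That two-array representation might be sketched like this (my illustration; the names and field layout are made up, not taken from the linked code):
import Data.Array.IO (IOUArray, readArray)

-- keys 1 .. n' live in the first array at their own index; keys of the form
-- n `div` i live in the second array at index i
data SparseMap = SparseMap
    { smallVals :: IOUArray Int Int  -- value for key k at index k (k <= n')
    , largeVals :: IOUArray Int Int  -- value for key (n `div` i) at index i
    , total     :: !Int              -- n
    , cutoff    :: !Int              -- n' = n `div` r - 1
    }

lookupSM :: SparseMap -> Int -> IO Int
lookupSM m k
    | k <= cutoff m = readArray (smallVals m) k
    | otherwise     = readArray (largeVals m) (total m `div` k)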
My code + Judy HashMap
$ time ./judy
95673602693282040
real 0m0.590s
user 0m0.588s
sys 0m0.000s
My code + my sparse map
$ time ./sparse
95673602693282040
real 0m0.203s
user 0m0.196s
sys 0m0.004s
This can be done even faster if, instead of IOUArray, already-generated vectors from the vector library are used and readArray is replaced by unsafeRead. But I don't think this should be done unless you're really interested in optimizing it as much as possible.
Comparison with this solution is cheating and is not fair. I expect the same ideas implemented in Python and C++ will be even faster. But @Krom's solution with a closed hashmap is already cheating, because it uses a custom data structure instead of a standard one. At least you can see that the standard and most popular hash maps in Haskell are not that fast. Using better algorithms and better ad-hoc data structures can be better for such problems.
Here's resulting code.
First, as a baseline, the timings of the existing approaches on my machine:
Original program posted in the question:
time stack exec primorig
95673602693282040
real 0m4.601s
user 0m4.387s
sys 0m0.251s
Second the version using Data.IntMap.Strict from
here
time stack exec primIntMapStrict
95673602693282040
real 0m2.775s
user 0m2.753s
sys 0m0.052s
Shersh's code with Data.Judy dropped in (here)
time stack exec prim-hash2
95673602693282040
real 0m0.945s
user 0m0.955s
sys 0m0.028s
Your python solution.
I compiled it with
python -O -m py_compile problem10.py
and the timing:
time python __pycache__/problem10.cpython-36.opt-1.pyc
95673602693282040
real 0m1.163s
user 0m1.160s
sys 0m0.003s
Your C++ version:
$ g++ -O2 --std=c++11 p10.cpp -o p10
$ time ./p10
sum(2000000000) = 95673602693282040
real 0m0.314s
user 0m0.310s
sys 0m0.003s
I didn't bother to provide a baseline for slow.hs, as I didn't
want to wait for it to complete when run with an argument of
2*10^9.
Subsecond performance
The following program runs in under a second on my machine.
It uses a hand rolled hashmap, which uses closed hashing with
linear probing and uses some variant of knuths hashfunction,
see here.
Certainly it is somewhat tailored to the case, as the lookup
function for example expects the searched keys to be present.
Timings:
time stack exec prim
95673602693282040
real 0m0.725s
user 0m0.714s
sys 0m0.047s
First I implemented my hand rolled hashmap simply to hash
the keys with
key `mod` size
and selected a size multiple times higher than the expected
input, but the program took 22s or more to complete.
Finally it was a matter of choosing a hash function which was
good for the workload.
Here is the program:
import Data.Maybe
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead)

type Number = Int

data Map = Map { keys   :: IOUArray Int Number
               , values :: IOUArray Int Number
               , size   :: !Int
               , factor :: !Int
               }

newMap :: Int -> Int -> IO Map
newMap s f = do
    k <- newArray (0, s-1) 0
    v <- newArray (0, s-1) 0
    return $ Map k v s f

storeKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
storeKey arr s f key = go ((key * f) `mod` s)
  where
    go :: Int -> IO Int
    go ind = do
        v <- readArray arr ind
        go2 v ind
    go2 v ind
        | v == 0    = do { writeArray arr ind key; return ind; }
        | v == key  = return ind
        | otherwise = go ((ind + 1) `mod` s)

loadKey :: IOUArray Int Number -> Int -> Int -> Number -> IO Int
loadKey arr s f key = s `seq` key `seq` go ((key * f) `mod` s)
  where
    go :: Int -> IO Int
    go ix = do
        v <- unsafeRead arr ix
        if v == key then return ix else go ((ix + 1) `mod` s)

insertIntoMap :: Map -> (Number, Number) -> IO Map
insertIntoMap m@(Map ks vs s f) (k, v) = do
    ix <- storeKey ks s f k
    writeArray vs ix v
    return m

fromList :: Int -> Int -> [(Number, Number)] -> IO Map
fromList s f xs = do
    m <- newMap s f
    foldM insertIntoMap m xs

(!) :: Map -> Number -> IO Number
(!) (Map ks vs s f) k = do
    ix <- loadKey ks s f k
    readArray vs ix

mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate (Map ks vs s fac) i f = do
    ix <- loadKey ks s fac i
    old <- readArray vs ix
    let x' = f old
    x' `seq` writeArray vs ix x'

r' :: Number -> Number
r' = floor . sqrt . fromIntegral

vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]

vss' n r = r + n `div` r - 1

list' :: Int -> Int -> [Number] -> IO Map
list' s f vs = fromList s f [(i, i * (i + 1) `div` 2 - 1) | i <- vs]

problem10 :: Number -> IO Number
problem10 n = do
    m  <- list' (19*vss) (19*vss+7) vs
    nm <- sieve m 2 r vs
    nm ! n
  where vs  = vs' n r
        vss = vss' n r
        r   = r' n

sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs
    | p > r = return m
    | otherwise = do
        v1 <- m ! p
        v2 <- m ! (p - 1)
        nm <- if v1 > v2 then update m vs p else return m
        sieve nm (p + 1) r vs

update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs

decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
    v <- sumOfSieved m k p
    mupdate m k (subtract v)
    return m

sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
    v1 <- m ! (v `div` p)
    v2 <- m ! (p - 1)
    return $ p * (v1 - v2)

main = do { n <- problem10 (2*10^9); print n; }  -- 2*10^9
I am not a professional with hashing and that sort of stuff, so this can certainly be improved a lot. Maybe we Haskellers should improve the off-the-shelf hash maps or provide some simpler ones.
My hashmap, Shersh's code
If I plug my hashmap into Shersh's code (see answer below; code here), we are even down to
time stack exec prim-hash2
95673602693282040
real 0m0.601s
user 0m0.604s
sys 0m0.034s
Why is slow.hs slow?
If you read through the source
for the function insert in Data.HashTable.ST.Basic, you
will see that it deletes the old key value pair and inserts
a new one. It doesn't look up the "place" for the value and
mutate it, as one might imagine, if one reads that it is
a "mutable" hashtable. Here the hashtable itself is mutable,
so you don't need to copy the whole hashtable for insertion
of a new key value pair, but the value places for the pairs
are not. I don't know if that is the whole story of slow.hs
being slow, but my guess is, it is a pretty big part of it.
A few minor improvements
So that's the idea I followed while trying to improve
your program the first time.
See, you don't need a mutable mapping from keys to values.
Your key set is fixed. You want a mapping from keys to mutable
places. (Which is, by the way, what you get from C++ by default.)
And so I tried to come up with that. I used IntMap IORef from
Data.IntMap.Strict and Data.IORef first and got a timing
of
time stack exec prim
95673602693282040
real 0m2.134s
user 0m2.141s
sys 0m0.028s
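That first variant might have looked roughly like this (a sketch of the idea, not the exact code):
import qualified Data.IntMap.Strict as M
import Data.IORef

-- an immutable map from keys to mutable cells: the map is built once, only
-- the cell contents change
type RefMap = M.IntMap (IORef Int)

fromListRef :: [(Int, Int)] -> IO RefMap
fromListRef xs = M.fromList <$> traverse (traverse newIORef) xs

readKey :: RefMap -> Int -> IO Int
readKey m k = readIORef (m M.! k)

updateKey :: RefMap -> Int -> (Int -> Int) -> IO ()
updateKey m k f = modifyIORef' (m M.! k) f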
I thought maybe it would help to work with unboxed values
and to get that, I used IOUArray Int Int with 1 element
each instead of IORef and got those timings:
time stack exec prim
95673602693282040
real 0m2.015s
user 0m2.018s
sys 0m0.038s
Not much of a difference and so I tried to get rid of bounds
checking in the 1 element arrays by using unsafeRead and
unsafeWrite and got a timing of
time stack exec prim
95673602693282040
real 0m1.845s
user 0m1.850s
sys 0m0.030s
which was the best I got using Data.IntMap.Strict.
Of course I ran each program multiple times to see if
the times are stable and the differences in run time aren't
just noise.
It looks like these are all just micro-optimizations.
And here is the program that ran fastest for me without using a hand rolled data structure:
import qualified Data.IntMap.Strict as M
import Control.Monad
import Data.Array.IO
import Data.Array.Base (unsafeRead, unsafeWrite)

type Number = Int
type Place  = IOUArray Number Number
type Map    = M.IntMap Place

tupleToRef :: (Number, Number) -> IO (Number, Place)
tupleToRef = traverse (newArray (0,0))

insertRefs :: [(Number, Number)] -> IO [(Number, Place)]
insertRefs = traverse tupleToRef

fromList :: [(Number, Number)] -> IO Map
fromList xs = M.fromList <$> insertRefs xs

(!) :: Map -> Number -> IO Number
(!) m i = unsafeRead (m M.! i) 0

mupdate :: Map -> Number -> (Number -> Number) -> IO ()
mupdate m i f = do
    let place = m M.! i
    old <- unsafeRead place 0
    let x' = f old
    -- make the application of f strict
    x' `seq` unsafeWrite place 0 x'

r' :: Number -> Number
r' = floor . sqrt . fromIntegral

vs' :: Integral a => a -> a -> [a]
vs' n r = [n `div` i | i <- [1..r]] ++ reverse [1..n `div` r - 1]

list' :: [Number] -> IO Map
list' vs = fromList [(i, i * (i + 1) `div` 2 - 1) | i <- vs]

problem10 :: Number -> IO Number
problem10 n = do
    m  <- list' vs
    nm <- sieve m 2 r vs
    nm ! n
  where vs = vs' n r
        r  = r' n

sieve :: Map -> Number -> Number -> [Number] -> IO Map
sieve m p r vs
    | p > r = return m
    | otherwise = do
        v1 <- m ! p
        v2 <- m ! (p - 1)
        nm <- if v1 > v2 then update m vs p else return m
        sieve nm (p + 1) r vs

update :: Map -> [Number] -> Number -> IO Map
update m vs p = foldM (decrease p) m $ takeWhile (>= p*p) vs

decrease :: Number -> Map -> Number -> IO Map
decrease p m k = do
    v <- sumOfSieved m k p
    mupdate m k (subtract v)
    return m

sumOfSieved :: Map -> Number -> Number -> IO Number
sumOfSieved m v p = do
    v1 <- m ! (v `div` p)
    v2 <- m ! (p - 1)
    return $ p * (v1 - v2)

main = do { n <- problem10 (2*10^9); print n; }  -- 2*10^9
If you profile that, you see that it spends most of the time in the custom lookup function (!); I don't know how to improve that further. Trying to inline (!) with {-# INLINE (!) #-} didn't yield better results; maybe GHC already did this.
This code of mine evaluates the sum to 2⋅10^9 in 0.3 seconds and the sum to 10^12 (18435588552550705911377) in 19.6 seconds (if given sufficient RAM).
import Control.DeepSeq
import qualified Control.Monad as ControlMonad
import qualified Data.Array as Array
import qualified Data.Array.ST as ArrayST
import qualified Data.Array.Base as ArrayBase

-- integerSquareRoot and primes are assumed to be in scope here,
-- e.g. from the arithmoi package

primeLucy :: (Integer -> Integer) -> (Integer -> Integer) -> Integer -> (Integer -> Integer)
primeLucy f sf n = g
  where
    r  = fromIntegral $ integerSquareRoot n
    ni = fromIntegral n
    loop from to c = let go i = ControlMonad.when (to <= i) (c i >> go (i-1)) in go from
    k = ArrayST.runSTArray $ do
        k <- ArrayST.newListArray (-r, r) $ force $
                 [sf (div n (toInteger i)) - sf 1 | i <- [r,r-1..1]] ++
                 [0] ++
                 [sf (toInteger i) - sf 1 | i <- [1..r]]
        ControlMonad.forM_ (takeWhile (<= r) primes) $ \p -> do
            l <- ArrayST.readArray k (p-1)
            let q = force $ f (toInteger p)
            let adjust = \i j -> do { v <- ArrayBase.unsafeRead k (i+r); w <- ArrayBase.unsafeRead k (j+r); ArrayBase.unsafeWrite k (i+r) $!! v+q*(l-w) }
            loop (-1) (-div r p) $ \i -> adjust i (i*p)
            loop (-div r p - 1) (-min r (div ni (p*p))) $ \i -> adjust i (div (-ni) (i*p))
            loop r (p*p) $ \i -> adjust i (div i p)
        return k
    g :: Integer -> Integer
    g m
        | m >= 1 && m <= integerSquareRoot n = k Array.! (fromIntegral m)
        | m >= integerSquareRoot n && m <= n && div n (div n m) == m = k Array.! (fromIntegral (negate (div n m)))
        | otherwise = error $ "Function not precalculated for value " ++ show m

primeSum :: Integer -> Integer
primeSum n = (primeLucy id (\m -> div (m*m+m) 2) n) n
If your integerSquareRoot function is buggy (as reportedly some are), you can replace it here with floor . sqrt . fromIntegral.
Explanation:
As the name suggests it is based upon a generalization of the famous method by "Lucy Hedgehog" eventually discovered by the original poster.
It allows you to calculate many sums of the form ∑ f(p), taken over the primes p ≤ N, without enumerating all the primes up to N, in time O(N^0.75).
Its inputs are the function f (i.e., id if you want the prime sum), its summatory function over all the integers (in that case the sum of the first m integers, i.e. \m -> div (m*m+m) 2), and N.
primeLucy returns a lookup function m ↦ ∑ f(p) over the primes p ≤ m, restricted to certain values of m: those with m ≤ √N and those of the form N `div` i (as checked by g above).
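As a usage sketch (assuming the primeLucy above is in scope), counting primes instead of summing them only changes f and its summatory function:
-- with f = const 1, the summatory function over 1..m is just m, i.e. id,
-- so the same routine counts the primes up to n instead of summing them
primeCount :: Integer -> Integer
primeCount n = primeLucy (const 1) id n n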
Try this and let me know how fast it is:
-- sum of primes
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed

sieve :: Int -> UArray Int Bool
sieve n = runSTUArray $ do
    let m = (n-1) `div` 2
        r = floor . sqrt $ fromIntegral n
    bits <- newArray (0, m-1) True
    forM_ [0 .. r `div` 2 - 1] $ \i -> do
        isPrime <- readArray bits i
        when isPrime $ do
            let a = 2*i*i + 6*i + 3
                b = 2*i*i + 8*i + 6
            forM_ [a, b .. (m-1)] $ \j -> do
                writeArray bits j False
    return bits

primes :: Int -> [Int]
primes n = 2 : [2*i+3 | (i, True) <- assocs $ sieve n]

main = do
    print $ sum $ primes 1000000
You can run it on ideone. My algorithm is the Sieve of Eratosthenes, and it should be quite fast for small n. For n = 2,000,000,000, the array size may be a problem, in which case you will need to use a segmented sieve. See my blog for more information about the Sieve of Eratosthenes. See this answer for information about a segmented sieve (but not in Haskell, unfortunately).

sorting integers fast in haskell

Is there any function in Haskell libraries that sorts integers in O(n) time? [By O(n) I mean faster than comparison sort and specific to integers.]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)

genlist gen = id $!! sort $!! take (2^22) ((randoms gen) :: [Int])

main = do
    gen <- newStdGen
    putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq, but what I am actually trying to do does; regardless, the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1 : memory is a bigger issue than time for me here as for the task at hand I am getting out of memory errors (memory usage becomes 3GB! after 30 minutes of run-time)
I am assuming faster algorithms will provide a better memory footprint too, hence I'm looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution Used : IntroSort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I

sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R

sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause exceeding memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that inplace (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST

myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
    | lo < hi = do
        let lscan p h i
                | i < h = do
                    v <- unsafeRead a i
                    if p < v then return i else lscan p h (i+1)
                | otherwise = return i
            rscan p l i
                | l < i = do
                    v <- unsafeRead a i
                    if v < p then return i else rscan p l (i-1)
                | otherwise = return i
            swap i j = do
                v <- unsafeRead a i
                unsafeRead a j >>= unsafeWrite a i
                unsafeWrite a j v
            sloop p l h
                | l < h = do
                    l1 <- lscan p h l
                    h1 <- rscan p l1 h
                    if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
                | otherwise = return l
        piv <- unsafeRead a hi
        i <- sloop piv lo hi
        swap i hi
        myqsort a lo (i-1)
        myqsort a (i+1) hi
    | otherwise = return ()

genlist gen = runST $ do
    arr <- newListArray (0, 2^22-1) $ take (2^22) (randoms gen)
    myqsort arr 0 (2^22-1)
    let collect acc 0 = do
            v <- unsafeRead arr 0
            return (v:acc)
        collect acc i = do
            v <- unsafeRead arr i
            collect (v:acc) (i-1)
    collect [] (2^22-1)

main = do
    gen <- newStdGen
    putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32MB of storage raw (with 64-bit Ints); with the list overhead of (iirc) five words per element, that adds up to ~200MB, but less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array (Array, accumArray, assocs)

sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
  where count :: Array Int Int
        count = accumArray (+) 0 range (zip xs (repeat 1))
        range = (0, maximum xs)
It works by creating an Array indexed by integers where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating them the same number of times they occurred in the original list according to the counts.
You should note that it is linear with the maximum value in the list, not the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted linearly.
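A small variant (my sketch) handles negative inputs by taking the range from both ends of the data; memory use is still proportional to the spread of the values, so the caveat just mentioned applies unchanged:
import Data.Array (Array, accumArray, assocs)

sortSpread :: [Int] -> [Int]
sortSpread [] = []
sortSpread xs = concat [replicate k x | (x, k) <- assocs count]
  where count :: Array Int Int
        count = accumArray (+) 0 (minimum xs, maximum xs) (zip xs (repeat 1))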
