Reduce allocation when sorting a large list (or vector)

I am trying to reduce GC time in my program. The main suspect is the following piece of code:
Data.Vector.Unboxed.fromList . take n . List.sortBy (flip $ Ord.comparing id)
  $ [ ( sum [ (c + a) * wsum z | (z,c) <- IntMap.toList zt_d ] , d)
    | d <- IntMap.keys $ m
    , let zt_d = IntMap.findWithDefault IntMap.empty d $ m ]
The list being sorted would typically contain several thousand elements. I think the list sort is the culprit, because if I replace take n . List.sortBy (flip $ Ord.comparing id) with return . List.maximum my productivity goes from 60% to 95%.
Is there anything I can do to reduce allocation here?
Update
As recommended, I replaced List.sortBy with an in-place sort from vector-algorithms.
Perhaps I'm doing it wrong, but what I'm seeing is that there is no allocation (productivity 97% as opposed to 63% with lists), but the program is many times slower: it runs in 85 seconds with List.sortBy; with the in-place sort I killed it after waiting 7 minutes. I tried both Intro and Merge sorts. Here is my code:
import qualified Data.Vector.Generic.Mutable as GM
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Algorithms.Merge as Sort
import qualified Data.Vector.Fusion.Stream as Stream
import Control.Monad.ST
sortBy :: (Ord a, U.Unbox a) => (a -> a -> Ordering) -> [a] -> U.Vector a
sortBy cmp xs = runST $ do
  mv <- GM.unstream . Stream.fromList $ xs
  Sort.sortBy cmp mv
  G.unsafeFreeze mv

The sorting does indeed look like it will cause a lot of allocation. As long as the sorting is performed on a list, that cannot be avoided, since sorting lists constructs many intermediate lists. If necessary, you could try to do the sorting on an MVector instead, for example using the vector-algorithms package, which provides efficient sorting algorithms.
However, there are further inefficiencies that cause more allocation than necessary in
Data.Vector.Unboxed.fromList . take n . List.sortBy (flip $ Ord.comparing id)
  $ [ ( sum [ (c + a) * wsum z | (z,c) <- IntMap.toList zt_d ] , d)
    | d <- IntMap.keys $ m
    , let zt_d = IntMap.findWithDefault IntMap.empty d $ m ]
When you write
d <- IntMap.keys m, let zt_d = IntMap.findWithDefault IntMap.empty d m
-- The '$' are unnecessary, I left them out
you are 1) traversing the entire map to collect the list of keys, and 2) then looking up each key on its own. Since you only look up keys that are present in the map, you never use the default. It is much more efficient to create the list of key/value pairs in one traversal of the map:
(d,zt_d) <- IntMap.assocs m
Then if id in flip $ Ord.comparing id is indeed the identity function, that would be more readable (and possibly more efficient) as sortBy (flip compare).
Depending on the type of the summed elements (and possibly the optimisation level), it might be better to use Data.List.foldl' (+) 0 instead of sum.
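Putting these suggestions together, the expression might look like this (a sketch; n, a, wsum, and m are as in the question):
Data.Vector.Unboxed.fromList . take n . List.sortBy (flip compare)
  $ [ ( List.foldl' (+) 0 [ (c + a) * wsum z | (z, c) <- IntMap.toList zt_d ], d)
    | (d, zt_d) <- IntMap.assocs m ]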

Related

Is runInBoundThread the best tool for parallelism?

Say, I want to fold monoids in parallel. My computer has 8 cores. I have this function to split a list into equal-sized smaller lists (with bounded modulo-bias):
import Data.List
parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs =
  let (us, vs) = splitAt (quot (length xs) n) xs
  in us : parallelize (n - 1) vs
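For example, splitting ten elements three ways:
λ> parallelize 3 [1 .. 10]
[[1,2,3],[4,5,6],[7,8,9,10]]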
The first version of parallel fold I made was:
import Control.Concurrent
import Control.Concurrent.QSemN
import Data.Foldable
import Data.IORef
foldP :: Monoid m => [m] -> IO m
foldP xs = do
  result <- newIORef mempty
  sem <- newQSemN 0
  n <- getNumCapabilities
  let yss = parallelize n xs
  for_ yss (\ys -> forkIO (modifyIORef result (fold ys <>) >> signalQSemN sem 1))
  waitQSemN sem n
  readIORef result
But usage of IORefs and semaphores seemed ugly to me. So I made another version:
import Data.Traversable
foldP :: Monoid m => [m] -> IO m
foldP xs = do
  n <- getNumCapabilities
  let yss = parallelize n xs
  rs <- for yss (\ys -> runInUnboundThread (return (fold ys)))
  return (fold rs)
The test code I used is:
import Data.Monoid
import System.CPUTime
main :: IO ()
main = do
  start <- getCPUTime
  Product result <- foldP (fmap Product [1 .. 100])
  end <- getCPUTime
  putStrLn ("Time took: " ++ show (end - start) ++ "ps.")
  putStrLn ("Result: " ++ show result)
The second version of foldP outperformed the first version. When I used runInBoundThread instead of runInUnboundThread, it became even faster.
What accounts for these performance differences?
TL;DR: Use the fold function from the massiv package and you will likely get the most efficient solution available in Haskell.
I would like to start by saying that the first thing people forget when trying to implement concurrent patterns like this is exception handling. In the solution from the question the exception handling is non-existent, so it is totally wrong. Therefore I'd recommend using existing implementations of common concurrency patterns. async is the go-to library for concurrency, but for such a use case it will not be the most efficient solution.
This particular example can easily be solved with scheduler package, in fact it is exactly the kind of stuff it was designed for. Here is how you can use it to achieve folding of monoids:
import Control.Scheduler
import Control.Monad.IO.Unlift
foldP :: (MonadUnliftIO m, Monoid n) => Comp -> [n] -> m n
foldP comp xs = do
  rs <-
    withScheduler comp $ \scheduler ->
      mapM_ (scheduleWork scheduler . pure . fold) (parallelize (numWorkers scheduler) xs)
  pure $ fold rs
See the Comp type for an explanation of the best parallelization strategies. From what I have found in practice, Par will usually work best, because it uses pinned threads created with forkOn.
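For example, a hypothetical use of this foldP from IO (Sum is from Data.Monoid):
import Data.Monoid (Sum (..))

main :: IO ()
main = do
  -- fold 1000 Sum monoids using the Par strategy
  r <- foldP Par (map Sum [1 .. 1000 :: Int])
  print (getSum r)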
Note that the parallelize function is implemented inefficiently and dangerously as well; it is better to write it this way:
parallelize :: Int -> [a] -> [[a]]
parallelize n' xs' = go 0 id xs'
  where
    n = max 1 n'
    -- at least two elements make sense to get benefit of parallel fold
    k = max 2 $ quot (length xs') n
    go i acc xs
      | null xs = acc []
      | i < n =
          case splitAt k xs of
            (ls, rs) -> go (i + 1) (acc . (ls :)) rs
      | otherwise = acc . (xs :) $ []
One more bit of advice: a list is far from an ideal data structure for parallelization, and for efficiency in general. In order to split the list into chunks before parallelizing the computation, you already have to traverse the whole structure in parallelize, which could be avoided if you used an array. What I am getting at is: use an array instead, as suggested at the beginning of this answer.
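For reference, a minimal sketch of the massiv route from the TL;DR. It assumes massiv's fromList and fold with a boxed (B) array, and that fold honors the array's Par strategy; check these against the massiv version you have installed:
{-# LANGUAGE ScopedTypeVariables #-}
import qualified Data.Massiv.Array as A

-- Sketch only: load the monoids into a boxed array with a parallel
-- computation strategy, then fold them monoidally.
foldP :: forall m. Monoid m => [m] -> m
foldP xs = A.fold (A.fromList A.Par xs :: A.Array A.B A.Ix1 m)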

Batching actions for caching and performance while avoiding the dirty work

Say I have two pure but unsafe functions that do the same thing, but one of them works on batches and is asymptotically faster:
f :: Int -> Result -- takes O(1) time
f = unsafePerformIO ...
g :: [Int] -> [Result] -- takes O(log n) time
g = unsafePerformIO ...
A naive implementation:
getUntil :: Int -> [Result]
getUntil 0 = [f 0]
getUntil n = f n : getUntil (n - 1)
switch is the n value where g gets cheaper than f.
getUntil will in practice be called with ever-increasing n, but it might not start at 0. Since the Haskell runtime can memoize getUntil, performance will be optimal if getUntil is called at intervals smaller than switch. But once the interval gets larger, this implementation is slow.
In an imperative program, I guess I would make a TreeMap (which could quickly be checked for gaps) to cache all calls. On a cache miss, it would be filled with the results of g if the gap was longer than switch, and with f otherwise.
How can this be optimized in Haskell?
I think I am just looking for:
an ordered map filled on-demand using a fill function that would fill all values up to the requested index using one function if the missing range is small, another if it is large
a get operation on the map which returns a list of all lower values up to the requested index. This would result in a function similar to getUntil above.
I'll elaborate on my proposal of using map, after some tests I just ran.
import System.IO
import System.IO.Unsafe
import Control.Concurrent
import Control.Monad

switch :: Int
switch = 1000

f :: Int -> Int
f x = unsafePerformIO $ do
  threadDelay $ 500 * x
  putStrLn $ "Calculated from scratch: f(" ++ show x ++ ")"
  return $ 500 * x

g :: Int -> Int
g x = unsafePerformIO $ do
  threadDelay $ x * x `div` 2
  putStrLn $ "Calculated from scratch: g(" ++ show x ++ ")"
  return $ x * x `div` 2

cachedFG :: [Int]
cachedFG = map g [0 .. switch] ++ map f [switch + 1 ..]

main :: IO ()
main = forever $ getLine >>= print . (cachedFG !!) . read
… where f, g and switch have the same meaning indicated in the question.
The above program can be compiled as is using GHC. When executed, positive integers can be entered, followed by a newline, and the application will print a value based on the number entered, plus an extra indication of which values are being calculated from scratch.
A short session with this program is:
User: 10000
Program: Calculated from scratch: f(10000)
Program: 5000000
User: 10001
Program: Calculated from scratch: f(10001)
Program: 5000500
User: 10000
Program: 5000000
^C
The program has to be killed/terminated manually.
Notice that the last value entered doesn't show a "calculated from scratch" message. This indicates that the program has the value cached/memoized somewhere: cachedFG is a top-level list (a CAF), so elements that have been evaluated once are retained. You can try executing this program yourself, but take into account that threadDelay's lag is proportional to the value entered.
The getUntil function then could be implemented using:
getUntil :: Int -> [Int]
getUntil n = take n cachedFG
or:
getUntil :: Int -> [Int]
getUntil = flip take cachedFG
If you don't know the value for switch, you can try evaluating f and g in parallel and use the fastest result, but that's another show.

Sorting integers fast in Haskell

Is there any function in the Haskell libraries that sorts integers in O(n) time? [By O(n) I mean faster than comparison sort and specific to integers.]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
  gen <- newStdGen
  putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq, but what I am actually trying to do does; still, the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1 : memory is a bigger issue than time for me here as for the task at hand I am getting out of memory errors (memory usage becomes 3GB! after 30 minutes of run-time)
I am assuming faster algorithms will have a better memory footprint too, hence I am looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution Used : IntroSort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
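To make sure you are timing only the sort and not the generator, you can force the random list before starting the clock. A rough sketch (using force/evaluate and the time package; the 2^22 size is taken from the question):
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Random (newStdGen, randoms)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R

main :: IO ()
main = do
  gen <- newStdGen
  -- fully evaluate the random list before timing the sort
  xs <- evaluate (force (take (2 ^ 22) (randoms gen) :: [Int]))
  t0 <- getCurrentTime
  s <- evaluate (V.sum (V.modify R.sort (V.fromList xs)))
  t1 <- getCurrentTime
  print s
  print (diffUTCTime t1 t0)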
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause excessive memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that inplace (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
  | lo < hi = do
      let lscan p h i
            | i < h = do
                v <- unsafeRead a i
                if p < v then return i else lscan p h (i + 1)
            | otherwise = return i
          rscan p l i
            | l < i = do
                v <- unsafeRead a i
                if v < p then return i else rscan p l (i - 1)
            | otherwise = return i
          swap i j = do
            v <- unsafeRead a i
            unsafeRead a j >>= unsafeWrite a i
            unsafeWrite a j v
          sloop p l h
            | l < h = do
                l1 <- lscan p h l
                h1 <- rscan p l1 h
                if l1 < h1 then swap l1 h1 >> sloop p l1 h1 else return l1
            | otherwise = return l
      piv <- unsafeRead a hi
      i <- sloop piv lo hi
      swap i hi
      myqsort a lo (i - 1)
      myqsort a (i + 1) hi
  | otherwise = return ()

genlist gen = runST $ do
  arr <- newListArray (0, 2^22 - 1) $ take (2^22) (randoms gen)
  myqsort arr 0 (2^22 - 1)
  let collect acc 0 = do
        v <- unsafeRead arr 0
        return (v : acc)
      collect acc i = do
        v <- unsafeRead arr i
        collect (v : acc) (i - 1)
  collect [] (2^22 - 1)

main = do
  gen <- newStdGen
  putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32MB of storage raw (with 64-bit Ints), and with the list overhead of (iirc) five words per element, that adds up to ~200MB, but less than half of the original.
This is taken from Richard Bird's book Pearls of Functional Algorithm Design (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array (Array, accumArray, assocs)

sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x, k) <- assocs count]
  where
    count :: Array Int Int
    count = accumArray (+) 0 range (zip xs (repeat 1))
    range = (0, maximum xs)
It works by creating an Array indexed by integers, where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating each one as many times as it occurred in the original list, according to the counts.
You should note that it is linear in the maximum value in the list, not in the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted in linear time. Note also that, as written, it assumes the values are non-negative, since the array range starts at 0.
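If your list may contain negative numbers, one possible adaptation (a hypothetical sortSigned; the caveat from the earlier answer about a very wide range applies here too) is to take the array bounds from both the minimum and the maximum:
import Data.Array (Array, accumArray, assocs)

sortSigned :: [Int] -> [Int]
sortSigned [] = []
sortSigned xs = concat [replicate k x | (x, k) <- assocs count]
  where
    count :: Array Int Int
    count = accumArray (+) 0 (minimum xs, maximum xs) (zip xs (repeat 1))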

Most efficient way to create the Data.Set of all pairs of elements in a Set?

Given an arbitrary set holding an arbitrary number of elements of arbitrary type, e.g.
mySet1 = Set.fromList [1,2,3,4]
or
mySet2 = Set.fromList ["a","b","c","d"]
or
mySet3 = Set.fromList [A, B, C, D]
for some data constructors A, B, C, D, ...
What is the computationally most efficient way to generate the set of all unordered pairs of elements in the given set? I.e.
setPairs mySet1 == Set.fromList [(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
or
setPairs mySet2 == fromList [ ("a","b")
                            , ("a","c")
                            , ("a","d")
                            , ("b","c")
                            , ("b","d")
                            , ("c","d") ]
or
setPairs mySet3 == fromList [ (A,B)
                            , (A,C)
                            , (A,D)
                            , (B,C)
                            , (B,D)
                            , (C,D) ]
My initial naive guess would be:
setPairs s = fst $ Set.fold
  (\e (pairAcc, elementsLeft) ->
    ( Set.fold
        (\e2 pairAcc2 ->
          Set.insert (e2, e) pairAcc2
        ) pairAcc $ Set.delete e elementsLeft
    , Set.delete e elementsLeft )
  ) (Set.empty, s) s
but surely that cannot be the best solution?
Benchmarking might prove me wrong, but my suspicion is that there's no win in staying in the set representation. You're going to need O(n^2) work regardless, because that's the size of the output. The key advantage is producing your list in a form that lets you call S.fromDistinctAscList, so that building the set itself only costs O(n).
The following is pretty clean, preserves a fair amount of sharing, and is generally the simplest, most straightforward and intuitive solution I can imagine.
pairs s = S.fromDistinctAscList . concat $ zipWith zip (map (cycle . take 1) ts) (drop 1 ts)
  where ts = tails $ S.toList s
Edit
Shorter/clearer (not sure performance-wise, but probably as good or better):
pairs s = S.fromDistinctAscList [(x,y) | (x:xt) <- tails (S.toList s), y <- xt]
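For example:
λ> pairs (S.fromList [1,2,3,4])
fromList [(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]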
First, you need to generate all the combinations; replicateM from Control.Monad helps with this.
λ> replicateM 2 [1..4]
[[1,1],[1,2],[1,3],[1,4],[2,1],[2,2],[2,3],[2,4],[3,1],[3,2],[3,3],[3,4],[4,1],[4,2],[4,3],[4,4]]
Then you need to filter the pairs, keeping those where the second element is greater than the first:
λ> filter (\[x,y] -> x < y) $ replicateM 2 [1 .. 4]
[[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
Finally, you need to convert each list into a tuple:
λ> map (\[x,y] -> (x,y)) $ filter (\[x,y] -> x < y) $ replicateM 2 [1 .. 4]
[(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
Then we can formulate it into function pairs:
import Data.Set
import Control.Monad
import Data.List
mySet = Data.Set.fromList [1,2,3,4]
--setOfPairs = Data.Set.fromList [(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
setOfPairs = Data.Set.fromList $ pairs mySet
pairs :: Ord a => Set a -> [(a,a)]
pairs x = Data.List.map (\[x,y] -> (x,y)) $ Data.List.filter (\[x,y] -> x < y) $ replicateM 2 $ toList x
So, if I understood your question correctly, you can use pairs mySet, where pairs generates the list of all unordered pairs of mySet.
Is it what you want?
UPD:
A list comprehension can be a clearer and faster technique for creating such sublists, so here is another version of pairs:
pairs :: Ord a => Set a -> [(a,a)]
pairs set = [(x,y) | let list = toList set, x <- list, y <- list, x < y]
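For example:
λ> pairs (Data.Set.fromList [1,2,3,4])
[(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]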
So here is a first stab at a solution using conversion back and forth to a list. Again, I am not sure this is the fastest way to do this, but I do know that iteration over sets is not terribly efficient.
import Data.List
import qualified Data.Set as S
pairs :: S.Set String -> S.Set (String, String)
pairs s = S.fromList $ foldl' (\st e -> zip l e ++ st) [] ls
  where (l:ls) = tails $ S.toList s
By folding zip over the tails, you get a nice and efficient way to create the set of unordered pairs. However, instinct tells me there may be a monadic filterM or foldM solution that's even more elegant. I will keep thinking.
[EDIT]
So here is what should be [but is not, on account of the size of the powerset] a faster solution that does not require a toList:
import Data.List
import qualified Data.Set as S
import qualified Data.Foldable as F
pairs :: (Ord a) => S.Set a -> S.Set (a, a)
pairs s = S.fromList $ foldl two [] $ F.foldlM (\st e -> [[e] ++ st, st]) [] s
  where two st (x:xa:[]) = (x, xa) : st
        two st _ = st
This uses the powerset solution over monadic lists to build the powerset, and then filters out the pairs. I can go into more detail if necessary.

Haskell mutable map/tree

I am looking for a mutable (balanced) tree/map/hash table in Haskell or a way how to simulate it inside a function. I.e. when I call the same function several times, the structure is preserved. So far I have tried Data.HashTable (which is OK, but somewhat slow) and tried Data.Array.Judy but I was unable to make it work with GHC 6.10.4. Are there any other options?
If you want mutable state, you can have it. Just keep passing the updated map around, or keep it in a state monad (which turns out to be the same thing).
import qualified Data.Map as Map
import Control.Monad.ST
import Data.STRef
memoize :: Ord k => (k -> ST s a) -> ST s (k -> ST s a)
memoize f = do
  mc <- newSTRef Map.empty
  return $ \k -> do
    c <- readSTRef mc
    case Map.lookup k c of
      Just a -> return a
      Nothing -> do
        a <- f k
        writeSTRef mc (Map.insert k a c)
        return a
You can use this like so. (In practice, you might want to add a way to clear items from the cache, too.)
import Control.Monad
main :: IO ()
main = do
  fib <- stToIO $ fixST $ \fib -> memoize $ \n ->
    if n < 2 then return n else liftM2 (+) (fib (n - 1)) (fib (n - 2))
  mapM_ (print <=< stToIO . fib) [1..10000]
At your own risk, you can unsafely escape from the requirement of threading state through everything that needs it.
import System.IO.Unsafe
unsafeMemoize :: Ord k => (k -> a) -> k -> a
unsafeMemoize f = unsafePerformIO $ do
  f' <- stToIO $ memoize $ return . f
  return $ unsafePerformIO . stToIO . f'

fib :: Integer -> Integer
fib = unsafeMemoize $ \n -> if n < 2 then n else fib (n - 1) + fib (n - 2)

main :: IO ()
main = mapM_ (print . fib) [1..1000]
Building on #Ramsey's answer, I also suggest you reconceive your function to take a map and return a modified one. Then code using good ol' Data.Map, which is pretty efficient at modifications. Here is a pattern:
import qualified Data.Map as Map
-- | Takes an input and a map, and returns a result and a modified map.
myFunc :: a -> Map.Map k v -> (r, Map.Map k v)
myFunc a m = … -- put your function here

-- | Runs myFunc over a list of inputs, gathering the outputs.
-- This starts with an initial map, uses successive versions of the map
-- on each iteration, and returns a tuple of the results and the final map.
mapFuncWithMap :: [a] -> Map.Map k v -> ([r], Map.Map k v)
mapFuncWithMap as m0 = foldr step ([], m0) as
  where step a (rs, m) = let (r, m') = myFunc a m in (r : rs, m')

-- | Same as above, but starts with an empty map, and ignores the final map.
mapFunc :: [a] -> [r]
mapFunc as = fst $ mapFuncWithMap as Map.empty
It is easy to abstract this pattern and make mapFuncWithMap generic over functions that use maps in this way.
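For instance, with Control.Monad.State from mtl, the pattern above is essentially traverse in the State monad. A sketch (note this threads the map left-to-right through the inputs, whereas the foldr above threads it from the right):
import qualified Data.Map as Map
import Control.Monad.State (runState, state)

-- Generic version: thread a map through any map-modifying step function.
mapWithMap :: (a -> Map.Map k v -> (r, Map.Map k v))
           -> [a] -> Map.Map k v -> ([r], Map.Map k v)
mapWithMap f as m0 = runState (traverse (state . f) as) m0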
Although you ask for a mutable type, let me suggest that you use an immutable data structure and that you pass successive versions to your functions as an argument.
Regarding which data structure to use,
There is an implementation of red-black trees at Kent
If you have integer keys, Data.IntMap is extremely efficient.
If you have string keys, the bytestring-trie package from Hackage looks very good.
The problem is that I cannot (or don't know how to) use a non-mutable type.
If you're lucky, you can pass your table data structure as an extra parameter to every function that needs it. If, however, your table needs to be widely distributed, you may wish to use a state monad where the state is the contents of your table.
If you are trying to memoize, you can try some of the lazy memoization tricks from Conal Elliott's blog, but as soon as you go beyond integer arguments, lazy memoization becomes very murky—not something I would recommend you try as a beginner. Maybe you can post a question about the broader problem you are trying to solve? Often with Haskell and mutability the issue is how to contain the mutation or updates within some kind of scope.
It's not so easy learning to program without any global mutable variables.
If I read your comments right, then you have a structure with possibly ~500k total values to compute. The computations are expensive, so you want them done only once, and on subsequent accesses, you just want the value without recomputation.
In this case, use Haskell's laziness to your advantage! ~500k is not so big: Just build a map of all the answers, and then fetch as needed. The first fetch will force computation, subsequent fetches of the same answer will reuse the same result, and if you never fetch a particular computation - it never happens!
You can find a small implementation of this idea using 3D point distances as the computation in the file PointCloud.hs. That file uses Debug.Trace to log when the computation actually gets done:
> ghc --make PointCloud.hs
[1 of 1] Compiling Main ( PointCloud.hs, PointCloud.o )
Linking PointCloud ...
> ./PointCloud
(1,2)
(<calc (1,2)>)
Just 1.0
(1,2)
Just 1.0
(1,5)
(<calc (1,5)>)
Just 1.0
(1,2)
Just 1.0
Are there any other options?
A mutable reference to a purely functional dictionary like Data.Map.
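A minimal sketch of that idea, wrapping a Data.Map in an IORef (the key and value here are just for illustration):
import Data.IORef
import qualified Data.Map as Map

main :: IO ()
main = do
  ref <- newIORef (Map.empty :: Map.Map String Int)  -- the "mutable map"
  modifyIORef' ref (Map.insert "answer" 42)
  m <- readIORef ref
  print (Map.lookup "answer" m)  -- Just 42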
