Is runInBoundThread the best tool for parallelism? - performance

Say, I want to fold monoids in parallel. My computer has 8 cores. I have this function to split a list into equal-sized smaller lists (with bounded modulo-bias):
import Data.List
parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
  (us,vs) = splitAt (quot (length xs) n) xs
  in us : parallelize (n-1) vs
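For example, in ghci (note that any remainder elements end up in the later chunks):
ghci> parallelize 3 [1..10]
[[1,2,3],[4,5,6],[7,8,9,10]]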
The first version of parallel fold I made was:
import Control.Concurrent
import Control.Concurrent.QSemN
import Data.Foldable
import Data.IORef
foldP :: Monoid m => [m] -> IO m
foldP xs = do
  result <- newIORef mempty
  sem <- newQSemN 0
  n <- getNumCapabilities
  let yss = parallelize n xs
  for_ yss (\ys -> forkIO (modifyIORef result (fold ys <>) >> signalQSemN sem 1))
  waitQSemN sem n
  readIORef result
But usage of IORefs and semaphores seemed ugly to me. So I made another version:
import Data.Traversable
foldP :: Monoid m => [m] -> IO m
foldP xs = do
  n <- getNumCapabilities
  let yss = parallelize n xs
  rs <- for yss (\ys -> runInUnboundThread (return (fold ys)))
  return (fold rs)
The test code I used is:
import Data.Monoid
import System.CPUTime
main :: IO ()
main = do
  start <- getCPUTime
  Product result <- foldP (fmap Product [1 .. 100])
  end <- getCPUTime
  putStrLn ("Time took: " ++ show (end - start) ++ "ps.")
  putStrLn ("Result: " ++ show result)
The second version of foldP outperformed the first version. When I used runInBoundThread instead of runInUnboundThread, it became even faster.
What causes these performance differences?

TL;DR: use the fold function from the massiv package and you will likely get the most efficient solution in Haskell.
I would like to start by saying that the first thing people forget when implementing concurrent patterns like this is exception handling. In the solution from the question, exception handling is non-existent, so it is totally wrong. Therefore I'd recommend using existing implementations of common concurrency patterns. async is the go-to library for concurrency, but for this use case it will not be the most efficient solution.
This particular example can easily be solved with the scheduler package; in fact, it is exactly the kind of workload it was designed for. Here is how you can use it to fold monoids:
import Control.Scheduler
import Control.Monad.IO.Unlift
foldP :: (MonadUnliftIO m, Monoid n) => Comp -> [n] -> m n
foldP comp xs = do
  rs <-
    withScheduler comp $ \scheduler ->
      mapM_ (scheduleWork scheduler . pure . fold) (parallelize (numWorkers scheduler) xs)
  pure $ fold rs
See the Comp type for an explanation of the available parallelization strategies. From what I have found in practice, Par will usually work best, because it uses pinned threads created with forkOn.
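For example, the Product test from the question can be run through this foldP like so (Par being one of the Comp constructors):
import Data.Monoid (Product (..))

main :: IO ()
main = do
  Product result <- foldP Par (map Product [1 .. 100 :: Integer])
  print result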
Note that the parallelize function is implemented inefficiently and dangerously as well; it is better to write it this way:
parallelize :: Int -> [a] -> [[a]]
parallelize n' xs' = go 0 id xs'
  where
    n = max 1 n'
    -- at least two elements make sense to get benefit of parallel fold
    k = max 2 $ quot (length xs') n
    go i acc xs
      | null xs = acc []
      | i < n =
          case splitAt k xs of
            (ls, rs) -> go (i + 1) (acc . (ls :)) rs
      | otherwise = acc . (xs :) $ []
One more bit of advice: a list is far from the ideal data structure for parallelization, or for efficiency in general. Just to split the list into chunks before parallelizing the computation, you already have to walk the whole structure with parallelize, which could be avoided if you were to use an array. What I am getting at is: use an array instead, as suggested at the beginning of this answer.
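As a rough sketch of that suggestion (I am writing fromList, foldMono and the B representation from memory of massiv's API, so double-check them against the version you use):
import qualified Data.Massiv.Array as A
import Data.Monoid (Product (..))

-- the product of 1..100 as a parallel monoidal fold over a boxed
-- massiv array; the Par strategy drives the parallel reduction
productP :: Integer
productP =
  getProduct (A.foldMono Product (A.fromList A.Par [1 .. 100] :: A.Array A.B A.Ix1 Integer))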

Related

Why is `filterM + mapM_` so much slower than `mapM_ + when`, with large lists?

I don't know very much about how Haskell optimization works internally, but I've been using filters quite a lot, hoping that they are optimized into something equivalent to a simple if in C++. For example,
mapM_ print $ filter (\n -> n `mod` 2 == 0) [0..10]
will compile into equivalent of
for (int i = 0; i <= 10; i++)
    if (i % 2 == 0)
        printf("%d\n", i);
With long lists (10,000,000 elements) this seems to hold for a plain filter, but there is a huge difference when I use the monadic filterM. I wrote a piece of code for this speed test, and it's obvious that filterM takes much longer (250x) than the more imperative approach using when.
import Data.Array.IO
import Control.Monad
import System.CPUTime
main :: IO ()
main = do
  start <- getCPUTime
  arr <- newArray (0, 100) 0 :: IO (IOUArray Int Int)
  let
    okSimple i =
      i < 100
    ok i = do
      return $ i < 100
    -- -- of course we don't need IO for a simple i < 100
    -- -- but my goal is to ask for the contents of the array, e.g.
    -- ok i = do
    --   current <- readArray arr (i `mod` 101)
    --   return $ i `mod` 37 > current `mod` 37
    write :: Int -> IO ()
    write i =
      writeArray arr (i `mod` 101) i
    writeIfOkSimple :: Int -> IO ()
    writeIfOkSimple i =
      when (okSimple i) $ write i
    writeIfOk :: Int -> IO ()
    writeIfOk i =
      ok i >>= (\isOk -> when isOk $ write i)
  -------------------------------------------------------------------
  ---- these four methods have approximately same execution time ----
  ---- (but the last one is executed on 250 times shorter list)  ----
  -------------------------------------------------------------------
  -- mapM_ write $ filter okSimple [0..10000000*250]  -- t = 20.694
  -- mapM_ writeIfOkSimple [0..10000000*250]          -- t = 20.698
  -- mapM_ writeIfOk [0..10000000*250]                -- t = 20.669
  filterM ok [0..10000000] >>= mapM_ write            -- t = 17.200
  -- evaluate array
  elems <- getElems arr
  print $ sum elems
  end <- getCPUTime
  print $ fromIntegral (end - start) / (10^12)
My question is: shouldn't both approaches (using writeIfOk vs. using filterM ok followed by write) compile into the same code (iterate over the list, check the condition, write the data)? If not, can I do something (rewrite the code, add compilation flags, add an INLINE pragma, ...) to make them computationally equivalent, or should I always use when when performance is critical?
Boiling this question down to its essence, you're asking about the difference between
f (filter g xs)
and
f =<< filterM (pure . g) xs
This basically comes down to laziness. filter g xs produces its result incrementally as it's demanded, only walking xs far enough to find the next element of the result. filterM is defined something like this:
filterM _p [] = pure []
filterM p (x : xs) =
  liftA2 (\q r -> if q then x : r else r)
         (p x)
         (filterM p xs)
Since IO is a "strict" applicative, this will not produce anything at all until it's walked the whole list, accumulating the p x results in memory.
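If the goal is the filter-then-consume shape without materializing the whole filtered list first, you can fuse the predicate into the consumer yourself. This hypothetical helper (my name, not a library function) behaves like the writeIfOk variant from the question:
import Control.Monad (when)

-- runs the monadic predicate and the action element by element,
-- so no intermediate list of results is accumulated
filterMapM_ :: Monad m => (a -> m Bool) -> (a -> m ()) -> [a] -> m ()
filterMapM_ p f = mapM_ (\x -> p x >>= \b -> when b (f x))
For instance, filterMapM_ ok write [0..10000000] performs the same writes as filterM ok [0..10000000] >>= mapM_ write, without building the filtered list.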

Finding a "Count Sequence"

Given a list of integers xs, let:
count :: [Integer] -> Integer -> Integer
count xs n = fromIntegral . length . filter (== n) $ xs
count the number of times the integer n occurs in the list.
Now, given a "list" (some sort of array of integers, can be something besides a List) of length n, write a function
countSequence :: [Integer] -> Integer -> Integer -> [Integer]
countSequence xs n m = [count xs x | x <- [0..m]]
that outputs the "list of counts" (the 0th index holds the number of times 0 occurs in the list, the 1st index the number of times 1 occurs, and so on) and has time complexity o(m*n).
The implementation I've given above has complexity O(m*n). In Python (which I'm more familiar with), it's easy to do this in O(m + n) time: iterate through the list and, for each element, increment a counter in another list that is initialized to all zeros and has length m+1.
How could I get a better implementation in Haskell? I'd prefer it not to be some trivial transliteration of the Python solution (such as adding another argument to the function to carry the "list of counts" and then iterating through it).
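For concreteness, the intended behaviour on a small input (here n would be the length of the list):
> countSequence [0,1,1,3] 4 3
[1,2,0,1]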
In O(n+m) (sort of, I think, maybe):
import Data.List (foldl')
import Data.Ix (inRange)
import qualified Data.IntMap.Strict as IM

countSequence m =
    foldl' count IM.empty . filter (inRange (0,m))
  where count a b = IM.insertWith (+) b 1 a
gives
> countSequence 2 [1,2,3,1,2,-1]
fromList [(1,2),(2,2)]
I haven't used n because you also didn't use n and I'm not sure what it's supposed to be. I also moved the list to the last argument to put it in a position to be eta reduced.
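If you then need the plain list of counts for 0..m that the question asked for, it can be read back out of the IntMap; countsList here is a small helper of my own, not part of the answer above:
countsList :: Int -> IM.IntMap Int -> [Int]
countsList m cm = [IM.findWithDefault 0 i cm | i <- [0 .. m]]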
I think you should use your Python intuition -- iterate through the one list and increment a counter in another list. Here's an implementation with O(n+m) runtime:
import Data.Array
countSequence xs m = accumArray (+) 0 (0,m) [(x, 1) | x <- xs, inRange (0,m) x]
(This use case is even the motivating example for the existence of accumArray in the documentation!) In ghci:
> countSequence ([1..5] ++ [1,3..5] ++ [1,4..5] ++ [1,5]) 3
array (0,3) [(0,0),(1,4),(2,1),(3,2)]
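To get the plain list of counts back out of the array, elems (re-exported by Data.Array) does the job:
> elems (countSequence ([1..5] ++ [1,3..5] ++ [1,4..5] ++ [1,5]) 3)
[0,4,1,2]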
I guess using Data.IntMap is about as efficient as it gets for this job. One foldr pass establishes the IntMap of counts (cm), and a map over the input then constructs a new list holding, at each position, the count of the element found there.
import qualified Data.IntMap.Lazy as IM

countSequence :: [Int] -> [Int]
countSequence xs = map (\x -> IM.findWithDefault 0 x cm) xs
  where
    cm = foldr (\y m -> IM.alter (\my -> if my == Nothing then Just 1 else fmap (+1) my) y m) IM.empty xs
*Main> countSequence [1,2,5,1,3,7,8,5,6,4,1,2,3,7,9,3,4,8]
[3,2,2,3,3,2,2,2,1,2,3,2,3,2,1,3,2,2]
*Main> countSequence [4,5,4]
[2,1,2]
*Main> countSequence [9,8,7,6,5]
[1,1,1,1,1]

Optimize a list function that creates too much garbage (not stack overflow)

I have a Haskell function that's causing more than 50% of all the allocations of my program and making 60% of my run time be taken by the GC. I run with a small stack (-K10K) so there is no stack overflow, but can I make this function faster, with less allocation?
The goal here is to calculate the product of a matrix and a vector. I cannot use hmatrix, for example, because this is part of a bigger function using the ad automatic differentiation package, so I need to use lists of Num. At runtime I suppose the use of the Numeric.AD module means my types must be Scalar Double.
listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go [] _ s = [s]
    go ls [] s = s : go ls vdt 0
    go (y:ys) (x:xs) ix = go ys xs (y*x + ix)
Basically we loop through the matrix, multiplying and adding into an accumulator until we reach the end of the vector, storing the result, then continuing with the vector restarted from its beginning. I have a QuickCheck test verifying that I get the same result as the matrix/vector product in hmatrix.
I have tried foldl, foldr, etc. Nothing I've tried makes the function faster (and some things, like foldr, cause a space leak).
Running with profiling tells me, on top of the fact that this function is where most of the time and allocation is spent, that loads of Cells are being created, Cell being a data type from the ad package.
A simple test to run:
import Numeric.AD
main = do
  let m :: [Double] = replicate 400 0.2
      v :: [Double] = replicate 4 0.1
      mycost v m = sum $ listMProd m v
      mygrads = gradientDescent (mycost (map auto v)) (map auto m)
  print $ mygrads !! 1000
This on my machine tells me GC is busy 47% of the time.
Any ideas?
A very simple optimization is to make the go function strict in its accumulator parameter, because the accumulator is small, can be unboxed if a is primitive, and always needs to be fully evaluated:
{-# LANGUAGE BangPatterns #-}

listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go [] _ !s = [s]
    go ls [] !s = s : go ls vdt 0
    go (y:ys) (x:xs) !ix = go ys xs (y*x + ix)
On my machine, it gives 3-4x speedup (compiled with -O2).
On the other hand, the intermediate lists shouldn't be made strict, so that they can still be fused away.
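If you'd rather avoid the extension, the same strictness can be written with seq; this is a sketch equivalent to the bang-patterns version above:
listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    -- seq forces the accumulator before each recursive call,
    -- exactly what the bang patterns achieve
    go [] _ s = s `seq` [s]
    go ls [] s = s `seq` (s : go ls vdt 0)
    go (y:ys) (x:xs) ix = let ix' = y*x + ix in ix' `seq` go ys xs ix'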

Generalizing a combinatoric function?

I've been solving a few combinatoric problems on Haskell, so I wrote down those 2 functions:
permutations :: (Eq a) => [a] -> [[a]]
permutations [] = [[]]
permutations list = do
  x <- list
  xs <- permutations (filter (/= x) list)
  return (x : xs)

combinations :: (Eq a, Ord a) => Int -> [a] -> [[a]]
combinations 0 _ = [[]]
combinations n list = do
  x <- list
  xs <- combinations (n-1) (filter (> x) list)
  return (x : xs)
Which works as follows:
*Main> permutations [1,2,3]
[[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
*Main> combinations 2 [1,2,3,4]
[[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
Those were uncomfortably similar, so I had to abstract it. I wrote the following abstraction:
combinatoric next [] = [[]]
combinatoric next list = do
  x <- list
  xs <- combinatoric next (next x list)
  return (x : xs)
Which receives a function that controls how to filter the elements of the list. It can be used to easily define permutations:
permutations :: (Eq a) => [a] -> [[a]]
permutations = combinatoric (\ x ls -> filter (/= x) ls)
But I couldn't define combinations this way, since it carries a state (n). I could extend combinatoric with an additional state argument, but that would become too clunky, and I remember such an approach not being necessary in a somewhat similar situation. Thus I wonder: is it possible to define combinations using combinatoric? If not, what is a better abstraction that successfully subsumes both functions?
This isn't a direct answer to your question (sorry), but I don't think your code is correct. The Eq and Ord constraints tipped me off - they shouldn't be necessary - so I wrote a couple of QuickCheck properties.
prop_numberOfPermutations xs = length (permutations xs) === factorial (length xs)
  where _ = (xs :: [Int]) -- force xs to be instantiated to [Int]

prop_numberOfCombinations (Positive n) (NonEmpty xs) = n <= length xs ==>
    length (combinations n xs) === choose (length xs) n
  where _ = (xs :: [Int])
factorial :: Int -> Int
factorial x = foldr (*) 1 [1..x]
choose :: Int -> Int -> Int
choose n 0 = 1
choose 0 r = 0
choose n r = choose (n-1) (r-1) * n `div` r
The first property checks that the number of permutations of a list of length n is n!. The second checks that the number of r-combinations of a list of length n is C(n, r). Both of these properties fail when I run them against your definitions:
ghci> quickCheck prop_numberOfPermutations
*** Failed! Falsifiable (after 5 tests and 4 shrinks):
[0,0,0]
3 /= 6
ghci> quickCheck prop_numberOfCombinations
*** Failed! Falsifiable (after 4 tests and 1 shrink):
Positive {getPositive = 2}
NonEmpty {getNonEmpty = [3,3]}
0 /= 1
It looks like your functions fail when the input list contains duplicate elements. Writing an abstraction for an incorrect implementation isn't a good idea - don't try and run before you can walk! You might find it helpful to read the source code for the standard library's definition of permutations, which does not have an Eq constraint.
First let's improve the original functions. You assume that for permutations all elements are distinct with respect to equality, and that for combinations they're distinct and have an ordering. These constraints aren't necessary and, as described in the other answer, the code can produce wrong results. Following the robustness principle, let's accept just unconstrained lists. For this we'll need a helper function that produces all possible splits of a list:
split :: [a] -> [([a], a, [a])]
split = loop []
  where
    loop _ [] = []
    loop rs (x:xs) = (rs, x, xs) : loop (x:rs) xs
Note that this implementation returns the prefixes reversed, but that's nothing we require here.
This allows us to write generic permutations and combinations.
permutations :: [a] -> [[a]]
permutations [] = [[]]
permutations list = do
  (pre, x, post) <- split list
  -- reversing 'pre' isn't really necessary, but makes the output
  -- order natural
  xs <- permutations (reverse pre ++ post)
  return (x : xs)

combinations :: Int -> [a] -> [[a]]
combinations 0 _ = [[]]
combinations n list = do
  (_, x, post) <- split list
  xs <- combinations (n-1) post
  return (x : xs)
Now what they have in common:
- at each step they pick an element to output,
- they update the list of elements to pick from, and
- they stop after some condition is met.
The last point is a bit problematic, as for permutations we end once the list to choose from is empty, while for combinations we have a counter. This is probably the reason why it was difficult to generalize. We can work around this by realizing that for permutations the number of steps is equal to the length of the input list, so we can express the condition in the number of repetitions.
For such problems it's often very convenient to express them using StateT s [] monad, where s is the state we're working with. In our case it'll be the list of elements to choose from. The core of our combinatorial functions can be then expressed with StateT [a] [] a: pick an element from the state and update the state for the next step. Since the stateful computations all happen in the [] monad, we automatically branch all possibilities. With that, we can define a generic function:
import Control.Monad.State
combinatoric :: Int -> StateT [a] [] b -> [a] -> [[b]]
combinatoric n k = evalStateT $ replicateM n k
And then define permutations and combinations by specifying the appropriate number of repetitions and what's the core StateT [a] [] a function:
permutations' :: [a] -> [[a]]
permutations' xs = combinatoric (length xs) f xs
  where
    f = StateT $ map (\(pre, x, post) -> (x, reverse pre ++ post)) . split

combinations' :: Int -> [a] -> [[a]]
combinations' n xs = combinatoric n f xs
  where
    f = StateT $ map (\(_, x, post) -> (x, post)) . split
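These reproduce the outputs from the question:
ghci> permutations' [1,2,3]
[[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
ghci> combinations' 2 [1,2,3,4]
[[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]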

sorting integers fast in haskell

Is there any function in the Haskell libraries that sorts integers in O(n) time? [By O(n) I mean faster than comparison sort and specific to integers.]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq but what I am trying for does, but the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1: memory is a bigger issue than time for me here, as for the task at hand I am getting out-of-memory errors (memory usage grows to 3GB after 30 minutes of run time)!
I am assuming that faster algorithms will have a better memory footprint too, hence I'm looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution used: introsort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause excessive memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that inplace (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
  | lo < hi = do
      let lscan p h i
            | i < h = do
                v <- unsafeRead a i
                if p < v then return i else lscan p h (i+1)
            | otherwise = return i
          rscan p l i
            | l < i = do
                v <- unsafeRead a i
                if v < p then return i else rscan p l (i-1)
            | otherwise = return i
          swap i j = do
            v <- unsafeRead a i
            unsafeRead a j >>= unsafeWrite a i
            unsafeWrite a j v
          sloop p l h
            | l < h = do
                l1 <- lscan p h l
                h1 <- rscan p l1 h
                if l1 < h1 then swap l1 h1 >> sloop p l1 h1 else return l1
            | otherwise = return l
      piv <- unsafeRead a hi
      i <- sloop piv lo hi
      swap i hi
      myqsort a lo (i-1)
      myqsort a (i+1) hi
  | otherwise = return ()
genlist gen = runST $ do
  arr <- newListArray (0, 2^22-1) $ take (2^22) (randoms gen)
  myqsort arr 0 (2^22-1)
  let collect acc 0 = do
        v <- unsafeRead arr 0
        return (v:acc)
      collect acc i = do
        v <- unsafeRead arr i
        collect (v:acc) (i-1)
  collect [] (2^22-1)
main = do
  gen <- newStdGen
  putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32MB of storage raw (with 64-bit Ints); with the list overhead of, iirc, five words per element, that adds up to ~200MB, but that's less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array (Array, accumArray, assocs)

sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
  where count :: Array Int Int
        count = accumArray (+) 0 range (zip xs (repeat 1))
        range = (0, maximum xs)
It works by creating an Array indexed by the integers, where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating each one as many times as it occurred in the original list, according to the counts.
You should note that it is linear in the maximum value in the list, not in the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted in linear time.
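For example:
> sort [3,1,2,1,0]
[0,1,1,2,3]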
