Batching actions for caching and performance while avoiding the dirty work - caching

Say I have two pure but unsafe functions that compute the same thing, but one of them works on batches and is asymptotically faster:
f :: Int -> Result -- takes O(1) time
f = unsafePerformIO ...
g :: [Int] -> [Result] -- takes O(log n) time
g = unsafePerformIO ...
A naive implementation:
getUntil :: Int -> [Result]
getUntil 0 = [f 0]
getUntil n = f n : getUntil (n-1)
switch is the n value where g gets cheaper than f.
getUntil will in practice be called with ever increasing n, but it might not start at 0. So since the Haskell runtime can memoize getUntil, performance will be optimal if getUntil is called with an interval lower than switch. But once the interval gets larger, this implementation is slow.
In an imperative program, I guess I would use a TreeMap (which can quickly be checked for gaps) to cache all calls. On a cache miss, the gap would be filled with the results of g if it was longer than switch, and with the results of f otherwise.
How can this be optimized in Haskell?
I think I am just looking for:
an ordered map filled on-demand using a fill function that would fill all values up to the requested index using one function if the missing range is small, another if it is large
a get operation on the map which returns a list of all lower values up to the requested index. This would result in a function similar to getUntil above.

I'll elaborate in my proposal for using map, after some tests I just ran.
import System.IO
import System.IO.Unsafe
import Control.Concurrent
import Control.Monad
switch :: Int
switch = 1000
f :: Int -> Int
f x = unsafePerformIO $ do
    threadDelay $ 500 * x
    putStrLn $ "Calculated from scratch: f(" ++ show x ++ ")"
    return $ 500*x
g :: Int -> Int
g x = unsafePerformIO $ do
    threadDelay $ x*x `div` 2
    putStrLn $ "Calculated from scratch: g(" ++ show x ++ ")"
    return $ x*x `div` 2
cachedFG :: [Int]
cachedFG = map g [0 .. switch] ++ map f [switch+1 ..]
main :: IO ()
main = forever $ getLine >>= print . (cachedFG !!) . read
… where f, g and switch have the same meaning indicated in the question.
The above program can be compiled as is using GHC. When executed, positive integers can be entered, followed by a newline, and the application will print some value based on the number entered by the user plus some extra indication on what values are being calculated from scratch.
A short session with this program is:
User: 10000
Program: Calculated from scratch: f(10000)
Program: 5000000
User: 10001
Program: Calculated from scratch: f(10001)
Program: 5000500
User: 10000
Program: 5000000
^C
The program has to be killed/terminated manually.
Notice that the last value entered doesn't show a "calculated from scratch" message. This indicates that the program has the value cached/memoized somewhere. You can try executing this program yourself, but keep in mind that threadDelay's lag is proportional to the value entered.
The getUntil function then could be implemented using:
getUntil :: Int -> [Int]
getUntil n = take n cachedFG
or:
getUntil :: Int -> [Int]
getUntil = flip take cachedFG
If you don't know the value for switch, you can try evaluating f and g in parallel and use the fastest result, but that's another show.
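The on-demand ordered map described in the question could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `single` and `batch` are hypothetical pure stand-ins for f and g, `switch` is set to an arbitrary small value, the cache is passed around explicitly instead of being hidden behind unsafePerformIO, and the requested range is assumed to start at 0. Results come back in ascending key order.

```haskell
import qualified Data.IntMap.Strict as IM

type Result = Int

-- Assumed threshold above which the batch function is cheaper.
switch :: Int
switch = 4

-- Hypothetical stand-in for the per-element function f.
single :: Int -> Result
single x = x * x

-- Hypothetical stand-in for the batch function g.
batch :: [Int] -> [Result]
batch = map (\x -> x * x)

-- Make sure the cache holds every key in [0 .. n].
-- Large gaps are filled with one batch call, small gaps element by element.
fillUpTo :: Int -> IM.IntMap Result -> IM.IntMap Result
fillUpTo n cache
  | null missing            = cache
  | length missing > switch = IM.union cache (IM.fromList (zip missing (batch missing)))
  | otherwise               = foldr (\k -> IM.insert k (single k)) cache missing
  where
    missing = [k | k <- [0 .. n], not (IM.member k cache)]

-- Return all results for keys <= n together with the updated cache.
getUntilCached :: Int -> IM.IntMap Result -> ([Result], IM.IntMap Result)
getUntilCached n cache = (IM.elems upToN, cache')
  where
    cache'     = fillUpTo n cache
    (upToN, _) = IM.split (n + 1) cache'   -- keys strictly below n + 1
```

Threading the cache through the caller keeps the sketch pure; wrapping it in an IORef or a State monad would be one way to hide it again.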

Related

Haskell explicit recursion vs `iterate`

While writing a function using iterate in Haskell, I found that an equivalent version with explicit recursion seemed noticeably faster - even though I believed that explicit recursion ought to be frowned upon in Haskell.
Similarly, I expected GHC to be able to inline/optimise list combinators appropriately so that the resulting machine code is at least similarly performing to the explicit recursion.
Here's a (different) example, which also displays the slowdown I observed.
steps m n and its variant steps' compute the number of Collatz steps n takes to reach 1, giving up after m attempts.
steps uses explicit recursion while steps' uses list functions.
import Data.List (elemIndex)
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
collatz :: Int -> Int
collatz n
  | even n = n `quot` 2
  | otherwise = 3 * n + 1
steps :: Int -> Int -> Maybe Int
steps m = go 0
  where go k n
          | n == 1 = Just k
          | k == m = Nothing
          | otherwise = go (k+1) (collatz n)
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps 800) $ [1..10^7]
I tested these by evaluating for all values up to 10^7, each giving up after 800 steps. On my machine (compiled with ghc -O2), explicit recursion took just under 4 seconds (3.899s) but list combinators took about 5 times longer (19.922s).
Why is explicit recursion so much better in this case, and is there a way of writing this without explicit recursion while preserving performance?
Updated: I submitted Trac 15426 for this bug.
The problem disappears if you copy the definitions of elemIndex and findIndex into your module:
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
import Data.Maybe (listToMaybe)
import Data.List (findIndices)
elemIndex :: Eq a => a -> [a] -> Maybe Int
elemIndex x = findIndex (x==)
findIndex :: (a -> Bool) -> [a] -> Maybe Int
findIndex p = listToMaybe . findIndices p
collatz :: Int -> Int
collatz n
  | even n = n `quot` 2
  | otherwise = 3 * n + 1
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps' 800) $ [1..10^7]
The problem seems to be that these must be inlinable for GHC to get the fusion right. Unfortunately, neither of them is marked inlinable in Data.OldList.
The change to allow findIndex to participate in fusion is relatively recent (see Trac 14387) where listToMaybe was reimplemented as a foldr. So, it probably hasn't seen a lot of testing yet.
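To make concrete what the list pipeline computes before any fusion happens, here is a small self-contained sketch that reuses the definitions from the question. The helper `prefix` is a hypothetical name added only for illustration; it exposes the intermediate list that elemIndex searches.

```haskell
import Data.List (elemIndex)

collatz :: Int -> Int
collatz n
  | even n    = n `quot` 2
  | otherwise = 3 * n + 1

-- Explicit recursion, as in the question.
steps :: Int -> Int -> Maybe Int
steps m = go 0
  where
    go k n
      | n == 1    = Just k
      | k == m    = Nothing
      | otherwise = go (k + 1) (collatz n)

-- List combinators, as in the question.
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz

-- Hypothetical helper: the intermediate list that steps' searches.
prefix :: Int -> Int -> [Int]
prefix m = take m . iterate collatz

-- prefix 9 6 is [6,3,10,5,16,8,4,2,1], so both steps 800 6 and
-- steps' 800 6 evaluate to Just 8.
```

When fusion succeeds, the list produced by `prefix` is never materialised, which is what lets the combinator version approach the speed of the explicit loop.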

Efficient summation in OCaml

Please note I am almost a complete newbie in OCaml. In order to learn a bit, and test its performance, I tried to implement a module that approximates Pi using the Leibniz series.
My first attempt led to a stack overflow (the actual error, not this site). Knowing from Haskell that this may come from too many "thunks", or promises to compute something, while recursing over the addends, I looked for some way of keeping just the last result while summing with the next. I found the following tail-recursive implementations of sum and map in the notes of an OCaml course, here and here, and expected the compiler to produce an efficient result.
However, the resulting executable, compiled with ocamlopt, is much slower than a C++ version compiled with clang++. Is this code as efficient as possible? Is there some optimization flag I am missing?
My complete code is:
let (--) i j =
  let rec aux n acc =
    if n < i then acc else aux (n-1) (n :: acc)
  in aux j [];;
let sum_list_tr l =
  let rec helper a l = match l with
    | [] -> a
    | h :: t -> helper (a +. h) t
  in helper 0. l
let rec tailmap f l a = match l with
  | [] -> a
  | h :: t -> tailmap f t (f h :: a);;
let rev l =
  let rec helper l a = match l with
    | [] -> a
    | h :: t -> helper t (h :: a)
  in helper l [];;
let efficient_map f l = rev (tailmap f l []);;
let summand n =
  let m = float_of_int n
  in (-1.) ** m /. (2. *. m +. 1.);;
let pi_approx n =
  4. *. sum_list_tr (efficient_map summand (0 -- n));;
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (pi_approx n);;
Just for reference, here are the measured times on my machine:
❯❯❯ time ocaml/main 10000000
3.14159275359
ocaml/main 10000000 3,33s user 0,30s system 99% cpu 3,625 total
❯❯❯ time cpp/main 10000000
3.14159
cpp/main 10000000 0,17s user 0,00s system 99% cpu 0,174 total
For completeness, let me state that the first helper function, an equivalent to Python's range, comes from this SO thread, and that this is run using OCaml version 4.01.0, installed via MacPorts on a Darwin 13.1.0.
As I noted in a comment, OCaml's floats are boxed, which puts OCaml at a disadvantage compared to Clang.
However, I may be noticing another typical rough edge of trying OCaml after Haskell:
if I read your program correctly, you are creating a list of values, then mapping a function over that list, and finally folding it into a result.
In Haskell, you could more or less expect such a program to be automatically “deforested” at compile-time, so that the resulting generated code was an efficient implementation of the task at hand.
In OCaml, the fact that functions can have side-effects, and in particular functions passed to high-order functions such as map and fold, means that it would be much harder for the compiler to deforest automatically. The programmer has to do it by hand.
In other words: stop building huge short-lived data structures such as 0 -- n and (efficient_map summand (0 -- n)). When your program decides to tackle a new summand, make it do all it wants to do with that summand in a single pass. You can see this as an exercise in applying the principles in Wadler's article (again, by hand, because for various reasons the compiler will not do it for you despite your program being pure).
Here are some results:
$ ocamlopt v2.ml
$ time ./a.out 1000000
3.14159165359
real 0m0.020s
user 0m0.013s
sys 0m0.003s
$ ocamlopt v1.ml
$ time ./a.out 1000000
3.14159365359
real 0m0.238s
user 0m0.204s
sys 0m0.029s
v1.ml is your version. v2.ml is what you might consider an idiomatic OCaml version:
let rec q_pi_approx p n acc =
if n = p
then acc
else q_pi_approx (succ p) n (acc +. (summand p))
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx 0 n 0.));;
(reusing summand from your code)
It might be more accurate to sum from the last terms to the first, instead of from the first to the last. This is orthogonal to your question, but you may consider it as an exercise in modifying a function that has been forcefully made tail-recursive. Besides, the (-1.) ** m expression in summand is mapped by the compiler to a call to the pow() function on the host, and that's a bag of hurt you may want to avoid.
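For readers coming from the Haskell side of this comparison, the two suggestions above (summing from the last term to the first, and avoiding the power function) can be sketched as follows. This is a minimal illustration, not a translation of any answer's code: the sign is chosen by parity instead of (**), and the names `summand` and `piApprox` are only reused for familiarity.

```haskell
import Data.List (foldl')

-- k-th term of the Leibniz series; the sign comes from the parity of k,
-- so no call to a power function is needed.
summand :: Int -> Double
summand k = (if even k then 1 else -1) / (2 * fromIntegral k + 1)

-- Sum the terms n-1, n-2, ..., 0 (smallest magnitude first) in a single
-- strict pass, without building a list of summands explicitly.
piApprox :: Int -> Double
piApprox n = 4 * foldl' (\acc k -> acc + summand k) 0 [n - 1, n - 2 .. 0]
```

Adding the small terms first tends to lose less precision than adding them to an already large accumulator, which is the accuracy point made above.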
I've also tried several variants, here are my conclusions:
Using arrays
Using recursion
Using imperative loop
The recursive function is about 30% faster than the array implementation. The imperative loop is roughly as fast as the recursion (maybe even a little slower).
Here're my implementations:
Array:
open Core.Std
let pi_approx n =
  let f m = (-1.) ** m /. (2. *. m +. 1.) in
  let qpi = Array.init n ~f:Float.of_int |>
            Array.map ~f |>
            Array.reduce_exn ~f:(+.) in
  qpi *. 4.0
Recursion:
let pi_approx n =
  let rec loop n acc m =
    if m = n
    then acc *. 4.0
    else
      let acc = acc +. (-1.) ** m /. (2. *. m +. 1.) in
      loop n acc (m +. 1.0) in
  let n = float_of_int n in
  loop n 0.0 0.0
This can be further optimized, by moving local function loop outside, so that compiler can inline it.
Imperative loop:
let pi_approx n =
  let sum = ref 0. in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum := !sum +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. !sum
However, in the code above, creating a ref for the sum incurs boxing/unboxing on each step. We can optimize this further with the float_ref trick:
type float_ref = { mutable value : float}
let pi_approx n =
  let sum = {value = 0.} in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum.value <- sum.value +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. sum.value
Scoreboard
for-loop (with float_ref) : 1.0
non-local recursion : 0.89
local recursion : 0.86
Pascal's version : 0.77
for-loop (with float ref) : 0.62
array : 0.47
original : 0.08
Update
I've updated the answer, as I've found a way to get a 40% speedup (or 33% in comparison with @Pascal's answer).
I would like to add that although floats are boxed in OCaml, float arrays are unboxed. Here is a program that builds a float array corresponding to the Leibniz sequence and uses it to approximate π:
open Array
let q_pi_approx n =
let summand n =
let m = float_of_int n
in (-1.) ** m /. (2. *. m +. 1.) in
let a = Array.init n summand in
Array.fold_left (+.) 0. a
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx n));;
Obviously, it is still slower than code that doesn't build any data structure at all. Execution times (the version with the array is the last one):
time ./v1 10000000
3.14159275359
real 0m2.479s
user 0m2.380s
sys 0m0.104s
time ./v2 10000000
3.14159255359
real 0m0.402s
user 0m0.400s
sys 0m0.000s
time ./a 10000000
3.14159255359
real 0m0.453s
user 0m0.432s
sys 0m0.020s

sorting integers fast in haskell

Is there any function in the Haskell libraries that sorts integers in O(n) time? (By O(n) I mean faster than a comparison sort and specific to integers.)
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq, but what I am actually trying to do does; the code above is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1 : memory is a bigger issue than time for me here as for the task at hand I am getting out of memory errors (memory usage becomes 3GB! after 30 minutes of run-time)
I am assuming faster algorithms will have a better memory footprint too, hence looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution Used : IntroSort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause excessive memory usage, or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that in place (e.g. with quicksort), and then building a list from it again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
   | lo < hi = do
       let lscan p h i
               | i < h = do
                   v <- unsafeRead a i
                   if p < v then return i else lscan p h (i+1)
               | otherwise = return i
           rscan p l i
               | l < i = do
                   v <- unsafeRead a i
                   if v < p then return i else rscan p l (i-1)
               | otherwise = return i
           swap i j = do
               v <- unsafeRead a i
               unsafeRead a j >>= unsafeWrite a i
               unsafeWrite a j v
           sloop p l h
               | l < h = do
                   l1 <- lscan p h l
                   h1 <- rscan p l1 h
                   if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
               | otherwise = return l
       piv <- unsafeRead a hi
       i <- sloop piv lo hi
       swap i hi
       myqsort a lo (i-1)
       myqsort a (i+1) hi
   | otherwise = return ()
genlist gen = runST $ do
    arr <- newListArray (0,2^22-1) $ take (2^22) (randoms gen)
    myqsort arr 0 (2^22-1)
    let collect acc 0 = do
            v <- unsafeRead arr 0
            return (v:acc)
        collect acc i = do
            v <- unsafeRead arr i
            collect (v:acc) (i-1)
    collect [] (2^22-1)
main = do
    gen <- newStdGen
    putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32 MB of raw storage (with 64-bit Ints); with the list overhead of, IIRC, five words per element, that adds up to ~200 MB, but less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array(Array,accumArray,assocs)
sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
  where count :: Array Int Int
        count = accumArray (+) 0 range (zip xs (repeat 1))
        range = (0, maximum xs)
It works by creating an Array indexed by integers where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating them the same number of times they occurred in the original list according to the counts.
You should note that it is linear with the maximum value in the list, not the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted linearly.
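As a hedged variation on the counting sort above, the sketch below uses the list's own minimum as the lower bound, so that negative values are accepted. It still assumes that the spread between minimum and maximum is small enough to allocate an array for; as noted elsewhere in this thread, a wide spread makes the approach use excessive memory or fail outright. The name `countingSort` is chosen here only for illustration.

```haskell
import Data.Array (Array, accumArray, assocs)

-- Counting sort over the range (minimum xs, maximum xs).
-- Time and memory are linear in the list length plus the size of that range.
countingSort :: [Int] -> [Int]
countingSort [] = []
countingSort xs = concat [replicate k x | (x, k) <- assocs counts]
  where
    counts :: Array Int Int
    counts = accumArray (+) 0 (minimum xs, maximum xs) [(x, 1) | x <- xs]
```

For example, a list such as [3, -1, 2, 3, 0] only needs an array of five counters, whereas a list containing both 0 and 2^40 would need about 2^40 of them.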

Reduce allocation sorting large list (or vector)

I am trying to reduce GC time in my program. The main suspect is the following piece of code:
Data.Vector.Unboxed.fromList . take n . List.sortBy (flip $ Ord.comparing id)
$ [ ( sum [ (c + a) * wsum z | (z,c) <- IntMap.toList zt_d ] , d)
| d <- IntMap.keys $ m
, let zt_d = IntMap.findWithDefault IntMap.empty d $ m ]
The list being sorted would typically contain several thousand elements. I think the list sort is the culprit, because if I replace take n . List.sortBy (flip $ Ord.comparing id) with return . List.maximum my productivity goes from 60% to 95%.
Is there anything I can do to reduce allocation here?
Update
As recommended, I replaced the List.sort by an inplace sort from vector-algorithms.
Perhaps I'm doing it wrong, but what I'm seeing is that there is no allocation (productivity 97% as opposed to 63% with lists), but the program is many times slower: it runs in 85 seconds with List.sortBy; with inplace sort I killed it after waiting 7 minutes. I tried both Intro and Merge sorts. Here is my code:
import qualified Data.Vector.Generic.Mutable as GM
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Algorithms.Merge as Sort
import qualified Data.Vector.Fusion.Stream as Stream
import Control.Monad.ST
sortBy :: (Ord a, U.Unbox a) => (a -> a -> Ordering) -> [a] -> U.Vector a
sortBy cmp xs = runST $ do
    mv <- GM.unstream . Stream.fromList $ xs
    Sort.sortBy cmp mv
    G.unsafeFreeze mv
The sorting does indeed look like it will cause a lot of allocation. While the sorting is performed on a list, that cannot be completely changed, since sorting lists causes the construction of many intermediate lists. If necessary, you could try to do the sorting on an MVector using for example the vector-algorithms package which provides efficient sorting algorithms.
However, there are further inefficiencies that cause more allocation than necessary in
Data.Vector.Unboxed.fromList . take n . List.sortBy (flip $ Ord.comparing id)
$ [ ( sum [ (c + a) * wsum z | (z,c) <- IntMap.toList zt_d ] , d)
| d <- IntMap.keys $ m
, let zt_d = IntMap.findWithDefault IntMap.empty d $ m ]
When you write
d <- IntMap.keys m, let zt_d = IntMap.findWithDefault IntMap.empty d m
-- The '$' are unnecessary, I left them out
you are 1) traversing the entire map to collect the list of keys, and 2) then looking up each key on its own. Since you only look up keys that are present in the map, you never use the default. It is much more efficient to create the list of key/value pairs in one traversal of the map:
(d,zt_d) <- IntMap.assocs m
Then if id in flip $ Ord.comparing id is indeed the identity function, that would be more readable (and possibly more efficient) as sortBy (flip compare).
Depending on the type of the summed elements (and possibly the optimisation level), it might be better to use Data.List.foldl' (+) 0 instead of sum.
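Putting the three suggestions together (IntMap.assocs, flip compare, and foldl'), the expression might look like the sketch below. The definitions of `wsum` and `a` are not shown in the question, so the ones here are hypothetical placeholders, as are the name `topN` and the choice of Double for the scores; the list-based sort is kept so that only the suggested changes differ.

```haskell
import qualified Data.IntMap as IntMap
import Data.List (foldl', sortBy)

-- Hypothetical placeholder for the question's weighting function.
wsum :: Int -> Double
wsum = fromIntegral

-- Hypothetical placeholder for the question's constant.
a :: Double
a = 1

-- Top n (score, key) pairs in descending order, built with a single
-- traversal of the outer map and a strict sum per key.
topN :: Int -> IntMap.IntMap (IntMap.IntMap Double) -> [(Double, Int)]
topN n m =
  take n . sortBy (flip compare) $
    [ (foldl' (+) 0 [(c + a) * wsum z | (z, c) <- IntMap.toList zt_d], d)
    | (d, zt_d) <- IntMap.assocs m ]
```

Whether this is enough to bring GC time down will depend on the real data; it only removes the redundant lookups and the lazy sum, not the intermediate lists created by the sort itself.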

Any way to create the unmemo-monad?

Suppose someone makes a program to play chess, or solve sudoku. In this kind of program it makes sense to have a tree structure representing game states.
This tree would be very large, "practically infinite". Which isn't by itself a problem as Haskell supports infinite data structures.
A familiar example of an infinite data structure:
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Nodes are only allocated when first used, so the list takes finite memory. One may also iterate over an infinite list if they don't keep references to its head, allowing the garbage collector to collect its parts which are not needed anymore.
Back to the tree example - suppose one does some iteration over the tree, the tree nodes iterated over may not be freed if the root of the tree is still needed (for example in an iterative deepening search, the tree would be iterated over several times and so the root needs to be kept).
One possible solution for this problem that I thought of is using an "unmemo-monad".
I'll try to demonstrate what this monad is supposed to do using monadic lists:
import Control.Monad.ListT (ListT) -- cabal install List
import Data.Copointed -- cabal install pointed
import Data.List.Class
import Prelude hiding (enumFromTo)
nums :: ListT Unmemo Int -- What is Unmemo?
nums = enumFromTo 0 1000000
main = print $ div (copoint (foldlL (+) 0 nums)) (copoint (lengthL nums))
Using nums :: [Int], the program would take a lot of memory as a reference to nums is needed by lengthL nums while it is being iterated over foldlL (+) 0 nums.
The purpose of Unmemo is to make the runtime not keep the nodes iterated over.
I attempted using ((->) ()) as Unmemo, but it yields the same results as nums :: [Int] does - the program uses a lot of memory, as evident by running it with +RTS -s.
Is there any way to implement Unmemo that does what I want?
Same trick as with a stream -- don't capture the remainder directly, but instead capture a value and a function which yields a remainder. You can add memoization on top of this as necessary.
data UTree a = Leaf a | Branch a (a -> [UTree a])
I'm not in the mood to figure it out precisely at the moment, but this structure arises, I'm sure, naturally as the cofree comonad over a fairly straightforward functor.
Edit
Found it: http://hackage.haskell.org/packages/archive/comonad-transformers/1.6.3/doc/html/Control-Comonad-Trans-Stream.html
Or this is perhaps simpler to understand: http://hackage.haskell.org/packages/archive/streams/0.7.2/doc/html/Data-Stream-Branching.html
In either case, the trick is that your f can be chosen to be something like data N s a = N (s -> (s,[a])) for an appropriate s (s being the type of your state parameter of the stream -- the seed of your unfold, if you will). That might not be exactly correct, but something close should do...
But of course for real work, you can scrap all this and just write the datatype directly as above.
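Writing the datatype directly could look roughly like the following sketch. The helpers `children`, `unfoldUTree` and `countToDepth` are hypothetical names introduced here for illustration. The idea is that a Branch stores only its own value and a function, so the subtrees are rebuilt each time they are requested instead of being retained; whether GHC really avoids sharing them can depend on optimisation settings, so treat this as an assumption to check with a heap profile.

```haskell
data UTree a = Leaf a | Branch a (a -> [UTree a])

-- Children of a node; for a Branch they are recomputed on every call.
children :: UTree a -> [UTree a]
children (Leaf _)        = []
children (Branch x next) = next x

-- Build a (possibly infinite) tree from a function that expands a state
-- into its successor states.
unfoldUTree :: (a -> [a]) -> a -> UTree a
unfoldUTree expand x = Branch x (map (unfoldUTree expand) . expand)

-- Count the nodes down to a given depth, e.g. one round of an
-- iterative deepening search.
countToDepth :: Int -> UTree a -> Int
countToDepth 0 _ = 1
countToDepth d t = 1 + sum (map (countToDepth (d - 1)) (children t))
```

With an expansion such as \n -> [2*n, 2*n+1] and root 1, countToDepth can be called repeatedly with increasing depths while holding on to the root, and each call walks freshly generated children.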
Edit 2
The below code illustrates how this can prevent sharing. Note that even in the version without sharing, there are humps in the profile indicating that the sum and length calls aren't running in constant space. I'd imagine that we'd need an explicit strict accumulation to knock those down.
{-# LANGUAGE DeriveFunctor #-}
import Data.Stream.Branching(Stream(..))
import qualified Data.Stream.Branching as S
import Control.Arrow
import Control.Applicative
import Data.List
data UM s a = UM (s -> Maybe a) deriving Functor
type UStream s a = Stream (UM s) a
runUM s (UM f) = f s
liftUM x = UM $ const (Just x)
nullUM = UM $ const Nothing
buildUStream :: Int -> Int -> Stream (UM ()) Int
buildUStream start end = S.unfold (\x -> (x, go x)) start
  where go x
          | x < end = liftUM (x + 1)
          | otherwise = nullUM
sumUS :: Stream (UM ()) Int -> Int
sumUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + x) x
lengthUS :: Stream (UM ()) Int -> Int
lengthUS x = S.head $ S.scanr (\x us -> maybe 0 id (runUM () us) + 1) x
sumUS' :: Stream (UM ()) Int -> Int
sumUS' x = last $ usToList $ liftUM $ S.scanl (+) 0 x
lengthUS' :: Stream (UM ()) Int -> Int
lengthUS' x = last $ usToList $ liftUM $ S.scanl (\acc _ -> acc + 1) 0 x
usToList x = unfoldr (\um -> (S.head &&& S.tail) <$> runUM () um) x
maxNum = 1000000
nums = buildUStream 0 maxNum
numsL :: [Int]
numsL = [0..maxNum]
-- All these need to be run with increased stack to avoid an overflow.
-- This generates an hp file with two humps (i.e. the list is not shared)
main = print $ div (fromIntegral $ sumUS' nums) (fromIntegral $ lengthUS' nums)
-- This generates an hp file as above, and uses somewhat less memory, at the cost of
-- an increased number of GCs. -H helps a lot with that.
-- main = print $ div (fromIntegral $ sumUS nums) (fromIntegral $ lengthUS nums)
-- This generates an hp file with one hump (i.e. the list is shared)
-- main = print $ div (fromIntegral $ sum $ numsL) (fromIntegral $ length $ numsL)

Resources