Haskell: unexpected time-complexity in the computation involving large lists - performance

I am dealing with a computation whose intermediate result is a list A = [B], i.e. a list of K lists B, each of length L. The time complexity of computing an element of B is controlled by a parameter M and is theoretically linear in M, so I would expect the time complexity of computing A to be O(K*L*M). However, this is not the case, and I don't understand why.
Here is a simple, complete sketch program that exhibits the problem I have described:
import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)

type Point = (Double, Double)

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))

standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = initial : (trajectory map $ map initial)

justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ []     = []

subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m

ensemble :: Int -> [Point]
ensemble n =
    let qs = randoms (mkStdGen 42)
        ps = randoms (mkStdGen 21)
    in take n $ zip qs ps

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) ->
                 Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m =
    parMap rdeepseq ((map observable . subTrace n m) . orbitGen)

main =
    let k = 100
        l = 100
        m = 100
        orbitGen = trajectory (standardMap 7)
        observable (p, q) = p^2 - q^2
        initials = ensemble k
        mean xs = (sum xs) / (fromIntegral $ length xs)
        result = (map mean)
                 $ transpose
                 $ ensembleTrace orbitGen observable l m
                 $ initials
    in mapM_ print result
I am compiling with
$ ghc -O2 stdmap.hs -threaded
and running with
$ ./stdmap +RTS -N4 > /dev/null
on an Intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1, and get the following results
for different sets of the parameters K, L, M (k, l, m in the code of the program):
(K=200,L=200,M=200) -> real 0m0.774s
user 0m2.856s
sys 0m0.147s
(K=2000,L=200,M=200) -> real 0m7.409s
user 0m28.102s
sys 0m1.080s
(K=200,L=2000,M=200) -> real 0m7.326s
user 0m27.932s
sys 0m1.020s
(K=200,L=200,M=2000) -> real 0m10.581s
user 0m38.564s
sys 0m3.376s
(K=20000,L=200,M=200) -> real 4m22.156s
user 7m30.007s
sys 0m40.321s
(K=200,L=20000,M=200) -> real 1m16.222s
user 4m45.891s
sys 0m15.812s
(K=200,L=200,M=20000) -> real 8m15.060s
user 23m10.909s
sys 9m24.450s
I don't quite understand where the problem of such poor scaling might be. If I understand correctly, the lists are lazy and should not be fully constructed, since they are consumed in the head-to-tail direction. As can be observed from the measurements, there is a correlation between the excessive real-time consumption and the excessive system-time consumption, as if the excess were on the system's account. But if some memory management is wasting time, this should still scale linearly in K, L, M.
Help!
EDIT
I made changes to the code according to the suggestions given by Daniel Fisher, which indeed solved the bad scaling with respect to M. As pointed out, by forcing strict evaluation in trajectory we avoid the construction of large thunks. I understand the performance improvement behind that, but I still don't understand the bad scaling of the original code, because (if I understand correctly) the space and time complexity of constructing the thunk should itself be linear in M.
Additionally, I still have problems understanding the bad scaling with respect to K (the size of the ensemble). I performed two additional measurements with the improved code for K=8000 and K=16000, keeping L=200, M=200. Scaling up to K=8000 is as expected, but for K=16000 it is already abnormal. The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably shows up as the bad concurrency, which I quantify as the quotient Q = (MUT cpu time) / (MUT real time), which would ideally be equal to the number of CPUs. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000.
Please help me understand the origin of this problem and the possible solutions.
K = 8000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
56,905,405,184 bytes allocated in the heap
503,501,680 bytes copied during GC
53,781,168 bytes maximum residency (15 sample(s))
6,289,112 bytes maximum slop
151 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s
Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s
Parallel GC work balance: 23.49% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 95.90s ( 24.28s elapsed)
GC time 9.04s ( 2.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 104.95s ( 26.58s elapsed)
Alloc rate 593,366,811 bytes per MUT second
Productivity 91.4% of total user, 360.9% of total elapsed
gc_alloc_block_sync: 315819
and
K = 16000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
113,809,786,848 bytes allocated in the heap
1,156,991,152 bytes copied during GC
114,778,896 bytes maximum residency (18 sample(s))
11,124,592 bytes maximum slop
300 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s
Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s
Parallel GC work balance: 18.05% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 221.77s (139.78s elapsed)
GC time 25.56s ( 7.32s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 247.34s (147.10s elapsed)
Alloc rate 513,176,874 bytes per MUT second
Productivity 89.7% of total user, 150.8% of total elapsed
gc_alloc_block_sync: 814824

M. A. D.'s point about fmod is a good one, but it is not necessary to call out to C, and we can do better staying in Haskell land (the ticket the linked thread was about is meanwhile fixed). The trouble in
fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))
is that type defaulting leads to floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) being called. Now, Integer is a comparatively complicated type, with slow operations, and the conversion from Integer to Double is also relatively complicated. The original code (with parameters k = l = 200 and m = 5000) produced the stats
./nstdmap +RTS -s -N2 > /dev/null
60,601,075,392 bytes allocated in the heap
36,832,004,184 bytes copied during GC
2,435,272 bytes maximum residency (13741 sample(s))
887,768 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s
Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s
Parallel GC work balance: 60.58% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 34.99s ( 17.60s elapsed)
GC time 64.85s ( 32.49s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 99.84s ( 50.08s elapsed)
Alloc rate 1,732,048,869 bytes per MUT second
Productivity 35.0% of total user, 69.9% of total elapsed
on my machine (-N2 because I have only two physical cores). Simply changing the code to use the type annotation floor q :: Int brings that down to
./nstdmap +RTS -s -N2 > /dev/null
52,105,495,488 bytes allocated in the heap
29,957,007,208 bytes copied during GC
2,440,568 bytes maximum residency (10481 sample(s))
893,224 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s
Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s
Parallel GC work balance: 68.64% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.01s ( 0.01s elapsed)
MUT time 29.78s ( 14.94s elapsed)
GC time 49.61s ( 24.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 79.40s ( 39.80s elapsed)
Alloc rate 1,749,864,775 bytes per MUT second
Productivity 37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in elapsed time, 13% in MUT time. Not bad. If we look at the code for floor that you get with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n -> I# (n -# 1#)
        | otherwise           -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0# ->
            case negateInt# e of
              s | s ># 52#  -> if m < 0 then (-1) else 0
                | otherwise ->
                    case TO64 m of
                      n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e
floor :: Double -> Int just uses the machine conversion, while floor :: Double -> Integer needs an expensive decodeDoubleInteger and more branches. But where floor is called here, we know that all involved Doubles are nonnegative, hence floor is the same as truncate, which maps directly to the machine conversion double2Int#, so let's try that instead of floor:
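Concretely, the changed fmod would look roughly like this (a sketch of the change just described, not code from the measurements above):

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (truncate q :: Int))

With that change, the timings become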
INIT time 0.00s ( 0.00s elapsed)
MUT time 29.29s ( 14.70s elapsed)
GC time 49.17s ( 24.62s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 78.45s ( 39.32s elapsed)
a really small reduction (to be expected, the fmod isn't really the bottleneck). For comparison, calling out to C:
INIT time 0.01s ( 0.01s elapsed)
MUT time 31.46s ( 15.78s elapsed)
GC time 54.05s ( 27.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 85.52s ( 42.85s elapsed)
is a bit slower (unsurprisingly, you can execute a number of primops in the time calling out to C takes).
But that's not where the big fish swim. The bad thing is that picking only every m-th element of the trajectories leads to large thunks that cause a lot of allocation and take a long time to evaluate when the time comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)
That reduces the allocations and GC time drastically, and as a consequence also the MUT time:
INIT time 0.00s ( 0.00s elapsed)
MUT time 21.83s ( 10.95s elapsed)
GC time 0.72s ( 0.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.55s ( 11.31s elapsed)
with the original fmod,
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.26s ( 9.18s elapsed)
GC time 0.58s ( 0.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.84s ( 9.47s elapsed)
with floor q :: Int, and within measuring accuracy the same times for truncate q :: Int (the allocation figures are a bit lower for truncate).
The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably shows up as the bad concurrency
Yes (though as far as I know the more correct term here would be parallelism instead of concurrency). There is a spark pool, and when it is full, any further sparks are not scheduled to be evaluated by whatever thread next has time; the computation is then done without parallelism, in the parent thread. In this case that means that after an initial parallel phase, the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
If you watch the CPU load via top, you will see that it drops from (close to 100%)*(number of cores) to a much lower value after a while (for me, it was ~100% with -N2 and ~130% with -N4).
The cure is to avoid sparking too much and to let each spark do some more work. With the quick-and-dirty modification
ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
I'm back to 200% with -N2 for practically the entire run and a good productivity,
INIT time 0.00s ( 0.00s elapsed)
MUT time 57.42s ( 29.02s elapsed)
GC time 5.34s ( 2.69s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 62.76s ( 31.71s elapsed)
Alloc rate 1,982,155,167 bytes per MUT second
Productivity 91.5% of total user, 181.1% of total elapsed
and with -N4 it's also fine (even a wee bit faster on the wall-clock - not much because all threads do basically the same, and I have only 2 physical cores),
INIT time 0.00s ( 0.00s elapsed)
MUT time 99.17s ( 26.31s elapsed)
GC time 16.18s ( 4.80s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 115.36s ( 31.12s elapsed)
Alloc rate 1,147,619,609 bytes per MUT second
Productivity 86.0% of total user, 318.7% of total elapsed
since now the spark pool doesn't overflow.
The proper fix is to make the size of the chunks a parameter that is computed from the number of trajectories and available cores so that the number of sparks doesn't exceed the pool size.
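A minimal sketch of that fix (it assumes the Point and subTrace definitions from the question are in scope; sparksPerCap is just an arbitrary tuning knob, not a measured value):

import GHC.Conc (numCapabilities)
import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import Control.DeepSeq (NFData)

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a)
              -> Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m initials =
    withStrategy (parListChunk chunk rdeepseq)
        $ map ((map observable . subTrace n m) . orbitGen) initials
  where
    -- aim for a handful of sparks per capability, far below the ~8K pool size
    sparksPerCap = 4
    chunk = max 1 (length initials `div` (numCapabilities * sparksPerCap))

With K trajectories this creates only about numCapabilities * sparksPerCap sparks, so the pool cannot overflow regardless of K.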

After doing some quick profiling I found that these are the serial offenders:
ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
>>>
COST CENTRE MODULE %time %alloc
fmod Main 46.3 33.3
standardMap Main 28.5 0.0
trajectory Main 23.8 66.6
What's surprising about fmod is the large number of allocations it does, considering it is mostly a numerical function. So the next step is to annotate fmod to see where the problem is:
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
         | otherwise = {-# SCC "posbranch" #-} if a < b then a
                       else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))
This gives us:
ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
COST CENTRE MODULE %time %alloc
MAIN MAIN 61.5 70.0
posbranch Main 16.6 0.0
floor Main 14.9 30.0
expression Main 4.5 0.0
negbranch Main 1.9 0.0
So the bit with floor is the one causing the issues. After looking around, it turns out that the Prelude does not implement some of the Double RealFrac functions as well as it should (see here), probably causing some boxing/unboxing.
So by following the advice from the link I used a modified version of floor which also made the call to fromIntegral unnecessary:
floor' :: Double -> Double
floor' x = c_floor x
{-# INLINE floor' #-}

foreign import ccall unsafe "math.h floor"
    c_floor :: Double -> Double

fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
         | otherwise = {-# SCC "posbranch" #-} if a < b then a
                       else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))
EDIT:
As Daniel Fisher points out, there is no need to inline C code to improve the performance. An analogous Haskell function already exists. I'll leave the answer anyway, for further reference.
This does make a difference. On my machine, for k=l=200, M=5000, here are the numbers for the non-optimized and the optimized version:
Non optimized:
real 0m20.635s
user 1m17.321s
sys 0m4.980s
Optimized:
real 0m14.858s
user 0m55.271s
sys 0m3.815s
The trajectory function may have similar problems, and you can use profiling as above to pinpoint the issue.
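For example, trajectory could be annotated the same way (a sketch; the SCC names here are made up):

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = ({-# SCC "emit" #-} initial) : ({-# SCC "step" #-} trajectory map (map initial))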
A great starting point for profiling in Haskell can be found in this chapter of Real World Haskell.

Related

Why do nested MaybeT's cause exponential allocation

I have a program.
import Control.Monad
import Control.Monad.Identity
import Control.Monad.Trans.Maybe
import System.Environment

tryR :: Monad m => ([a] -> MaybeT m [a]) -> ([a] -> m [a])
tryR f x = do
  m <- runMaybeT (f x)
  case m of
    Just t  -> return t
    Nothing -> return x

check :: MonadPlus m => Int -> m Int
check x = if x `mod` 2 == 0 then return (x `div` 2) else mzero

foo :: MonadPlus m => [Int] -> m [Int]
foo []     = return []
foo (x:xs) = liftM2 (:) (check x) (tryR foo xs)

runFoo :: [Int] -> [Int]
runFoo x = runIdentity $ tryR foo x

main :: IO ()
main = do
  [n_str] <- getArgs
  let n = read n_str :: Int
  print $ runFoo [2,4..n]
The main interesting thing about this program is that it can have many nested layers of MaybeT's. Here, doing so serves absolutely no purpose, but it did in the original program where I encountered this problem.
Care to take a guess of the time complexity of this program?
Okay, you cheated by reading the title of this question. Yes, it's exponential:
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 50 (03-31 17:15)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
./ExpAlloc 50 8.10s user 0.06s system 99% cpu 8.169 total
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 52 (03-31 17:15)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
./ExpAlloc 52 16.10s user 0.12s system 99% cpu 16.227 total
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 54 (03-31 17:16)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
./ExpAlloc 54 32.32s user 0.23s system 99% cpu 32.561 total
Some further inspection shows the reason is that it allocates an exponential amount of memory, which naturally takes an exponential amount of time:
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 40 +RTS -s (03-31 17:17)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
939,634,520 bytes allocated in the heap
5,382,816 bytes copied during GC
75,808 bytes maximum residency (2 sample(s))
66,592 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1796 colls, 0 par 0.008s 0.009s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.243s ( 0.246s elapsed)
GC time 0.008s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.252s ( 0.256s elapsed)
%GC time 3.2% (3.6% elapsed)
Alloc rate 3,869,930,149 bytes per MUT second
Productivity 96.8% of total user, 95.3% of total elapsed
./ExpAlloc 40 +RTS -s 0.25s user 0.00s system 98% cpu 0.260 total
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 42 +RTS -s (03-31 17:17)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
1,879,159,424 bytes allocated in the heap
10,767,048 bytes copied during GC
95,504 bytes maximum residency (3 sample(s))
71,152 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3593 colls, 0 par 0.016s 0.018s 0.0000s 0.0000s
Gen 1 3 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.493s ( 0.498s elapsed)
GC time 0.016s ( 0.018s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.510s ( 0.517s elapsed)
%GC time 3.1% (3.5% elapsed)
Alloc rate 3,810,430,292 bytes per MUT second
Productivity 96.8% of total user, 95.7% of total elapsed
./ExpAlloc 42 +RTS -s 0.51s user 0.01s system 99% cpu 0.521 total
[jkoppel@dhcp-18-189-103-38:~/tmp]$ time ./ExpAlloc 44 +RTS -s (03-31 17:17)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]
3,758,208,408 bytes allocated in the heap
21,499,312 bytes copied during GC
102,056 bytes maximum residency (5 sample(s))
73,784 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7186 colls, 0 par 0.032s 0.037s 0.0000s 0.0009s
Gen 1 5 colls, 0 par 0.000s 0.001s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.979s ( 0.987s elapsed)
GC time 0.033s ( 0.038s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.013s ( 1.024s elapsed)
%GC time 3.2% (3.7% elapsed)
Alloc rate 3,840,757,815 bytes per MUT second
Productivity 96.7% of total user, 95.6% of total elapsed
./ExpAlloc 44 +RTS -s 1.01s user 0.01s system 99% cpu 1.029 total
I cannot for the life of me figure out why it does this. I'd appreciate any light people could shed on the situation.
The transformers package (currently at version 0.5.4.0) implements MonadTrans using liftM:
lift :: Monad m => m a -> MaybeT m a
lift = MaybeT . liftM Just
where liftM is a combinator defined as
liftM :: Monad m => (a -> b) -> m a -> m b
liftM f m = m >>= \a -> return (f a)
Furthermore, return is defined for MaybeT as
return :: Monad m => a -> MaybeT m a
return = lift . return
Reduce the definition:
return :: Monad m => a -> MaybeT m a
return a = MaybeT (return a >>= \a -> return (Just a))
where the two inner return are instantiated at type m.
One call to return @(MaybeT m) calls return @m twice, hence the exponential behavior you observe for a tower of MaybeT.
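(In other words, if cost(d) is the number of underlying return calls performed by return for a tower of depth d, then cost(d) = 2 * cost(d-1) with cost(0) = 1, so cost(d) = 2^d - which matches the doubling seen in the timings above, where each additional layer doubles the work.)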
This is fixable by using fmap instead of liftM, but that is backwards incompatible with GHC versions in which Functor was not yet a superclass of Monad.
EDIT: Other transformers do not have this issue, as their return is not defined using lift; defining it directly provides an even better fix:
return = MaybeT . return . Just
Here is a more minimal test case:
{-# LANGUAGE RankNTypes, ScopedTypeVariables #-}
import Control.Monad.Trans.Maybe
import System.Environment

f :: forall m proxy. Monad m => proxy m -> Int -> ()
f _ 0 = (return () :: m ()) `seq` ()
f _ n = f (undefined :: proxy (MaybeT m)) (n - 1)

main = do
  n : _ <- getArgs
  f (undefined :: proxy []) (read n) `seq` return ()
Output
> for i in {18..21} ; time ./b $i
./b $i 0.35s user 0.04s system 99% cpu 0.390 total
./b $i 0.71s user 0.07s system 99% cpu 0.782 total
./b $i 1.38s user 0.18s system 99% cpu 1.565 total
./b $i 2.82s user 0.32s system 100% cpu 3.139 total

Using list generator for memory efficient code in Haskell

I would like to get a handle on writing memory-efficient Haskell code. One thing I ran across is that there is no dead-easy way to make Python-style list generators/iterators (that I could find).
Small example:
Finding the sum of the integers from 1 to 100000000 without using the closed form formula.
In Python that can be done quickly, with minimal use of memory, as sum(xrange(100000000)). In Haskell the analogue would be sum [1..100000000]. However, this uses up a lot of memory. I thought using foldl or foldr would be fine, but even that uses a lot of memory and is slower than Python. Any suggestions?
TL;DR - I think the culprit in this case is GHC's defaulting to Integer.
Admittedly I do not know enough about Python, but my first guess would be that Python switches to "bigint" only if necessary - therefore all calculations are done with Int, a.k.a. a 64-bit integer on my machine.
A first check with
$> ghci
GHCi, version 7.10.3: http://www.haskell.org/ghc/ :? for help
Prelude> maxBound :: Int
9223372036854775807
reveals that the result of the sum (5000000050000000) is less than that number so we have no fear of Int overflow.
I guessed your example programs to look roughly like this:
sum.py
print(sum(xrange(100000000)))
sum.hs
main :: IO ()
main = print $ sum [1..100000000]
To make things explicit I added the type annotation (100000000 :: Integer), compiling it with
$ > stack build --ghc-options="-O2 -with-rtsopts=-sstderr"
and ran your example,
$ > stack exec -- time sum
5000000050000000
3,200,051,872 bytes allocated in the heap
208,896 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6102 colls, 0 par 0.013s 0.012s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.725s ( 1.724s elapsed)
GC time 0.013s ( 0.012s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.739s ( 1.736s elapsed)
%GC time 0.7% (0.7% elapsed)
Alloc rate 1,855,603,449 bytes per MUT second
Productivity 99.3% of total user, 99.4% of total elapsed
1.72user 0.00system 0:01.73elapsed 99%CPU (0avgtext+0avgdata 4112maxresident)k
and indeed the ~3 GB of allocation is reproduced.
Changing the annotation to (100000000 :: Int) altered the behaviour drastically:
$ > stack build
$ > stack exec -- time sum
5000000050000000
51,872 bytes allocated in the heap
3,408 bytes copied during GC
44,312 bytes maximum residency (1 sample(s))
17,128 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.034s ( 0.034s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.036s ( 0.035s elapsed)
%GC time 0.2% (0.2% elapsed)
Alloc rate 1,514,680 bytes per MUT second
Productivity 99.4% of total user, 102.3% of total elapsed
0.03user 0.00system 0:00.03elapsed 91%CPU (0avgtext+0avgdata 3496maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps
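For completeness, the Int-annotated program is just the one-liner described above (a sketch):

main :: IO ()
main = print $ sum [1..100000000 :: Int]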
For the interested
The behaviour of the Haskell version does not change a lot if you use libraries like conduit or vector (both boxed and unboxed).
sample programs
sumC.hs
import Data.Conduit
import Data.Conduit.List as CL
main :: IO ()
main = do
  res <- CL.enumFromTo 1 100000000 $$ CL.fold (+) (0 :: Int)
  print res
sumV.hs
import Data.Vector.Unboxed as V
{-import Data.Vector as V-}
main :: IO ()
main = print $ V.sum $ V.enumFromTo (1::Int) 100000000
Funny enough, the version using
main = print $ V.sum $ V.enumFromN (1::Int) 100000000
is doing worse than the above - even though the documentation says otherwise.
enumFromN :: (Unbox a, Num a) => a -> Int -> Vector a
O(n) Yield a vector of the given length containing the values x, x+1
etc. This operation is usually more efficient than enumFromTo.
Update
@Carsten's comment made me curious - so I had a look into the sources for Integer, or rather integer-simple to be precise, because for Integer there exist other implementations, integer-gmp and integer-gmp2, using libgmp.
data Integer = Positive !Positive | Negative !Positive | Naught

-------------------------------------------------------------------
-- The hard work is done on positive numbers
-- Least significant bit is first

-- Positive's have the property that they contain at least one Bit,
-- and their last Bit is One.

type Positive = Digits
type Positives = List Positive

data Digits = Some !Digit !Digits
            | None

type Digit = Word#

data List a = Nil | Cons a (List a)
so when using Integer there is quite a bit of memory overhead compared to Int, or rather to unboxed Int# - which I guess should get optimized (though I have not confirmed that).
So an Integer consists of (if I calculate correctly):
1 x Word for the sum-type tag (here Positive),
n x (Word + Word) for the Some constructors and their Digit parts,
1 x Word for the final None,
i.e. a memory overhead of (2 + floor(log_10(n))) Words for each Integer in that calculation, plus a bit more for the accumulator.

Packed large bit vector with efficient xor and bit count in Haskell

I am looking for an efficient (in both space and time) data type which can hold a 384 bit vector and supports efficient XOR and "bit count" (number of bits set to 1) operations.
Below, please find my demo program. The operations I need are all in the SOQuestionOps type class and I have implemented it for Natural and Data.Vector.Unboxed.Bit. Especially the latter seems perfect as it has a zipWords operation which should allow me to do operations like "bit count" and XOR word-by-word instead of bit-by-bit. Also it claims to store the bits packed (8 bits per byte).
{-# LANGUAGE FlexibleInstances #-}
import Data.Bits
import Data.List (foldl')
import Numeric.Natural
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed.Bit as BV

class SOQuestionOps a where
    soqoXOR :: a -> a -> a
    soqoBitCount :: a -> Int
    soqoFromList :: [Bool] -> a

alternating :: Int -> [Bool]
alternating n =
    let c = n `mod` 2 == 0
    in if n == 0
         then []
         else c : alternating (n-1)

instance SOQuestionOps Natural where
    soqoXOR = xor
    soqoBitCount = popCount
    soqoFromList v =
        let oneIdxs = map snd $ filter fst (zip v [0..])
        in foldl' (\acc n -> acc `setBit` n) 0 oneIdxs

instance SOQuestionOps (BV.Vector BV.Bit) where
    soqoXOR = BV.zipWords xor
    soqoBitCount = BV.countBits
    soqoFromList v = BV.fromList (map BV.fromBool v)

main =
    let initialVec :: BV.Vector BV.Bit
        initialVec = soqoFromList $ alternating 384
        lotsOfVecs = V.replicate 10000000 (soqoFromList $ take 384 $ repeat True)
        xorFolded = V.foldl' soqoXOR initialVec lotsOfVecs
        sumBitCounts = V.foldl' (\n v -> n + soqoBitCount v) 0 lotsOfVecs
    in putStrLn $ "folded bit count: " ++ show (soqoBitCount xorFolded) ++ ", sum: " ++ show sumBitCounts
So let's calculate numbers for the best case: lotsOfVecs shouldn't need to allocate much because it's just 10,000,000 times the same vector initialVec. The foldl obviously creates one of these vectors per fold operation, so it should create 10,000,000 bit vectors. The bit counting shouldn't create anything but 10,000,000 Ints. So in the best case, my program should use very little (and constant) memory and the total allocations should roughly be 10,000,000 * sizeof(bit vector) + 10,000,000 * sizeof(int) = 520,000,000 bytes.
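(With 384 bits = 48 bytes per vector, that is 10,000,000 * 48 bytes + 10,000,000 * 4 bytes = 520,000,000 bytes.)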
Ok, let's run the program for Natural:
let's make initialVec :: Natural, compile with
ghc --make -rtsopts -O3 MemStuff.hs
result (this is with GHC 7.10.1):
$ ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 3840000000
1,280,306,112 bytes allocated in the heap
201,720 bytes copied during GC
80,106,856 bytes maximum residency (2 sample(s))
662,168 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2321 colls, 0 par 0.056s 0.059s 0.0000s 0.0530s
Gen 1 2 colls, 0 par 0.065s 0.069s 0.0346s 0.0674s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.579s ( 0.608s elapsed)
GC time 0.122s ( 0.128s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 0.702s ( 0.738s elapsed)
%GC time 17.3% (17.3% elapsed)
Alloc rate 2,209,576,763 bytes per MUT second
Productivity 82.7% of total user, 78.7% of total elapsed
real 0m0.754s
user 0m0.704s
sys 0m0.037s
which shows 1,280,306,112 bytes allocated in the heap; that's in the ballpark (2x) of the expected figure. Btw, on GHC 7.8 this allocates 353,480,272,096 bytes and runs for absolute ages, as popCount isn't very efficient on GHC 7.8's Naturals.
EDIT: I changed the code a bit. In the original version, every other vector was 0 in the fold, which gave much better allocation figures for the Natural version. I changed it so the vector alternates between two different representations (with many bits set), and now we see 2x the expected allocations. That's another downside of Natural (and Integer): the allocation rate depends on the values.
But maybe we can do better - let's try the densely packed Data.Vector.Unboxed.Bit:
That is, initialVec :: BV.Vector BV.Bit; re-compile and re-run with the same options.
$ time ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 1920000000
75,120,306,536 bytes allocated in the heap
54,914,640 bytes copied during GC
80,107,368 bytes maximum residency (2 sample(s))
664,128 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 145985 colls, 0 par 0.543s 0.627s 0.0000s 0.0577s
Gen 1 2 colls, 0 par 0.065s 0.070s 0.0351s 0.0686s
INIT time 0.000s ( 0.000s elapsed)
MUT time 27.679s ( 28.228s elapsed)
GC time 0.608s ( 0.698s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 28.288s ( 28.928s elapsed)
%GC time 2.1% (2.4% elapsed)
Alloc rate 2,714,015,097 bytes per MUT second
Productivity 97.8% of total user, 95.7% of total elapsed
real 0m28.944s
user 0m28.290s
sys 0m0.456s
That's very slow and roughly 100 times the allocations :(.
Ok, then let's recompile and profile both runs (ghc --make -rtsopts -O3 -prof -auto-all -caf-all -fforce-recomp MemStuff.hs):
The Natural version:
COST CENTRE MODULE %time %alloc
main.xorFolded Main 51.7 76.0
main.sumBitCounts.\ Main 25.4 16.0
main.sumBitCounts Main 12.1 0.0
main.lotsOfVecs Main 10.4 8.0
The Data.Vector.Unboxed.Bit version:
COST CENTRE MODULE %time %alloc
soqoXOR Main 96.7 99.3
main.sumBitCounts.\ Main 1.9 0.2
Is Natural really the best option for a fixed-size bit vector? And what about GHC 7.8? And is there anything better which can implement my SOQuestionOps type class?
Have a look at the Data.LargeWord module in the Crypto package:
http://hackage.haskell.org/package/Crypto-4.2.5.1/docs/Data-LargeWord.html
It provides Bits instances for large words of various sizes, e.g. 96 through 256 bits.
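For a 384-bit value you would combine two of the provided sizes; a rough, untested sketch (it assumes the SOQuestionOps class from the question is in scope and that LargeKey and Word192 are exported as in the linked docs; the bit count goes through testBit so it only relies on the Bits instance being complete):

import Data.Bits (xor, testBit, setBit)
import Data.List (foldl')
import Data.LargeWord (LargeKey, Word192)

-- a 384-bit word assembled from two 192-bit halves
type Word384 = LargeKey Word192 Word192

instance SOQuestionOps Word384 where
    soqoXOR = xor
    soqoBitCount w = length [i | i <- [0 .. 383], testBit w i]
    soqoFromList v =
        let oneIdxs = map snd $ filter fst (zip v [0..])
        in foldl' setBit 0 oneIdxs

Whether this actually beats the Natural version would have to be measured; its main appeal is that, unlike Natural, its allocation behaviour does not depend on the values stored.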

Fastest way to generate a billion random doubles in Haskell

I'm doing Monte-Carlo simulations, and currently using System.Random.
import System.Random

main = do
  g <- newStdGen
  let xs = randoms g :: [Double]
  -- normally, I'd do other magic here
  putStrLn $ show $ length $ take (10^9) xs
Unfortunately, this takes a really long time, at least 5x slower than Python's random.random(), to say nothing of the C rand() call.
With ghc -O2 -optc-ffast-math -optc-O3
import System.Random

main = do
  g <- newStdGen
  let xs = randoms g :: [Double]
  putStrLn $ show $ length $ take (10^7) xs
takes ~8s vs. (in iPython)
import random
%timeit len([random.random() for _ in range(10 ** 7)])
takes ~1.3s. My goal is one billion, but Haskell cannot generate them in a reasonable amount of time.
I also have a C++ program that generates floats with rand(). It does 10^7 samples in 0.2s.
How can I generate random doubles in the range [0-1) quickly in Haskell?
Ideally, the program GHC generates will just blast rand() POSIX calls and collect into a list. The answer with the cleanest & fastest code wins. (No, having 10x the code for 1% speedup isn't worth it.)
Here's Mersenne, which surprisingly seemed to be faster than MWC and beats C++, although we are on different computers ;-). It's tempting to see how much parallelising it would buy, but I had better go back to work.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -Wall #-}
{-# OPTIONS_GHC -fno-warn-name-shadowing #-}
{-# OPTIONS_GHC -fno-warn-type-defaults #-}

import System.Random.Mersenne.Pure64

testUniform :: Int -> Double -> PureMT -> Double
testUniform 0 !x _   = x
testUniform n !x gen =
    testUniform (n - 1) (x + y) gen'
  where
    (y, gen') = randomDouble gen

n :: Int
n = 10^7

total :: Double
total = testUniform n 0 (pureMT $ fromIntegral arbSeed)

arbSeed :: Int
arbSeed = 8

mean :: Double
mean = total / fromIntegral n

main :: IO ()
main = print mean
~/Dropbox/Private/Stochastic $ ./MersennePure +RTS -s
0.4999607889729769
802,924,992 bytes allocated in the heap
164,240 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1634 colls, 0 par 0.00s 0.01s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.11s ( 0.11s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.12s ( 0.12s elapsed)
%GC time 4.2% (5.4% elapsed)
Alloc rate 7,336,065,126 bytes per MUT second
Productivity 95.7% of total user, 93.5% of total elapsed
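Regarding the parallelisation mentioned above, a quick, untested sketch would be to give each chunk its own independently seeded PureMT and combine the partial sums (this changes the random stream, so it only illustrates the mechanics; testUniform is the function from the code above):

import Control.Parallel.Strategies (parMap, rdeepseq)
import System.Random.Mersenne.Pure64 (pureMT)

parMean :: Int -> Int -> Double
parMean chunks perChunk = sum partials / fromIntegral (chunks * perChunk)
  where
    partials = parMap rdeepseq
                   (\seed -> testUniform perChunk 0 (pureMT (fromIntegral seed)))
                   [1 .. chunks]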

improve function performance

I am putting together a small program that checks for solutions to Brocard's Problem, the so-called Brown Numbers, and I first created a draft in Ruby:
class Integer
  def factorial
    f = 1; for i in 1..self; f *= i; end; f
  end
end

boundary = 1000
m = 0

# Brown Numbers - pairs of integers (m,n) where n factorial plus one equals m squared
while m <= boundary
  n = 0
  while n <= boundary
    puts "(#{m},#{n})" if ((n.factorial + 1) == (m ** 2))
    n += 1
  end
  m += 1
end
But I discovered that Haskell is much more appropriate for doing mathematical operations, so I asked a question previously and got an answer pretty quickly on how to translate my Ruby code to Haskell:
results :: [(Integer, Integer)] -- Use `Integer` instead of `Int` to fix overflow issue
results = [(x,y) | x <- [1..1000], y <- [1..1000] , 1 + fac x == y*y]
    where fac n = product [1..n]
I changed that slightly so I can run the same operation from and up to whatever number I want, because the above will do it from 1 up to 1000 (or any hardcoded number), but I would like to be able to decide the interval it should go through, ergo:
pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
    where factorial n = product [1..n]
If possible, I would like some examples or pointers on optimisations for improving the speed of the operations, because at this point if I run this operation for an interval such as [100..10000] it takes a long time (I stopped it after 45mins).
PS: The performance optimisations are to be applied to the Haskell implementation of the calculations (the pairs function), not the Ruby one, in case anyone wonders which function I am talking about.
Well, how would you speed up the Ruby implementation? Even though they're different languages, similar optimizations can be applied, namely memoization and smarter algorithms.
1. Memoization
Memoization prevents you from calculating the factorial over and over.
Here's your version of pairs:
pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
    where factorial n = product [1..n]
How often does factorial get called? Well, we can say that it gets called at least upper - lower times, and since we don't remember the values from previous calls, we actually need (upper - lower)² calls of factorial. Even though a factorial is fairly simple to compute, it doesn't come for free.
What if we instead generate an infinite list of factorials and simply pick the right ones?
pairsMem :: (Integer, Integer) -> [(Integer, Integer)]
pairsMem (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
    where factorial = (factorials!!) . fromInteger
          factorials = scanl (*) 1 [1..]
Now factorials is the list [1,1,2,6,24,…], and factorial simply looks up the corresponding value. How do both versions compare?
Your version
main = print $ pairs (0,1000)
> ghc --make SO.hs -O2 -rtsopts > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
204,022,149,768 bytes allocated in the heap
220,119,948 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
20,308 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 414079 colls, 0 par 2.39s 2.23s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 67.33s ( 67.70s elapsed)
GC time 2.39s ( 2.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 69.72s ( 69.93s elapsed)
%GC time 3.4% (3.2% elapsed)
Alloc rate 3,030,266,322 bytes per MUT second
Productivity 96.6% of total user, 96.3% of total elapsed
Around 68 seconds.
pairsMem
main = print $ pairsMem (0,1000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
551,558,988 bytes allocated in the heap
644,420 bytes copied during GC
231,120 bytes maximum residency (2 sample(s))
71,504 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1159 colls, 0 par 0.00s 0.01s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.17s ( 2.18s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.17s ( 2.18s elapsed)
%GC time 0.0% (0.3% elapsed)
Alloc rate 253,955,217 bytes per MUT second
Productivity 100.0% of total user, 99.5% of total elapsed
Around two seconds or only 3% of the original time. Not bad for an almost trivial change. However, as you can see, we use twice the memory. After all, we're saving the factorials in a list. However, the total amount of allocated bytes is 0.27% of the non-memoized variant, since we don't need to regenerate the product.
pairsMem (100,10000)
What about large numbers? You said that with (100,10000) you stopped after 45 minutes. How fast is the memoized version?
main = print $ pairsMem (100,10000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
… 20 minutes later Ctrl+C…
That still takes too long. What else can we do?
2. Smarter pairs
Let's go back to the drawing board. You're checking all pairs (n,m) in (lower,upper). Is this reasonable?
Actually, no, since factorials grow tremendously fast. So for any natural number m, let f(m) be the greatest natural number such that f(m)! <= m. Now, for any m, we only need to check the first f(m) factorials - all others will be greater.
Just for the record, f(10^100) is 70.
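(As a sketch, the number of factorials worth checking can itself be computed with the same takeWhile/scanl pattern used below; for 10^100 this gives the 70 quoted above:

f :: Integer -> Int
f m = length $ takeWhile (<= m) $ scanl (*) 1 [1..]

but the code below never needs it explicitly - it just takes factorials while they are small enough.)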
Now the strategy is clear: we generate as many factorials as we need and simply check if m * m - 1 is in the list of factorials:
import Data.Maybe (isJust)
import Data.List (elemIndex)
pairsList :: (Integer, Integer) -> [(Integer, Integer)]
pairsList (lower, upper) = [(m, fromIntegral ret)
| m <- [lower..upper],
let l = elemIndex (m*m - 1) fs,
isJust l,
let Just ret = l
]
where fs = takeWhile (<upper*upper) $ scanl (*) 1 [1..]
How well does this version fare against pairsMem?
main = print $ pairsList (1, 10^8)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
21,193,518,276 bytes allocated in the heap
2,372,136 bytes copied during GC
58,672 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 40823 colls, 0 par 0.06s 0.11s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 38.17s ( 38.15s elapsed)
GC time 0.06s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 38.23s ( 38.26s elapsed)
%GC time 0.2% (0.3% elapsed)
Alloc rate 555,212,922 bytes per MUT second
Productivity 99.8% of total user, 99.8% of total elapsed
Alright, down to 40s. But what if we use a data structure which provides an even more efficient lookup?
3. Using the correct data structure
Since we want efficient lookup, we're going to use Set. The function almost stays the same, however, fs is going to be Set Integer, and the lookup is done via lookupIndex:
import Data.Maybe (isJust)
import qualified Data.Set as S

pairsSet :: (Integer, Integer) -> [(Integer, Integer)]
pairsSet (lower, upper) = [(m, 1 + fromIntegral ret)
                          | m <- [lower..upper],
                            let l = S.lookupIndex (m*m - 1) fs,
                            isJust l,
                            let Just ret = l
                          ]
    where fs = S.fromList . takeWhile (<upper*upper) $ scanl (*) 1 [1..]
And here the performance of pairsSet:
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
18,393,520,096 bytes allocated in the heap
2,069,872 bytes copied during GC
58,752 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35630 colls, 0 par 0.06s 0.08s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.52s ( 18.52s elapsed)
GC time 0.06s ( 0.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.58s ( 18.60s elapsed)
%GC time 0.3% (0.4% elapsed)
Alloc rate 993,405,304 bytes per MUT second
Productivity 99.7% of total user, 99.5% of total elapsed
This concludes our journey through optimization. By the way, we have reduced the complexity from 𝒪(n³) to 𝒪(n log n) since our data structure gives us a logarithmic search.
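(Roughly: the naive version computes a factorial - a product of up to n numbers - for each of the (upper - lower)² pairs, hence 𝒪(n³), while the Set version does a single logarithmic lookup per m, hence 𝒪(n log n).)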
From your code, it seems that memoization could be used to speed up the calculation of the factorial.
For every m, the code needs to compute every n's factorial, which I think is unnecessary.
