Benefits of differential lists with lazy evaluation - performance

I struggle to understand why ++ is considered O(n) while differential lists are considered "O(1)".
In case of ++ let's assume it's defined as:
(++) :: [a] -> [a] -> [a]
(a:as) ++ b = a:(as ++ b)
[] ++ b = b
Now if we need to access the first element of a ++ b, we can do it in O(1) (assuming that a can be brought to HNF in one step), and similarly for the second element, and so on. This changes when several lists are appended: access becomes Ω(1)/O(m), where m is the number of unevaluated appends. Accessing the last element takes Θ(n + m), where n is the length of the list, unless I missed something. With a differential list, access to the first element is also Θ(m), while the last element is Θ(n + m).
What am I missing?

Performance in theory
The O(1) refers to the fact that append for DLists is just (.) which takes one reduction, whereas (++) is O(n).
Worst case
++ has quadratic performance when you use it to repeatedly add to the end of an existing string, because each time you add another list you iterate through the existing list, so
"Existing long ...... answer" ++ "newbit"
traverses "Existing long ....... answer" each time you append a new bit.
On the other hand,
("Existing long ..... answer" ++ ) . ("newbit"++)
is only going to actually traverse "Existing long ...... answer" once, when the function chain is applied to [] to convert to a list.
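To make the comparison concrete, here is a minimal hand-rolled difference list. It is only a sketch: the names DList, fromList, toList, append and example are chosen for illustration and this is not the dlist package itself.

```haskell
-- A list represented as a function that prepends its contents to whatever follows.
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

-- Converting back applies the whole function chain to [] exactly once.
toList :: DList a -> [a]
toList (DList f) = f []

-- Appending is function composition: no list is traversed at this point.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)

-- Left-nested appends, which is the bad case for plain (++).
example :: String
example =
  toList ((fromList "Existing long answer" `append` fromList " newbit")
            `append` fromList " more")
```

Each append only builds a composition; the underlying (++) calls run when toList supplies the final [].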
Experience says
Years ago, when I was a young Haskeller, I wrote a program that searched for a counterexample to a conjecture, so it was meant to output data to disk constantly until I stopped it. Once I took off the testing brakes, however, it output precisely nothing, because of my left-associative tail-recursive build-up of a string. I realised my program was insufficiently lazy: it couldn't output anything until it had appended the final string, but there was no final string! I rolled my own DList (this was in the millennium preceding the one in which the DList library was written), and lo, my program ran beautifully and happily churned out reams and reams of non-counterexamples on the server for days until we gave up on the project.
If you mess with large enough examples, you can see the performance difference, but it doesn't matter for small finite output. It certainly taught me the benefits of laziness.
Toy example
Silly example to prove my point:
plenty f = f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f.f
alot f = plenty f.plenty f.plenty f
Let's do the two sorts of appending, first the DList way
compose f = f . ("..and some more.."++)
append xs = xs ++ "..and some more.."
insufficiently_lazy = alot append []
sufficiently_lazy = alot compose id []
gives:
ghci> head $ sufficiently_lazy
'.'
(0.02 secs, 0 bytes)
ghci> head $ insufficiently_lazy
'.'
(0.02 secs, 518652 bytes)
and
ghci> insufficiently_lazy
-- (much output skipped)
..and some more....and some more....and some more.."
(0.73 secs, 61171508 bytes)
ghci> sufficiently_lazy
-- (much output skipped)
..and some more....and some more....and some more.."
(0.31 secs, 4673640 bytes).
-- less than a tenth the space and half the time
so it's faster in practice as well as in theory.

DLists are often most useful if you're repeatedly appending list fragments. To wit,
foldl1 (++) [a,b,c,d,e] == (((a ++ b) ++ c) ++ d) ++ e
is really bad while
foldr1 (++) [a,b,c,d,e] == a ++ (b ++ (c ++ (d ++ e)))
still is n steps away from the nth position. Unfortunately, you often build strings by traversing a structure and appending to the end of the accumulating string, so the left fold scenario isn't uncommon. For this reason, DLists are most useful in situations where you're repeatedly building up a string such as the Blaze/ByteString Builder libraries.
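The left-fold scenario can be sketched as follows. The names concatLeft and concatLeftD are made up for this illustration; the second version keeps the left fold but accumulates a function instead of a list.

```haskell
-- Left-nested (++): every step walks the whole accumulated prefix again.
concatLeft :: [[a]] -> [a]
concatLeft = foldl (++) []

-- Same left fold, accumulating a prepend function instead.
-- For [a,b,c] the result is ((id . (a ++)) . (b ++)) . (c ++) applied to [],
-- which unfolds to a ++ (b ++ (c ++ [])), i.e. right-nested appends.
concatLeftD :: [[a]] -> [a]
concatLeftD xss = foldl (\acc xs -> acc . (xs ++)) id xss []
```

Both functions return the same list; only the nesting of the underlying (++) calls differs.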

[After further thinking and reading other answers I believe I know what went wrong - but I don't think either explained it fully so I'm adding my own.]
Assume you had the lists a1:a2:[] b1:b2:[] and c1:c2:[]. Now you append them (a ++ b) ++ c. That gives:
(a1:a2:[] ++ b1:b2:[]) ++ c1:c2:[]
Now to take the head you need O(m) steps, where m is the number of appends. This gives thunks as follows:
a1:((a2:[] ++ b1:b2:[]) ++ c1:c2:[])
To get the next element you need to perform m or m-1 steps (I had assumed this to be free in my reasoning). So after 2m or 2m-1 steps the view is as follows:
a1:a2:(([] ++ b1:b2:[]) ++ c1:c2:[])
And so on. In the worst case it takes m*n time to traverse the list, because the chain of thunks is traversed again for every element.
EDIT - it looks like the answer to the duplicate has even better pictures.

Related

Find bottleneck in Scala merge algorithm

I am learning Scala and as a starting point I am trying to write a mergeSort algorithm. I am having a problem with the performance of the merge part of it.
I know that there are other implementations on this site but I would like to know why mine is not working well.
This is my code:
import scala.annotation.tailrec
@tailrec
def merge(l1:List[Int], l2:List[Int], acc:List[Int]): List[Int] = {
if(l1.isEmpty || l2.isEmpty) l1 ++ l2 ++ acc
else if(l1.last> l2.last) merge(l1.init, l2, l1.last :: acc)
else merge(l1, l2.init, l2.last :: acc)
}
val a1 = List(1,4,65,52151)
val a2 = List(2,52,124,5251,124125125)
println(merge(a1, a2, List()))
As you can see, the merge function is tail recursive, and (if I am not wrong) the list methods that I am using should take constant time.
The code gets very slow with a list of 100000 elements.
last and init are terribly expensive on List: O(N). The efficient operations are head and tail: O(1). If you can't work at the start, reverse the lists up front (O(N) but just once, not at each iteration), or reverse your output at the end, but you need to work at the beginning of the list.
The best way to find bottlenecks is with a profiler. I understand netbeans has a free one; if you can get jprofiler or yourkit they're very nice to use. In this specific case I'd point out that last and init are O(n), because List is a (singly) linked list.
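The advice to work only at the front of the list can be sketched like this. It is written in Haskell, the language used for the other examples on this page, rather than Scala, and the name mergeAsc is made up for illustration. It assumes both inputs are sorted in ascending order.

```haskell
-- Merge two ascending lists while only ever inspecting the heads (O(1) per step).
-- The accumulator is built in reverse and flipped once at the end.
mergeAsc :: [Int] -> [Int] -> [Int]
mergeAsc = go []
  where
    go acc [] ys = reverse acc ++ ys
    go acc xs [] = reverse acc ++ xs
    go acc (x:xs) (y:ys)
      | x <= y    = go (x : acc) xs (y : ys)
      | otherwise = go (y : acc) (x : xs) ys
```

The single reverse at the end costs O(N) once, instead of paying O(N) for last and init on every iteration.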

Improving performance on chunked lists

I have a simple problem: Given a list of integers, read the first line as N. Then, read the next N lines and return the sum of them. Repeat until N = 0.
My first approach was using this:
main = interact $ unlines . f . (map read) . lines
f::[Int] -> [String]
f (n:ls)
  | n == 0 = []
  | otherwise = [show rr] ++ (f rest)
  where (xs, rest) = splitAt n ls
        rr = sum xs
f _ = []
But it's relatively slow. I've profiled it using
ghc -O2 --make test.hs -prof -auto-all -caf-all -fforce-recomp -rtsopts
time ./test +RTS -hc -p -i0.001 < input.in
Where input.in is a test input where the first line is 100k, followed by 100k random numbers, followed by 0. We can see in the Figure below that it's using O(N) memory:
EDITED: My original question was comparing 2 similarly slow approaches. I've updated it to compare with an optimized approach below
Now, if I do the sum iteratively, instead of calling sum, I get a constant amount of memory:
{-# LANGUAGE BangPatterns #-}
main = interact $ unlines . g . (map read) . lines
g::[Int] -> [String]
g (n:ls)
  | n == 0 = []
  | otherwise = g' n ls 0
g _ = []
g' n (l:ls) !cnt
  | n == 0 = [show cnt] ++ (g (l:ls))
  | otherwise = g' (n-1) ls (cnt + l)
I'm trying to understand what is causing the performance loss in the first example. I would guess everything there could be lazily evaluated?
I don't know precisely what is causing the difference. But I can show you this:
Data.Map> sum [1 .. 1e8]
Out of memory.
Data.Map> foldl' (+) 0 [1 .. 1e8]
5.00000005e15
For some reason, sum = foldl (+) 0, rather than foldl' (with the apostrophe). The difference is that the latter function is more strict, so it uses virtually no memory. The lazy version, by contrast, does this:
sum [1..100]
1 + sum [2..100]
1 + 2 + sum [3..100]
1 + 2 + 3 + sum [4..100]
...
In other words, it creates a giant expression that says 1 + 2 + 3 + ... And then, right at the end, it tries to evaluate it all. Well, obviously, that's going to eat a lot of RAM. By using foldl' instead of foldl, you make it do the additions immediately, rather than pointlessly storing them in RAM.
You probably also want to do I/O using ByteString rather than String; but the laziness difference will probably give you a big speed boost on its own.
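Applied to the chunked-sum problem, the foldl' suggestion might look like the sketch below. The name sumChunks is hypothetical; it keeps the shape of f from the question and only swaps sum for a strict left fold.

```haskell
import Data.List (foldl')

-- Read a count n, sum the next n numbers strictly, repeat until n == 0.
sumChunks :: [Int] -> [Int]
sumChunks (n:ls)
  | n == 0    = []
  | otherwise = foldl' (+) 0 xs : sumChunks rest
  where (xs, rest) = splitAt n ls
sumChunks _ = []
```

This removes the chain of pending additions inside each sum. Whether it reaches the constant memory use of the hand-written loop is not something the answers here establish, since xs and rest still come from the same splitAt.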
I think that laziness is what prevents your first and second version from being equivalent.
Consider the result created from the input "numbers"
1
garbage_here
2
3
5
0
The first version would give a result list [error "...some parse error", 8], which you can safely look at the second element of, while the second version errors near immediately. It seems hard to achieve the first in a streaming way.
Even without laziness, though, getting from the first to the second version may be more than GHC can handle - it would need to have fusion rewriting rules combining foldl/foldl' on the first element of a tuple with splitAt. And GHC has only recently got to the point where it can fuse foldl/foldl' at all.

Reversible permutations algorithm

Which algorithm for permuting a list is predictable?
For example, I can get the i-th permutation from its number
(Haskell code)
--List of all possible permutations
permut [] = [[]]
permut xs = [x:ys|x<-xs,ys<-permut (delete x xs)]
--In command line call:
> permut "abc" !! 2
"bac"
but I don't know how to reverse it.
I want to do something like this:
> getNumOfPermut "abc" "bac"
2
Any reversible algorithm goes!
Thank you in advance!
Okay, I wanted to wait until you answered my question about what you had tried, but I had so much fun working out the answer that I just had to write it up and share it. Nerd sniping, I guess! I'm sure I'm not the first to have invented the algorithm below, but I hope you enjoy the presentation.
Our first step is to give an actual runnable implementation of permut (which you have not done). Our implementation strategy will be a simple one: choose some element of the list, choose some permutation of the remaining elements, and concatenate the two.
chooseFrom [] = []
chooseFrom (x:xs) = (x,xs) : [(y, x:ys) | (y, ys) <- chooseFrom xs]
permut [] = [[]]
permut xs = do
  (element, remaining) <- chooseFrom xs
  permutation <- permut remaining
  return (element:permutation)
If we run this on a sample list, it's pretty clear how it behaves:
> permut [0..3]
[[0,1,2,3],[0,1,3,2],[0,2,1,3],[0,2,3,1],[0,3,1,2],[0,3,2,1],[1,0,2,3],[1,0,3,2],[1,2,0,3],[1,2,3,0],[1,3,0,2],[1,3,2,0],[2,0,1,3],[2,0,3,1],[2,1,0,3],[2,1,3,0],[2,3,0,1],[2,3,1,0],[3,0,1,2],[3,0,2,1],[3,1,0,2],[3,1,2,0],[3,2,0,1],[3,2,1,0]]
The result has a lot of structure; for example, if we group by the first element of the contained lists, there are four groups, each containing 6 (which is 3!) elements:
> mapM_ print $ groupBy ((==) `on` head) it
[[0,1,2,3],[0,1,3,2],[0,2,1,3],[0,2,3,1],[0,3,1,2],[0,3,2,1]]
[[1,0,2,3],[1,0,3,2],[1,2,0,3],[1,2,3,0],[1,3,0,2],[1,3,2,0]]
[[2,0,1,3],[2,0,3,1],[2,1,0,3],[2,1,3,0],[2,3,0,1],[2,3,1,0]]
[[3,0,1,2],[3,0,2,1],[3,1,0,2],[3,1,2,0],[3,2,0,1],[3,2,1,0]]
So! The first digit of the list tells us "how many 6s to add". Additionally, each list in the above grouping exhibits similar structure: the lists in the first group have three groups of 2! elements each containing 1, 2, and 3 as their second element; the lists in each of those groups have 2 groups of 1! elements each starting with each of the remaining digits; and each of those groups have 1 group(s) of 0! elements each starting with the only remaining digit. So the second digit tells us "how many 2s to add", the third digit tells us "how many 1s to add", and the last digit tells us "how many 1s to add" (but always tells us to add 0 1s).
If you have implemented a change-of-base function on numbers before (e.g. decimal to hexadecimal or similar) you may recognize this pattern. Indeed, we can treat this as a change-of-base operation with a sliding base: instead of 1s, 10s, 100s, 1000s, and so on columns, we have 0!s, 1!s, 2!s, 3!s, 4!s, and so on columns. Let's write it! For efficiency, we'll compute all the sliding bases up front with a factorials function.
import Data.List
factorials n = scanr (*) 1 [n,n-1..1]
deleteAt i xs = case splitAt i xs of (b, e) -> b ++ drop 1 e
permutIndices permutation original
  = go (factorials (length permutation - 1))
       permutation
       original
  where
  go _ [] [] = [0]
  go _ [] _ = []
  go _ _ [] = []
  go (base:bases) (x:xs) ys = do
    i <- elemIndices x ys
    remainder <- go bases xs (deleteAt i ys)
    return (i*base + remainder)
  go [] _ _ = error "the impossible happened!"
Here's a sample sanity-check:
> map (`permutIndices` [1..4]) (permut [1..4])
[[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23]]
And, for fun, here you can see it handling ambiguity correctly:
> permutIndices "acbba" "aabbc"
[21,23,45,47]
> map (permut "aabbc"!!) it
["acbba","acbba","acbba","acbba"]
...and showing that it's significantly more efficient than elemIndices:
> :set +s
> elemIndices "zyxwvutsr" (permut "rstuvwxyz")
[362879]
(2.65 secs, 1288004848 bytes)
> permutIndices "zyxwvutsr" "rstuvwxyz"
[362879]
(0.00 secs, 1030304 bytes)
Less than one thousandth the allocation/time. Seems like a win!
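The same sliding-base idea also runs in the other direction: given an index, build the permutation directly without generating the whole list. The sketch below uses a made-up name, nthPermut, and assumes 0 <= i < n! for a list of length n; it is meant to follow the order that permut above produces.

```haskell
-- Build the i-th permutation (0-based) directly.
-- The quotient by (n-1)! picks the position of the first element,
-- and the remainder indexes into the permutations of what is left.
nthPermut :: Int -> [a] -> [a]
nthPermut _ [] = []
nthPermut i xs = x : nthPermut r (before ++ after)
  where
    base = product [1 .. length xs - 1]   -- (n-1)!
    (q, r) = i `divMod` base
    (before, x : after) = splitAt q xs    -- fails if i is out of range
```

For example, nthPermut 2 "abc" picks position 1 ('b') because 2 `divMod` 2 is (1, 0), then continues with index 0 on "ac".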
So, to be clear, you are looking for a way to find the position of a given permutation-
"bac"
in a list of given permutations-
["abc", "acb", "bac", ....]
This problem actually has nothing inherently to do with permutations themselves. You want to find the location of an element in an array.
As @raymonad mentioned in his comment, stackoverflow.com/questions/20641772/ deals with this question, and the answer there was to use elemIndex.
elemIndex thePermutionToFind $ permut theString
Keep in mind that if letters repeat, a value might appear more than once in the output if your "permut" function doesn't remove these duplicates (i.e. note that permut "aa" = ["aa", "aa"]). In this case the elemIndices function will come in useful.
If elemIndex returns Nothing, it means the string you supplied wasn't a permutation.
(This isn't the most efficient algorithm for large strings, since the number of permutations grows like the factorial of the size of the string, which is worse than exponential.)

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
-- pascal n returns the nth entry on the main diagonal of pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n = p ! (n,n) where
  p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
  f :: (Int,Int) -> Int
  f (_,0) = 1
  f (0,_) = 1
  f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST Monad, which I'm sure could be made to run the program only slightly slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array
Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)
pascal :: Int -> Int
pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
  go !i !prevRow
    | i <= n = go (i+1) (scanl f 1 (tail prevRow))
    | otherwise = prevRow ! n
  f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.
1 Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem: the values are written and read directly, without requiring further evaluation. The same would happen with an STUArray. Does
p = runSTUArray $ do
    arr <- newArray ((0,0),(n,n)) 1
    forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
            a <- readArray arr (i,j-1)
            b <- readArray arr (i-1,j)
            writeArray arr (i,j) $! (a+b) `rem` 1000000
    return arr
really look so bad?
2 Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)
gets the Haskell versions on par with C.
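A quick way to convince yourself that the multiply-and-shift trick agrees with rem is to spot-check it. The sketch below repeats fastMod from above so that it loads on its own; the name agrees and the sample values are chosen for illustration, and 64-bit Ints are assumed, as the comments on fastMod require.

```haskell
import Data.Bits (shiftR)

-- fast modulo 1000000 for nonnegative Ints < 2^31 (requires 64-bit Ints).
-- 1125899907 is 2^50 / 1000000 rounded up, so the shifted product is n `quot` 1000000.
fastMod :: Int -> Int
fastMod n = n - 1000000 * ((n * 1125899907) `shiftR` 50)

-- Compare against rem on a few values, including both ends of the supported range.
agrees :: Bool
agrees = all (\n -> fastMod n == n `rem` 1000000) samples
  where
    samples = [0, 1, 999999, 1000000, 1000001, 1999998, 123456789, 2147483647]
```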
The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
import qualified Data.Vector.Unboxed as V
combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000
pascal n = V.last $ go n where
  go 0 = V.replicate (n+1) 1
  go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
pascal' :: Integer -> Integer
pascal' n = product [n+1..n*2] `div` product [2..n]
pascal :: Int -> Int
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral
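The closed form works because entry (n,n) of this table is the central binomial coefficient C(2n, n). As a sanity check, the sketch below compares it with a row-by-row computation on plain lists. The names pascalRows and pascalClosed are made up here, and the mod 1000000 step is left out so the two can be compared exactly on Integers.

```haskell
-- Row-by-row: start from a row of ones; each next row is the running sum
-- of the previous one (the same recurrence as the scanl1 version above).
pascalRows :: Int -> Integer
pascalRows n = last (iterate (scanl1 (+)) (replicate (n + 1) 1) !! n)

-- Closed form: C(2n, n) = (n+1) * ... * (2n) / n!
pascalClosed :: Integer -> Integer
pascalClosed n = product [n + 1 .. 2 * n] `div` product [2 .. n]
```

For n = 2 the rows are [1,1,1], [1,2,3], [1,3,6], and the closed form gives (3*4) `div` 2, so both yield 6.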

variant of pascal's triangle in haskell - problem with lazy evaluation

To solve some problem I need to compute a variant of the pascal's triangle which is defined like this:
f(1,1) = 1,
f(n,k) = f(n-1,k-1) + f(n-1,k) + 1 for 1 <= k < n,
f(n,0) = 0,
f(n,n) = 2*f(n-1,n-1) + 1.
For n given I want to efficiently get the n-th line (f(n,1) .. f(n,n)). One further restriction: f(n,k) should be -1 if it would be >= 2^32.
My implementation:
next :: [Int64] -> [Int64]
next list@(x:_) = x+1 : takeWhile (/= -1) (nextRec list)
nextRec (a:rest@(b:_)) = boundAdd a b : nextRec rest
nextRec [a] = [boundAdd a a]
boundAdd x y
  | x < 0 || y < 0 = -1
  | x + y + 1 >= limit = -1
  | otherwise = (x+y+1)
-- start should be [1]
fLine d start = until ((== d) . head) next start
The problem: for very large numbers I get a stack overflow. Is there a way to force Haskell to evaluate the whole list? It's clear that each line can't contain more elements than an upper bound, because they eventually become -1 and don't get stored, and each line only depends on the previous one. Due to the lazy evaluation, only the head of each line is computed until the last line needs its second element, and all the thunks along the way are stored...
I have a very efficient implementation in c++ but I am really wondering if there is a way to get it done in haskell, too.
Works for me: What Haskell implementation are you using? A naive program to calculate this triangle works fine for me in GHC 6.10.4. I can print the 1000th row just fine:
nextRow :: [Integer] -> [Integer]
nextRow row = 0 : [a + b + 1 | (a, b) <- zip row (tail row ++ [last row])]
tri = iterate nextRow [0]
main = putStrLn $ show $ tri !! 1000 -- print 1000th row
I can even print the first 10 numbers in row 100000 without overflowing the stack. I'm not sure what's going wrong for you. The global name tri might be keeping the whole triangle of results alive, but even if it is, that seems relatively harmless.
How to force order of evaluation: You can force thunks to be evaluated in a certain order using the Prelude function seq (which is a magic function that can't be implemented in terms of Haskell's other basic features). If you tell Haskell to print a `seq` b, it first evaluates the thunk for a, then evaluates and prints b.
Note that seq is shallow: it only does enough evaluation to force a to no longer be a thunk. If a is of a tuple type, the result might still be a tuple of thunks. If it's a list, the result might be a cons cell having thunks for both the head and the tail.
It seems like you shouldn't need to do this for such a simple problem; a few thousand thunks shouldn't be too much for any reasonable implementation. But it would go like this:
-- Evaluate a whole list of thunks before calculating `result`.
-- This returns `result`.
seqList :: [b] -> a -> a
seqList lst result = foldr seq result lst
-- Exactly the same as `nextRow`, but compute every element of `row`
-- before calculating any element of the next row.
nextRow' :: [Integer] -> [Integer]
nextRow' row = row `seqList` nextRow row
tri = iterate nextRow' [0]
The fold in seqList basically expands to lst!!0 `seq` lst!!1 `seq` lst!!2 `seq` ... `seq` result.
This is much slower for me when printing just the first 10 elements of row 100,000. I think that's because it requires computing 99,999 complete rows of the triangle.
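The shallowness of seq described above can be seen in a tiny sketch. The name shallow is made up for this illustration. The list's elements would crash if they were ever evaluated, yet forcing the list with seq and then asking for its length is fine, because neither touches the elements.

```haskell
-- seq evaluates only as far as the outermost constructor (here, the first cons cell).
-- length walks the spine of the list but never looks at the elements.
shallow :: Bool
shallow = xs `seq` (length xs == 2)
  where
    xs :: [Int]
    xs = [undefined, undefined]
```

This is why seqList above folds seq over every element: forcing the list alone would leave the element thunks in place.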

Resources