Related
I have defined a data type called FS the following way:
type Name= String
data Ext where { Txt::Ext ; Mp3::Ext ; Jar::Ext ; Doc::Ext ; Hs::Ext }
deriving (Eq, Show)
data FS where { A :: (Name,Ext) -> FS;
Dir :: Name-> [FS] -> FS }
deriving (Eq, Show)
(A stands for file and Dir for directory)
And I'm trying to make a function that given a FS (directory) it returns the same FS but ordered alphabetically at all levels, my attempt so far is the following:
orderFS :: FS-> FS
orderFS (Dir x y) = Dir x (map orderFS (sort y));
orderFS (A (x,y)) = A (x,y);
The only piece I'm missing is a function called "sort" that takes a [FS] and returns it ordered alphabetically by the Name field.
I read that there are functions like sort from Data.List that can help, but I have to do this without using anything else than Prelude.
So how should I implement such function? Thanks in advance
I do not believe that there are any sorting functions in Prelude but not in a module like Data.List. Note that Data.List is in the base package that is part of GHC, so in basically any situation where Prelude is available, I would imagine that Data.List would be as well---you shouldn't need to download/include any other packages in order to use it.
That said, if you do want to write your own sorting function, you are probably best off taking an existing simple sorting algorithm and using it. There are very neat/simple ways of writing quicksorts and merge sorts in Haskell, although the obvious implementations sometimes don't have the same exact performance characteristics as you would expect. Merge sort, for example, has roughly the same asymptotics, but partitioning the list into two actually takes some time, since the list is singly-linked and you therefore have to walk through half of it in order to split it. But, it can be a very nice short function that looks a lot like the algorithm, and is probably worth doing as a learning exercise.
Also, I noticed that you are defining your Ext and FS types as GADTs, which I'm not really sure about the motivation for; I would suggest using the non-GADT syntax, which is much simpler for this example:
type Name = String
data Ext = Txt | Mp3 | Jar | Doc | Hs deriving (Eq, Show)
data FS = A Name Ext | Dir Name [FS] deriving (Eq, Show)
In order to sort them by name, it would probably be worth writing a simple accessor function that can get the name of an FS:
name :: FS -> Name
name (A n _) = n
name (Dir n _) = n
Another approach would be to factor out the thing (Name) that is common to both cases:
data FS = NamedFS { name :: Name, fs :: UnnamedFS }
data UnnamedFS = A Ext | Dir [FS]
The first entry here uses record syntax, which, among other things, will automatically make a name :: FS -> Name accessor, as well as an fs :: FS -> UnnamedFS.
For the actual sort, it looks a lot like the algorithmic description of merge sort. To start with, let's write a function to divide a list in two:
split :: [a] -> ([a], [a])
split xs = splitAt (length xs `div` 2) xs
We also need a function to merge two sorted lists:
merge :: Ord a => [a] -> [a] -> [a]
merge [] x = x
merge x [] = x
merge (x:xs) (y:ys) | x < y = x:merge xs (y:ys)
| otherwise = y:merge (x:xs) ys
Actually, this is not what we want, because it always uses the < from the Ord instance; instead, we want something that takes in a comparison function. In this case, we assume that if the comparison function returns true when called with x and y, x is conceptually less than y.
merge :: (a -> a -> Bool) -> [a] -> [a] -> [a]
merge _ [] x = x
merge _ x [] = x
merge le (x:xs) (y:ys) | x `le` y = x:merge le xs (y:ys)
| otherwise = y:merge le (x:xs) ys
Now, we can implement mergesort like usual:
mergeSort :: (a -> a -> Bool) -> [a] -> [a]
mergeSort _ [] = []
mergeSort _ [x] = [x]
mergeSort f l = merge f (mergeSort f left) (mergeSort f right)
where (left, right) = split l
And just call it like:
-- fss is some [FS]
mergeSort (\x y -> name x < name y) fss
If we could use Data.Ord, this could be further reduced to:
mergeSort (comparing name) fss
The function in Data.List that could help you is sortOn
this together with a function getName :: FS -> Name would allow you to sort by comparing the Names.
If you cannot use functions from Data.List you will have to implement a sorting algorithm yourself (of which there are many to choose from). One example is "QuickSort" as implemented in the Learn you a Haskell book:
quicksort :: [FS] -> [FS]
quicksort [] = []
quicksort (x:xs) =
let smallerSorted = quicksort [a | a <- xs, getName a <= getName x]
biggerSorted = quicksort [a | a <- xs, getName a > getName x]
in smallerSorted ++ [x] ++ biggerSorted
Note that I changed the comparisons to compare the getNames instead of the whole nodes.
Another thing: you are using GADT syntax to define your datatypes and I cannot see any reason for you to do so. Here is how I would write them instead using ordinary datatype declarations:
data Ext
= Txt
| Mp3
| Jar
| Doc
| Hs
deriving (Eq, Show)
data FS
= File Name Ext
| Dir Name [FS]
deriving (Eq, Show)
In my opinion, the most natural list-sorting function in Haskell is the bottom-up mergesort (Peter Amidon's answer gives a top-down mergesort). Suppose you're starting with the list
[25,1,22,2,10,8,6,20,13,28,5,3,11]
The first step is to turn the list into a list of lists. The simplest way is with map (:[]), which yields
[[25],[1],[22],[2],[10],[8],[6],[20],[13],[28],[5],[3],[11]]
Next, you merge lists pairwise:
[[1,25],[2,22],[8,10],[6,20],[13,28],[3,5],[11]]
Repeat!
[[1,2,22,25],[6,8,10,20],[3,5,13,28],[11]]
And again!
[[1,2,6,8,10,20,22,25],[3,5,11,13,28]]
And once more:
[[1,2,3,5,6,8,10,11,13,20,22,25,28]]
As we now have just one list, we extract it.
To implement this, I think it's best to use a custom type for the lists of lists in this case, to express the appropriate strictness:
infixr 5 :::
data LL a = ![a] ::: LL a | Nil
Implementation proceeds step by step. While breaking the list up into singletons is the simplest way to start off the process, it's a bit wasteful and will slow down access to the first few elements. So let's go by pairs there too:
breakUp :: (a -> a -> Ordering) -> [a] -> LL a
breakUp _cmp [] = Nil
breakUp _cmp xs#[a] = xs ::: Nil
breakUp cmp (x1 : x2 : xs)
| GT <- cmp x1 x2 = [x2,x1] ::: breakUp cmp xs
| otherwise = [x1,x2] ::: breakUp cmp xs
Now we need to write a merge function:
merge :: (a -> a -> Ordering)
-> [a] -> [a] -> [a]
merge _cmp [] ys = ys
merge _cmp xs [] = xs
merge cmp xss#(x:xs) yss#(y:ys)
| GT <- cmp x y = y : merge cmp xss ys
| otherwise = x : merge cmp xs yss
Next we need to be able to merge a list of lists pairwise:
mergePairwise :: (a -> a -> Ordering)
-> LL a -> LL a
mergePairwise _cmp Nil = Nil
mergePairwise _cmp xs#(_ ::: Nil) = xs
mergePairwise cmp (xs1 ::: xs2 ::: xss)
= merge cmp xs1 xs2 ::: mergePairwise cmp xss
Then we must merge up until done:
mergeAll :: (a -> a -> Ordering)
-> LL a -> [a]
mergeAll _cmp Nil = []
mergeAll _cmp (xs ::: Nil) = xs
mergeAll cmp xss = mergeAll cmp (mergePairwise cmp xss)
Now we're cooking with glass grass bass!
sortBy :: (a -> a -> Ordering) -> [a] -> [a]
sortBy cmp = mergeAll cmp . breakUp cmp
sort :: Ord a => [a] -> [a]
sort = sortBy compare
There's a bit of room to improve the above implementation by recognizing that the lists stored in the lists of lists are never empty. So we might get a small performance improvement using
data NonEmpty a = a :| [a]
data LL a = (:::) {-# UNPACK #-} !(NonEmpty a) (LL a) | Nil
I was doing a few of the 99 Haskell Problems earlier and I thought that exercise 27 ("write a function to enumerate the possible combinations") was interesting as it's a simple concept and it lends itself to multiple implementations.
I was curious about relative efficiency so I decided to run a couple of different implementations - results are in the table below. (For reference: Emacs bash ansi-term in LXDE (Ubuntu 14.04) running on VirtualBox; Thinkpad X220; 8gb RAM, i5 64bit 2.4ghz.)
TL;DR:
(i) Why are combination-generating techniques #7 and #8 (from the table below; code included at bottom of post) so much faster than the rest?
(ii) Also, what do the figures in the Bytes column actually represent?
(i) It's odd because function #7 works by filtering the powerset (which is waaaay larger than the combinations list); I suspect this is laziness at work, i.e., that this is the function which is most effectively exploiting the fact that we've only asked for the length of the list and not the list itself. (Also, its 'memory usage' is lower than that of the other functions, but, then again, I'm not sure exactly what memory-related stat is being shown.)
Regarding function #8: kudos to Bergi for that ridiculously fast implementation and thanks to user5402 for suggesting the addition. Still trying to wrap my ahead around the speed difference of this one.
(ii) The figures in the Bytes column are reported by GHCi after running the :set +s command; they clearly don't represent max memory usage as I only have ~25gb of RAM + free HD space.)?
Code:
import Data.List
--algorithms to generate combinations
--time required to compute the following: length $ 13 "abcdefghijklmnopqrstuvwxyz"
--(90.14 secs, 33598933424 bytes)
combDC1 :: (Eq a) => Int -> [a] -> [[a]]
combDC1 n xs = filter (/= []) $ combHelper n n xs []
combHelper :: Int -> Int -> [a] -> [a] -> [[a]]
combHelper n _ [] chosen = if length chosen == n
then [chosen]
else [[]]
combHelper n i remaining chosen
| length chosen == n = [chosen]
| n - length chosen > length remaining = [[]]
| otherwise = combHelper n (i-1) (tail remaining) ((head remaining):chosen) ++
combHelper n i (tail remaining) chosen
--(167.63 secs, 62756587760 bytes)
combSoln1 :: Int -> [a] -> [([a],[a])]
combSoln1 0 xs = [([],xs)]
combSoln1 n [] = []
combSoln1 n (x:xs) = ts ++ ds
where
ts = [ (x:ys,zs) | (ys,zs) <- combSoln1 (n-1) xs ]
ds = [ (ys,x:zs) | (ys,zs) <- combSoln1 n xs ]
--(71.40 secs, 30480652480 bytes)
combSoln2 :: Int -> [a] -> [[a]]
combSoln2 0 _ = [ [] ]
combSoln2 n xs = [ y:ys | y:xs' <- tails xs
, ys <- combSoln2 (n-1) xs']
--(83.75 secs, 46168207528 bytes)
combSoln3 :: Int -> [a] -> [[a]]
combSoln3 0 _ = return []
combSoln3 n xs = do
y:xs' <- tails xs
ys <- combSoln3 (n-1) xs'
return (y:ys)
--(92.34 secs, 40541644232 bytes)
combSoln4 :: Int -> [a] -> [[a]]
combSoln4 0 _ = [[]]
combSoln4 n xs = [ xs !! i : x | i <- [0..(length xs)-1]
, x <- combSoln4 (n-1) (drop (i+1) xs) ]
--(90.63 secs, 33058536696 bytes)
combSoln5 :: Int -> [a] -> [[a]]
combSoln5 _ [] = [[]]
combSoln5 0 _ = [[]]
combSoln5 k (x:xs) = x_start ++ others
where x_start = [ x : rest | rest <- combSoln5 (k-1) xs ]
others = if k <= length xs then combSoln5 k xs else []
--(61.74 secs, 33053297832 bytes)
combSoln6 :: Int -> [a] -> [[a]]
combSoln6 0 _ = [[]]
combSoln6 _ [] = []
combSoln6 n (x:xs) = (map (x:) (combSoln6 (n-1) xs)) ++ (combSoln6 n xs)
--(8.41 secs, 10785499208 bytes)
combSoln7 k ns = filter ((k==).length) (subsequences ns)
--(3.15 secs, 2889815872 bytes)
subsequencesOfSize :: Int -> [a] -> [[a]]
subsequencesOfSize n xs = let l = length xs
in if n>l then [] else subsequencesBySize xs !! (l-n)
where
subsequencesBySize [] = [[[]]]
subsequencesBySize (x:xs) = let next = subsequencesBySize xs
in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])
You should also test the algorithm found in this SO answer:
subsequences of length n from list performance
subsequencesOfSize :: Int -> [a] -> [[a]]
subsequencesOfSize n xs = let l = length xs
in if n>l then [] else subsequencesBySize xs !! (l-n)
where
subsequencesBySize [] = [[[]]]
subsequencesBySize (x:xs) = let next = subsequencesBySize xs
in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])
On my machine I get the following timing and memory usage from ghci:
ghci> length $ combSoln7 13 "abcdefghijklmnopqrstuvwxyz"
10400600
(13.42 secs, 10783921008 bytes)
ghci> length $ subsequencesOfSize 13 "abcdefghijklmnopqrstuvwxyz"
10400600
(6.52 secs, 2889807480 bytes)
fact :: (Integral a) => a -> a
fact n = product [1..n]
ncombs n k = -- to evaluate number of combinations
let n' = toInteger n
k' = toInteger k
in div (fact n') ((fact k') * (fact (n' - k')))
combinations :: Int -> [a] -> [[a]]
combinations 0 xs = [[]]
combinations 1 xs = [[x] | x <- xs]
combinations n xs =
let ps = reverse [0..n - 1]
inc (p:[])
| pn < length xs = pn:[]
| otherwise = p:[]
where pn = p + 1
inc (p:ps)
| pn < length xs = pn:ps
| (head psn) < length xs = inc ((head psn):psn)
| otherwise = (p:ps)
where pn = p + 1
psn = inc ps
amount = ncombs (length xs) n
pointers = take (fromInteger amount) (iterate inc ps)
c' xs ps = map (xs!!) (reverse ps)
in map (c' xs) pointers
I am learning Haskell and found a comparably fast implementation. I had a hard time with the type system with some functions requiring Ints and some fractional numbers and some Integers. On my computer the fastest solution presented here takes about 6,1 seconds to run and mine takes 3,5 to 2,9 seconds.
I have a simple toy example that seems to disagree with the garbage collector on what data structures can be reclaimed (aka memory leak). I am not trying to come up with more memory efficient versions of this algorithm (a good collection of better algorithms is here: Haskell Wiki - Prime numbers, rather an explanation why the garbage collector is not identifying the old, out of scope and unused portions of the list to reclaim that memory.
The code is here:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = (x,y) : erat' (c,b) xs
| c == x = (x+y,y) : erat' (c,True) xs
| c > x = (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c,c)]
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
In essence, calling primes with a positive integer will return a list of all prime numbers up to and including that number. A list of pairs of primes and their high water mark multiple is passed to erat', together with a pair including a candidate and a boolean (False for prime and True for non-prime). Every non-recursive call to erat' will pass a new list, and I would expect that the output would contain, at most, certain shared cells from the beginning of the list up to the point of the first change.
As soon as the modified cells in the list passed to erat' come out of scope, the memory should be flagged to be recovered, but as you can see when you try calling primes with a large enough number (1,000,000, for example), the memory utilization can quickly spike to tens of gigabytes.
Now, the question is: why is this happening? Shouldn't the generational garbage collector detect dereferenced list cells to reclaim them? And, shouldn't it be fairly easy for it to detect that they don't have references because:
a) nothing can have references from data structures older than itself;
b) there cannot be newer references because those cells/fragments are not even part of a referenceable data structure anymore, since it came out of scope?
Of course, a mutable data structure would take care of this, but I feel like resorting to mutability in a case like this is dropping some of the theoretical principles for Haskell on the floor.
Thanks to the people that commented (particularly Carl), I modified the algorithm slightly to add strictness (and the optimization of starting crossing the square of the new prime, since lower multiples will be crossed by multiples of lower primes too).
This is the new version:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = x `seq` (x,y) : erat' (c,b) xs
| c == x = x `seq` (x+y,y) : erat' (c,True) xs
| c > x = x `seq` (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c*c,c)] -- lower multiples would be covered by multiples of lower primes
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
The memory consumption seems to still be quite significant. Are there any other changes to this algorithm that could help reduce the total memory utilization?
Since Will pointed out that I didn't provide full statistics, these are the numbers for a run of the updated version of primes listed just above, with 100000 as the parameter:
And after applying the changes that Will proposed, the memory usage is now down considerably. See, for example, on a run of primes for 100000 again:
And last, this is the final code after the proposed changes were incorporated:
import Data.List (foldl')
erat'' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat'' (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else erat'' (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else erat'' (c,True) xs
| c > x = (x+y,y) : erat'' (c,b) xs
erat'' (c,True) [] = []
erat'' (c,False) [] = [(c*c,c)]
primes'' :: Integer -> [Integer]
primes'' n = map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat'' (c,False) a) [] [2..n]
And finally a run for 1,000,000 to have a feeling for performance in this new version:
Assumption a) is false in the presence of laziness. And in fact, your code consists almost entirely of generating cons cells pointed to by older cons cells. erat' consumes a list element, then produces a (:) constructor pointing to a tuple and an unevaluated thunk which will perform a recursive call to erat'. Only when that thunk is later evaluated will the (:) list constructor actually point to its tail as a data structure. So yes, nearly every (:) you allocate in erat' is in fact pointing forward in time. (The only exception is the last one - [foo] is going to point to the pre-existing [] constructor when its (:) constructor is allocated.)
Assumption b) is nonsense in the presence of laziness. Scope determines visibility in Haskell, not lifetime. Lifetime depends on evaluation and reachability.
So what happens at runtime is that you build up pipeline of erat' calls in erat. Each one of them holds on to as much of its input as has been evaluated, slowly consuming it. The interesting part is that your code doesn't evaluate anything in advance - it seems like it should actually stream pretty well - except for the fact that the pipeline is too deep. The pipeline created is approximately n stages - this is (inefficient!) trial division, not the sieve of Eratosthenes. You should only be adding prime numbers to the pipeline, not every number.
breaking update: You should use
map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat' (c,False) a) [] [2..n]
to force the list fully on each iteration. It will consume less memory and run faster.
The reason for the above is that foldl' only forces the accumulator to weak head normal form, and even using last a isn't enough, as it would be forced just to a pair (_,_), without forcing its constituents.
But when your erat' function is changed so that it stops scanning the interim list of primes and their multiples as soon as possible, and shares its tail whenever possible (as described below), it is faster without the forcing, even if using more memory.
Your (updated) code, edited a little for legibility:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : g (c,b) xs -- `c < x` forces the x already,
| c == x = (x+y,y) : g (c,True) xs -- no need for `seq`
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
primes :: Integer -> [Integer]
primes n = map snd $ foldl' (\a c -> g (c,False) a) [] [2..n]
So, your primes n is actually a little like a right fold on the reversed [2..n] list. Writing h for flip $ foldl' (\a c -> g (c,False) a), it is
= map snd $ h [2..n] $ []
= map snd $ h [3..n] $ [(2*2,2)]
= map snd $ h [4..n] $ (4,2) :(g (3,False) [])
= map snd $ h [5..n] $ (4+2,2):(g (4,True ) $ g (3,False) [])
= map snd $ h [6..n] $ (6,2) :(g (5,False) $ g (4,True ) $ g (3,False) [])
....
The strictness of foldl' has limited effect here as the accumulator is forced only to the weak head normal form.
Folding with (\a c -> last a `seq` g (c,False) a) would give us
= map snd $ ... $ g (3,False) [(2*2,2)]
= map snd $ ... $ g (4,False) [(4,2),(3*3,3)]
= map snd $ ... $ g (5,False) [(4+2,2),(9,3)]
= map snd $ ... $ g (6,False) [(6,2),(9,3),(5*5,5)]
= map snd $ ... $ g (7,False) [(6+2,2),(9,3),(25,5)]
= map snd $ ... $ g (8,False) [(8,2),(9,3),(25,5),(7*7,7)]
= map snd $ ... $ g (9,False) [(8+2,2),(9,3),(25,5),(49,7)]
= map snd $ ... $ g (10,False) [(10,2),(9+3,3),(25,5),(49,7)]
= map snd $ ... $ g (11,False) [(10+2,2),(12,3),(25,5),(49,7)]
....
= map snd $ ... $ g (49,False)
[(48+2,2),(48+3,3),(50,5),(49,7),(121,11)...(2209,47)]
....
but all these changes will be pushed through to the list by the final print anyway, so the laziness is not the immediate problem here (it causes stack overflow for bigger inputs, but that's secondary here). The problem is that your erat' (renamed g above) eventually pushes each entry through the whole list needlessly, recreating the whole list for each candidate number. This is a very heavy memory usage pattern.
It should stop as early as possible, and share the list's tail whenever possible:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else g (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else g (c,True) xs
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
Compiled with -O2 and run standalone, it runs under ~ N1.9 vs your original function's ~ N2.4..2.8..and rising, producing primes up to N.
(of course a "normal" sieve of Eratosthenes should run at about ~ N1.1, ideally, its theoretical time complexity being N log (log N)).
I am relatively new to Haskell, but I am trying to learn both by reading and trying to solve problems on Project Euler. I am currently trying to implement a function that takes an infinite list of integers and returns the ordered list of pairwise sums of elements in said list. I am really looking for solutions to the specific issue I am facing, rather than advice on different strategies or approaches, but those are welcome as well, as being a coder doesn't mean knowing how to implement a strategy, but also choosing the best strategy available.
My approach relies on traversing an infinite list of infinite generators and retrieving elements in order, with several mathematical properties that are useful in implementing my solution.
If I were trying to obtain the sequence of pairwise sums of the natural numbers, for example, this would be my code:
myList :: [Integer]
myList = [1..]
myGens :: [[Integer]]
myGens = gens myList
where
gens = \xs -> map (\x -> [x+y|y<-(dropWhile (<x) xs)]) xs
Regardless of the number set used, provided that it is sorted, the following conditions hold:
∀ i ≥ 0, head (gens xs !! i) == 2*(myList !! i)
∀ i,j,k ≥ 0, l > 0, (((gens xs) !! i) !! j) < (((gens xs) !! i+k) !! j+l)
Special cases for the second condition are:
∀ i,j ≥ 0, (((gens xs) !! i) !! j) < (((gens xs) !! i+1) !! j)
∀ i,j ≥ 0, k > 0, (((gens xs) !! i) !! j) < (((gens xs) !! i+k) !! j)
Here is the particular code I am trying to modify:
stride :: [Integer] -> [Int] -> [[Integer]] -> [Integer]
stride xs cs xss = x : stride xs counts streams
where
(x,i) = step xs cs xss
counts = inc i cs
streams = chop i xss
step :: [Integer] -> [Int] -> [[Integer]] -> (Integer,Int)
step xs cs xss = pace xs (defer cs xss)
pace :: [Integer] -> [(Integer,Int)] -> (Integer,Int)
pace hs xs#((x,i):xt) = minim (x,i) hs xt
where
minim :: (Integer,Int) -> [Integer] -> [(Integer,Int)] -> (Integer,Int)
minim m _ [] = m
minim m#(g,i) hs (y#(h,n):ynt) | g > h && 2*(hs !! n) > h = y
| g > h = minim y hs ynt
| 2*(hs !! n) > g = m
| otherwise = minim m hs ynt
defer :: [Int] -> [[a]] -> [(a,Int)]
defer cs xss = (infer (zip cs (zip (map head xss) [0..])))
infer :: [(Int,(a,Int))] -> [(a,Int)]
infer [] = []
infer ((c,xi):xis) | c == 0 = xi:[]
| otherwise = xi:(infer (dropWhile (\(p,(q,r)) -> p>=c) xis))
The set in question I am using has the property that multiple distinct pairs produce an identical sum. I want an efficient method of handling all duplicate elements at once, in order to avoid an increased cost of computing all the pairwise sums up to N, as it requires M more tests if M is the number of duplicates.
Does anyone have any suggestions?
EDIT:
I made some changes to the code, independently of what was suggested, and would appreciate feedback on the relative efficiencies of my original code, my revised code, and the proposals so far.
stride :: [Integer] -> [Int] -> [[Integer]] -> [Integer]
stride xs cs xss = x : stride xs counts streams
where
(x,is) = step xs cs xss
counts = foldr (\i -> inc i) cs is
streams = foldr (\i -> chop i) xss is
step :: [Integer] -> [Int] -> [[Integer]] -> (Integer,[Int])
step xs cs xss = pace xs (defer cs xss)
pace :: [Integer] -> [(Integer,Int)] -> (Integer,[Int])
pace hs xs#((x,i):xt) = minim (x,(i:[])) hs xt
where
minim :: (Integer,[Int]) -> [Integer] -> [(Integer,Int)] -> (Integer,[Int])
minim m _ [] = m
minim m#(g,is#(i:_)) hs (y#(h,n):ynt) | g > h && 2*(hs !! n) > h = (h,[n])
| g > h = minim (h,[n]) hs ynt
| g == h && 2*(hs !! n) > h = (g,n:is)
| g == h = minim (g,n:is) hs ynt
| g < h && 2*(hs !! n) > g = m
| g < h = minim m hs ynt
Also, I left out the code for inc and chop:
alter :: (a->a) -> Int -> [a] -> [a]
alter = \f -> \n -> \xs -> (take (n) xs) ++ [f (xs !! n)] ++ (drop (n+1) xs)
inc :: Int -> [Int] -> [Int]
inc = alter (1+)
chop :: Int -> [[a]] -> [[a]]
chop = alter (tail)
I'm going to present a solution that uses an infinite pairing heap. We'll have logarithmic overhead per element constructed, but no one knows how to do better (in a model with comparison-based methods and real numbers).
The first bit of code is just the standard pairing heap.
module Queue where
import Data.Maybe (fromMaybe)
data Queue k = E
| T k [Queue k]
deriving Show
fromOrderedList :: (Ord k) => [k] -> Queue k
fromOrderedList [] = E
fromOrderedList [k] = T k []
fromOrderedList (k1 : ks'#(k2 : _ks''))
| k1 <= k2 = T k1 [fromOrderedList ks']
mergePairs :: (Ord k) => [Queue k] -> Queue k
mergePairs [] = E
mergePairs [q] = q
mergePairs (q1 : q2 : qs'') = merge (merge q1 q2) (mergePairs qs'')
merge :: (Ord k) => Queue k -> Queue k -> Queue k
merge (E) q2 = q2
merge q1 (E) = q1
merge q1#(T k1 q1's) q2#(T k2 q2's)
= if k1 <= k2 then T k1 (q2 : q1's) else T k2 (q1 : q2's)
deleteMin :: (Ord k) => Queue k -> Maybe (k, Queue k)
deleteMin (E) = Nothing
deleteMin (T k q's) = Just (k, mergePairs q's)
toOrderedList :: (Ord k) => Queue k -> [k]
toOrderedList q
= fromMaybe [] $
do (k, q') <- deleteMin q
return (k : toOrderedList q')
Note that fromOrderedList accepts infinite lists. I think that this can be justified theoretically by pretending as though the infinite list of descendants effectively are merged "just in time". This feels like the kind of thing that should be in the literature on purely functional data structures already, but I'm going to be lazy and not look right now.
The function mergeOrderedByMin takes this one step further and merges a potentially infinite list of queues, where the min element in each queue is nondecreasing. I don't think that we can reuse merge, since merge appears to be insufficiently lazy.
mergeOrderedByMin :: (Ord k) => [Queue k] -> Queue k
mergeOrderedByMin [] = E
mergeOrderedByMin (E : qs') = mergeOrderedByMin qs'
mergeOrderedByMin (T k q's : qs')
= T k (mergeOrderedByMin qs' : q's)
The next function removes duplicates from a sorted list. It's in the library that m09 suggested, but for the sake of completeness, I'll define it here.
nubOrderedList :: (Ord k) => [k] -> [k]
nubOrderedList [] = []
nubOrderedList [k] = [k]
nubOrderedList (k1 : ks'#(k2 : _ks''))
| k1 < k2 = k1 : nubOrderedList ks'
| k1 == k2 = nubOrderedList ks'
Finally, we put it all together. I'll use the squares as an example.
squares :: [Integer]
squares = map (^ 2) [0 ..]
sumsOfTwoSquares :: [Integer]
sumsOfTwoSquares
= nubOrderedList $ toOrderedList $
mergeOrderedByMin
[fromOrderedList (map (s +) squares) | s <- squares]
If you don't want to modify your code that much, you can use the nub function of Data.List.Ordered (installable by cabal install data-ordlist) to filter duplicates out.
It runs in linear time, ie complexity wise your algorithm won't change.
for your example [1..] the result is just [2..]. A "very smart compiler" could deduce this from the general solution with implicit heap, that follows.
gens xs is better expressed as
gens xs = map (\t#(x:_) -> map (x+) t) $ tails xs -- or should it be
-- map (\(x:ys) -> map (x+) ys) $ tails xs -- ?
Its resulting list of lists is easily merged without duplicates by tree-like folding1 (pictured here), with
pairsums xs = foldi (\(x:l) r-> x : union l r) $ gens xs
This assumes the input list is ordered in increasing order. If it's merely in non-decreasing order (with only finite runs of equals in it, of course), you'll need to slap an orderedNub on top of that (as m09 mentions),
pairsums' = orderedNub . pairsums
Just by using foldi where foldr would work, we often get an algorithmic improvement in complexity from a factor of n to log n, a pretty significant speedup. I use it as a general tool all the time.
1The code, adjusted for infinite lists only:
foldi f (x:xs) = f x (foldi f (pairs f xs))
pairs f (x:y:t) = f x y : pairs f t
union (x:xs) (y:ys) = case compare x y of
LT -> x : union xs (y:ys)
EQ -> x : union xs ys
GT -> y : union (x:xs) ys
See also:
mergesort as foldtree (by Heinrich Apfelmus)
infinite tree folding (by Dave Bayer)
Implicit Heap (by apfelmus)
I propose to build the pairs above the diagonal, that way a lot of duplicates are not even generated:
sums xs = zipWith (map . (+)) hs ts where
(hs:ts) = tails xs
Now you have a list of lists, each containing sorted sums. Because they are sorted, it is possible to determine the next element of the sequence in a finite number of steps:
filtermerge :: (Ord a) => [[a]]->[a]
filtermerge ((h:t):ts) = h : filtermerge (insert t ts) where
insert [] ts = ts
insert xs [] = [xs]
insert h ([]:t) = insert h t
insert (h:t) ts#((h1:t1):t2)
| h < h1 = (h:t):ts
| h == h1 = insert (h:t) $ insert t1 t2
| otherwise = insert (h1:t1) $ insert (h:t) t2
filtermerge _ = []
I'm beginning to try and get my head round haskell performance, and what makes things fast and slow, and I'm a little confused by this.
I have two implementations of a function that generates a list of primes up to a certain value. The first is straight off the Haskell wiki:
primesTo :: (Ord a, Num a, Enum a) => a -> [a]
primesTo m = eratos [2..m] where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])
The second is the same, but using an infinite list internally:
primes2 :: (Ord a, Num a, Enum a) => a -> [a]
primes2 m = takeWhile (<= m) (eratos [2..]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..])
In both cases, the minus function is:
minus :: (Ord a) => [a] -> [a] -> [a]
minus (x:xs) (y:ys) = case (compare x y) of
LT -> x : minus xs (y:ys)
EQ -> minus xs ys
GT -> minus (x:xs) ys
minus xs _ = xs
The latter implementation is significantly (~100x) slower than the former, and I don't get why. I would have thought that haskell's lazy evalutation would make them fairly equivalent under the hood.
This is obviously a reduced test case for the purposes of the question - in real life the optimisation would be no problem (although I don't understand why it is needed), but to me a function that just generates an infinite list of primes is more generically useful than a finite list, but appears slower to work with.
Looks like to me that there's a big difference between
(xs `minus` [p*p, p*p+p..m]) -- primesTo
(xs `minus` [p*p, p*p+p..]) -- primes2
The function minus steps through lists pairwise and terminates when one list reaches the end. In the first minus expression above, this occurs in no more than (m-p*p)/p steps when the latter list is exhausted. In the second one, it will always take steps on the order of length xs.
So your infinite lists have disabled at least one meaningful optimization.
One difference is that in the second case you need to generate one extra prime. You need to generate the first prime greater than m before takeWhile knows its time to stop.
Additionally, the [..m] bounds on both the list to filter and the lists of multiples help reduce the number of calculations. Whenever one of these lists gets empty minus immediately returns via its secons clause while in the infinite case the minus gets stuck in the first case. You can explore this a bit better if you also test the cases where only one of the lists is infinite:
--this is also slow
primes3 :: (Ord a, Num a, Enum a) => a -> [a]
primes3 m = takeWhile (<= m) (eratos [2..m]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..])
--this fast
primes4 :: (Ord a, Num a, Enum a) => a -> [a]
primes4 m = takeWhile (<= m) (eratos [2..]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])