is this implementation of merge sort good? - sorting

I just started learning Haskell last night and I've never used a functional programming language before.
I just want to know whether my implementation of merge sort is good or bad, and what exactly is good or bad about it.
Maybe it's even wrong - well, it does sort, but maybe the algorithm is not what I think merge sort is.
Just tell me everything I could improve here. I myself think it's a pretty clear and simple implementation.
Thanks for your advice, here's the code :)
merge [] ys = ys
merge xs [] = xs
merge xs ys = sorted : merge left right
  where
    sorted = if head(xs) < head(ys) then head(xs) else head(ys)
    left = if head(xs) <= head(ys) then tail(xs) else xs
    right = if head(xs) > head(ys) then tail(ys) else ys

msort [] = []
msort [x] = [x]
msort xs = merge (msort left) (msort right)
  where
    left = take (div (length xs) 2) xs
    right = drop (div (length xs) 2) xs

Well, first of all, we can rewrite merge to be a little more elegant using pattern matching
merge [] ys = ys
merge xs [] = xs
merge xs@(x:xs1) ys@(y:ys1)
  | x <= y = x : merge xs1 ys
  | otherwise = y : merge xs ys1
In general you should avoid using head and tail since they are a bit unsafe (they raise an error for the empty list) and use pattern matching whenever possible.
The implementation of msort is pretty much spot on, except that we can split the list in a more efficient way. That's because length xs takes O(N) to complete. The compiler might save you and cache the result of the length call so that the second call to length won't traverse the list again, but take and drop will still cause another two traversals, so splitting the list costs three traversals in total, which may prove expensive. We can do better by splitting the list into two lists in a single pass - the first containing the elements at the odd positions and the second the elements at the even positions, like so:
msort [] = []
msort [x] = [x]
msort xs = merge (msort first) (msort second)
  where
    (first, second) = splitInHalves xs

splitInHalves [] = ([], [])
splitInHalves [x] = ([x], [])
splitInHalves (x:y:xs) =
  let (xs1, ys1) = splitInHalves xs
  in (x:xs1, y:ys1)
This gets you the same merge sort in O(N log N) time. It feels different because you would probably implement it in place (by modifying the original list) in an imperative language such as C. This version is slightly more costly in memory, but it has its advantages - it is easier to reason about, so it is more maintainable, and it is also very easy to parallelize without being concerned with anything except the algorithm itself - which is exactly what a good programming language should provide for the developers that use it.
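For instance, here is a minimal sketch of my own (not part of the original answer) of what that parallelism could look like, using par and pseq from the parallel package's Control.Parallel and reusing the merge and splitInHalves defined above. Note that par only sparks evaluation to weak head normal form, so for plain lists this illustrates the idea more than it guarantees a speedup:
import Control.Parallel (par, pseq)

-- hypothetical parallel variant: spark the left half while forcing the right
msortPar :: Ord a => [a] -> [a]
msortPar []  = []
msortPar [x] = [x]
msortPar xs  = left `par` (right `pseq` merge left right)
  where
    (first, second) = splitInHalves xs
    left  = msortPar first
    right = msortPar second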
EDIT 1 :
If the syntax is a bit much, here are some resources:
Pattern Matching - the bit with the @ symbol is called an as-pattern. You'll find it in there.
let is a keyword used to declare a variable to be used in the expression that follows it (whereas where binds a variable in the expression that precedes it). More on Haskell syntax, including guards (the things with | condition = value), can be found in this chapter of Learn You a Haskell.
EDIT 2 :
@is7s proposed a far more concise version of splitInHalves using the foldr function:
splitInHalves = foldr (\x (l,r) -> (x:r,l)) ([],[])
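A quick sanity check of my own at the prompt - swapping the two accumulated halves at each step is what sends consecutive elements to alternating sides:
ghci> splitInHalves [1,2,3,4,5]
([1,3,5],[2,4])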
EDIT 3 :
Here is another answer which provides an alternative implementation of merge sort, which also has the property of being stable:
Lazy Evaluation and Time Complexity
Hope this helps and welcome to the wonderful world of Functional Programming !

Related

Optimize "list" indexing in Haskell

Say you have a very deterministic algorithm that produces a list, like inits in Data.List. Is there any way that a Haskell compiler can optimally perform an "indexing" operation on this algorithm without actually generating all the intermediate results?
For example, inits [1..] !! 10000 is pretty slow. Could a compiler somehow deduce what inits would produce on the 10000th element without any recursion, etc? Of course, this same idea could be generalized beyond lists.
Edit: While inits [1..] !! 10000 is a constant expression, I am wondering about any "index-like" operation on some algorithm. For example, could \i -> inits [1..] !! i be optimized such that no (or minimal) recursion is performed to reach the result for any i?
Yes and no. If you look at the definition for Data.List.inits:
inits :: [a] -> [[a]]
inits xs = [] : case xs of
  []      -> []
  x : xs' -> map (x :) (inits xs')
you'll see that it's defined recursively. That means that each element of the resulting list is built on the previous element of the list. So if you want any nth element, you have to build all n-1 previous elements.
Now you could define a new function
inits' xs = [] : [take n xs | (n, _) <- zip [1..] xs]
which has the same behavior. If you try to take inits' [1..] !! 10000, it finishes very quickly because the successive elements of the list do not depend on the previous ones. Of course, if you were actually trying to generate a list of inits instead of just a single element, this would be much slower.
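As a quick spot check of my own: element 10000 of inits' [1..] is simply take 10000 [1..], so, for example, asking for its length returns immediately:
*Main> length (inits' [1..] !! 10000)
10000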
The compiler would have to know a lot of information to be able to optimize away recursion from a function like inits. That said, if a function really is "very deterministic", it should be trivial to rewrite it in a non-recursive way.

Recursion confusion in Haskell again - subsets with an inclusion test

I'm testing a simple program to generate subsets with an inclusion test. For example, given
*Main Data.List> factorsets 7
[([2],2),([2,3],1),([3],1),([5],1),([7],1)]
calling chooseP 3 (factorsets 7), I would like to get (read from right to left, a la cons)
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([2],2)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
But my program is returning an extra [([7],1),([5],1),([3],1)] (and missing a [([7],1),([5],1),([2],2)]):
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([3],1)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
The inclusion test is: the first parts of the members' tuples must have pairwise null intersections.
Once tested as working, the plan is to sum the internal products of each subset's snds, rather than accumulate them.
Since I've asked a similar question before, I imagine that an extra branch is generated because, when the recursion splits at [2,3], the second branch runs over the same possibilities once it passes the skipped section. Any pointers on how to resolve that would be appreciated; and if you'd like to share ideas about how to enumerate and sum such product combinations more efficiently, that would be great, too.
Haskell code:
chooseP k xs = chooseP' xs [] 0 where
  chooseP' [] product count = if count == k then [product] else []
  chooseP' yys product count
    | count == k = [product]
    | null yys = []
    | otherwise = f ++ g
    where (y:ys) = yys
          (factorsY,numY) = y
          f = let zzs = dropWhile (\(fs,ns) -> not . and . map (null . intersect fs . fst) $ product) yys
              in if null zzs
                   then chooseP' [] product count
                   else let (z:zs) = zzs in chooseP' zs (z:product) (count + 1)
          g = if and . map (null . intersect factorsY . fst) $ product
                then chooseP' ys product count
                else chooseP' ys [] 0
Your code is complicated enough that I might recommend starting over. Here's how I would proceed.
Write a specification. Let it be as stupidly inefficient as necessary -- for example, the spec I choose below will build all combinations of k elements from the list, then filter out the bad ones. Even the filter will be stupidly slow.
import Control.Monad (liftM2, replicateM)
import Data.List (intersect, nub, sort)

sorted xs = sort xs == xs
unique xs = nub xs == xs
disjoint xs = and $ liftM2 go xs xs where
  go x1 x2 = x1 == x2 || null (intersect x1 x2)

-- check that x is valid according to all the validation functions in fs
-- (there are other fun ways to spell this, but this is particularly
-- readable and clearly correct -- just what we want from a spec)
allFuns fs x = all ($x) fs

choosePSpec k = filter good . replicateM k where
  good pairs = allFuns [unique, disjoint, sorted] (map fst pairs)
Just to make sure it's right, we can test it at the prompt:
*Main> mapM_ print $ choosePSpec 3 [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
[([2],2),([3],1),([5],1)]
[([2],2),([3],1),([7],1)]
[([2],2),([5],1),([7],1)]
[([2,3],1),([5],1),([7],1)]
[([3],1),([5],1),([7],1)]
Looks good.
Now that we have a spec, we can try to improve the speed one refactoring at a time, always checking that it matches the spec. The first thing I'd want to do is notice that we can ensure uniqueness and sortedness just by sorting the input and picking things "in an increasing way". To do this, we can define a function which chooses subsequences of a given length. It piggy-backs on the tails function, which you can think of as nondeterministically choosing a place to split its input list.
subseq 0 xs = [[]]
subseq n xs = do
  x':xt <- tails xs
  xs' <- subseq (n-1) xt
  return (x':xs')
Here's an example of this function in action:
*Main> subseq 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
Now we can write a slightly faster chooseP by replacing replicateM with subseq. Recall that we're assuming the inputs are already sorted and unique, though.
choosePSlow k = filter good . subseq k where
  good pairs = disjoint $ map fst pairs
We can sanity-check that it's working by running it on the particular input we have from above:
*Main> let i = [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
*Main> choosePSlow 3 i == choosePSpec 3 i
True
Or, better yet, we can stress-test it with QuickCheck. We'll need a tiny bit more code. The condition k < 5 is just because the spec is so hopelessly slow that bigger values of k take forever.
import Test.QuickCheck

propSlowMatchesSpec :: NonNegative Int -> OrderedList ([Int], Int) -> Property
propSlowMatchesSpec (NonNegative k) (Ordered xs)
  = k < 5 && unique (map fst xs)
  ==> choosePSlow k xs == choosePSpec k xs
*Main> quickCheck propSlowMatchesSpec
+++ OK, passed 100 tests.
There are several more opportunities to make things faster. For instance, the disjoint test could be sped up using choose 2 instead of liftM2; or we might be able to ensure disjointness during element selection and prune the search even earlier; etc. How you want to improve it from here I leave to you -- but the basic technique (start with stupid and slow, then make it smarter, testing as you go) should be helpful to you.
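As a hedged sketch of that last idea (my own, only spot-checked against the spec on the sample input above), the disjointness test can be pushed into the selection itself, so a branch is abandoned as soon as a newly picked element overlaps something already chosen; choosePPrune and go are names I made up:
import Control.Monad (guard)
import Data.List (intersect, tails)

choosePPrune :: Eq a => Int -> [([a], b)] -> [[([a], b)]]
choosePPrune k = go k []
  where
    go 0 _    _  = [[]]
    go n seen xs = do
      (y:ys) <- tails xs                           -- pick a split point, as in subseq
      guard (all (null . intersect (fst y)) seen)  -- prune overlapping choices immediately
      rest <- go (n - 1) (fst y : seen) ys
      return (y : rest)
I would still run it through the QuickCheck property above before trusting it.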

Finding unique (as in only occurring once) element haskell

I need a function which takes a list and returns a unique element if one exists, or [] if it doesn't. If many unique elements exist, it should return the first one (without wasting time finding the others).
Additionally I know that all elements in the list come from (small and known) set A.
For example this function does the job for Ints:
unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
  where first [] = []
        first (x:xs) = x
ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
[1]
This is, however, not good enough, because it involves sorting (O(n log n)) while it could be done in linear time (because A is small).
Additionally, it requires the type of the list elements to be Ord, while all that should be needed is Eq. It would also be nice if the number of comparisons were as small as possible (i.e., if we traverse a list and encounter an element el twice, we don't test subsequent elements for equality with el).
This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.
The question is: how to do it correctly and efficiently in Haskell ?
Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.
import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative
Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).
counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
  where
    insert t x = IntTrie.modify' (fromEnum x) (+1) t
We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).
And then look for ones that are unique.
uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
  where
    cts = counts xs
This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.
uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).
There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.
Also, note that to avoid useless comparisons you'd need a way of checking whether an element has been encountered before, and the only way to do that would be to keep a list of elements known to have multiple occurrences; and the only way to check whether the current element is in that list is... to compare it for equality with each one.
If you want this to work faster than O(something really horrible) you need that Ord constraint.
Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:
unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
  | x `notElem` candidates = unique candidates results xs
  | x `elem` results = unique (delete x candidates) (delete x results) xs
  | otherwise = unique candidates (x:results) xs
The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.
If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.
Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.
Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.
First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.
Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.
If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:
firstUnique (x:xs)
  | elem x xs = firstUnique (filter (/= x) xs)
  | otherwise = Just x
firstUnique [] = Nothing
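A couple of quick checks of my own at the prompt:
ghci> firstUnique [3,5,3,1,5]
Just 1
ghci> firstUnique [3,5,3,5]
Nothing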
Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.
Edit:
The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...
However, an implementation that provides an early out in case of set exhaustion:
firstUnique = f [] [<small/known set of possible elements>] where
  f [] [] _ = Nothing -- early out
  f uniques noshows (x:xs)
    | elem x uniques = f (delete x uniques) noshows xs
    | elem x noshows = f (x:uniques) (delete x noshows) xs
    | otherwise = f uniques noshows xs
  f [] _ [] = Nothing
  f (u:_) _ [] = Just u
Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...
As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.
If we are able to compare elements, an obvious O(n log n) solution is to compute the count of each element first and then find the first one with a count equal to 1:
import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)
count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m
add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m
uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
  where
    cs = foldl' add Map.empty xs
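A quick check of my own with the list from the question:
ghci> uniq [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
Just 1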
Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k distinct elements, then the size of our map will be at most k, so the overall complexity will be just O(n log k).
However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:
import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)
count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)
add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)
uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
  -- Count all elements into a hash table:
  ht <- HT.newSized (length xs)
  forM_ xs (add ht)
  -- Find the first one with count 1
  first (\x -> liftM (== 1) (count ht x)) xs
-- Monadic variant of find which exits once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
  where
    f [] = return Nothing
    f (x:xs') = do
      b <- p x
      if b then return (Just x)
           else f xs'
Notes:
If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs, but in the case of many distinct elements the hash table will have to be resized several times.
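As a spot check of my own (assuming the hashable and hashtables packages are installed), this version gives the same answer on the question's sample list as the Map version above:
ghci> uniq [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
Just 1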
Here is a version that does the trick:
unique :: Eq a => [a] -> [a]
unique = select . collect []
  where
    collect acc [] = acc
    collect acc (x : xs) = collect (insert x acc) xs

    insert x [] = [[x]]
    insert x (ys@(y : _) : yss)
      | x == y = (x : ys) : yss
      | otherwise = ys : insert x yss

    select [] = []
    select ([x] : _) = [x]
    select ((_ : _) : xss) = select xss
So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).
The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.
Something like this looks pretty good.
unique = fst . foldl' (\(a, b) c -> if (c `elem` b)
                                    then (a, b)
                                    else if (c `elem` a)
                                         then (delete c a, c:b)
                                         else (c:a, b)) ([],[])
The first element of the resulting tuple of the fold contains what you are expecting: a list of the unique elements. The second element of the tuple is the memory of the process, remembering whether an element has already been discarded or not.
About space performance.
As the problem is designed, every element of the list has to be traversed at least once before a result can be displayed. The internal algorithm must keep track of discarded values in addition to the good ones, but each discarded value appears only once, so in the worst case the required amount of memory is equal to the size of the input list. That sounds fine, since you said the expected inputs are small.
About time performance.
As the expected inputs are small and not sorted by default, trying to sort the list inside the algorithm, or before applying it, is useless. In fact, statistically we can almost say that the extra work of placing an element at its ordered position (in the sublists a and b of the tuple (a,b)) would cost about the same as checking whether the element appears in the list or not.
Below is a nicer and more explicit version of the foldl' one.
import Data.List (foldl', delete, elem)

unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], [])
  where
    algorithm (result0, memory0) current =
      if (current `elem` memory0)
        then (result0, memory0)
        else if (current `elem` result0)
               then (delete current result0, memory)
               else (result, memory0)
      where
        result = current : result0
        memory = current : memory0
In the nested if ... then ... else ... the list result is traversed twice in the worst case; this can be avoided using the following helper function.
unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], [])
  where
    algorithm (result, memory) current =
      if (current `elem` memory)
        then (result, memory)
        else helper current result memory []
      where
        helper current [] [] acc = ([current], [])
        helper current [] memory acc = (acc, memory)
        helper current (r:rs) memory acc
          | current == r = (acc ++ rs, current:memory)
          | otherwise = helper current rs memory (r:acc)
But the helper can be rewritten using a fold, as follows, which is definitely nicer.
helper current [] _ = ([current],[])
helper current memory result =
  foldl' (\(r, m) x -> if x == current
                       then (r, current:m)
                       else (current:r, m)) ([], memory) $ result

Breaking lists at index

I have a performance question today.
I am making a (Haskell) program and, when profiling, I saw that most of the time is spent in the function you can find below. Its purpose is to take the nth element of a list and return that element together with the list without it. My current (slow) definition is as follows:
breakOn :: Int -> [a] -> (a,[a])
breakOn 1 (x:xs) = (x,xs)
breakOn n (x:xs) = (y,x:ys)
  where
    (y,ys) = breakOn (n-1) xs
The Int argument is known to be in the range 1..n, where n is the length of the (never null) list (x:xs), so the function never raises an error.
However, I got poor performance here. My first guess is that I should change lists for another structure. But, before I start picking different structures and testing code (which will take me a lot of time), I wanted to ask here for a third-person opinion. Also, I'm pretty sure that I'm not doing it in the best way. Any pointers are welcome!
Please, note that the type a may not be an instance of Eq.
Solution
I adapted my code to use Sequences from the Data.Sequence module. The result is here:
import Data.Monoid ((<>))
import Data.Sequence (Seq)
import qualified Data.Sequence as S

breakOn :: Int -> Seq a -> (a,Seq a)
breakOn n xs = (S.index zs 0, ys <> (S.drop 1 zs))
  where
    (ys,zs) = S.splitAt (n-1) xs
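A quick check of my own that the Sequence version behaves like the list version:
ghci> breakOn 3 (S.fromList [10,20,30,40])
(30,fromList [10,20,40])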
However, I still accept further suggestions of improvement!
Yes, this is inefficient. You can do a bit better by using splitAt (which unboxes the number during the recursive bit), a lot better by using a data structure with efficient splitting, e.g. a fingertree, and best by massaging the context to avoid needing this operation. If you post a bit more context, it may be possible to give more targeted advice.
Prelude functions are generally pretty efficient. You could rewrite your function using splitAt, like so:
breakOn :: Int -> [a] -> (a,[a])
breakOn n xs = (z,ys++zs)
  where
    (ys,z:zs) = splitAt (n-1) xs
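A quick check of my own that this agrees with the original definition:
ghci> breakOn 3 [10,20,30,40]
(30,[10,20,40])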

In Haskell, how can you sort a list of infinite lists of strings?

So basically, if I have a (finite or infinite) list of (finite or infinite) lists of strings, is it possible to sort the list by length first and then by lexicographic order, excluding duplicates? A sample input/output would be:
Input:
[["a", "b",...], ["a", "aa", "aaa"], ["b", "bb", "bbb",...], ...]
Output:
["a", "b", "aa", "bb", "aaa", "bbb", ...]
I know that the input list is not a valid Haskell expression, but suppose that there is an input like that. I tried using a merge algorithm, but it tends to hang on the inputs that I give it. Can somebody explain and show a decent sorting function that can do this? If there isn't any such function, can you explain why?
In case somebody didn't understand what I meant by the sorting order: the shortest strings come first, AND if two or more strings have the same length then they are sorted using the < operator.
Thanks!
Ultimately, you can't sort an infinite list, because items at the tail of the list could percolate all the way to the front of the result; you can't finish sorting an infinite list until you've seen the last item, but your list is infinite, so you'll never get there.
The only way you could even try to sort an infinite list would require constraints on the inhabitants of the list. If the values of the list items come from a well-founded set and the contents of the list are unique, then you can at least make some progress in returning the initial elements of the result. For instance, if the list consisted of distinct natural numbers, you could return the first 0 you see, then the first 1, etc., but you couldn't make any headway in the result until you saw 2, no matter how far down the list you went. Ultimately, if you ever skipped an element of the set because it wasn't present in the source, you'd cease to produce new output elements until you had the entire input in hand.
You can do the same thing with strings, because they are well-founded, but that is only even remotely viable if you plan on returning all possible strings.
In short, if you need this, you're going about solving the problem you have in the wrong way. This isn't a tractable path to any solution you will want to use.
As yairchu noted, merging a finite number of sorted infinite lists works fine though.
In general it is impossible to sort infinite lists, because the smallest item could be at an infinite position, and we must find it before we can output it.
Merging infinite sorted lists is possible.
In general, merging an infinite list of sorted lists is impossible, for the same reason that sorting them is.
Merging an infinite list of sorted lists, which are sorted by heads (forall i j. i < j => head (lists !! i) <= head (lists !! j)), is possible.
So I'm guessing that what you really want is to merge a sorted infinite list of sorted lists. It's even a task that makes some sense. There's even some existing code that uses it, implemented for monadic lists elsewhere - kinda ugly syntax-wise, etc. So here's a version for plain lists:
mergeOnSortedHeads :: Ord b => (a -> b) -> [[a]] -> [a]
mergeOnSortedHeads _ [] = []
mergeOnSortedHeads f ([]:xs) = mergeOnSortedHeads f xs
mergeOnSortedHeads f ((x:xs):ys) =
  x : mergeOnSortedHeads f (bury xs ys)
  where
    bury [] ks = ks
    bury js [] = [js]
    bury js ([]:ks) = bury js ks
    bury jj@(j:js) ll@(kk@(k:ks):ls)
      | f j <= f k = jj : ll
      | otherwise = kk : bury jj ls
ghci> take 20 $ mergeOnSortedHeads id $ [[0,4,6], [2,3,9], [3,5..], [8]] ++ map repeat [12..]
[0,2,3,3,4,5,6,7,8,9,9,11,12,12,12,12,12,12,12,12]
btw: what do you need this for?
Well, I'm going to ignore your request for sorting infinite data.
To sort by length of the sublists, then by lexicographic order, we can do this pretty easily. Oh, and you want duplicates removed.
We'll start with a sample:
> s
[["a","b"],["a","aa","aaa"],["b","bb","bbb"]]
And then build the program incrementally.
First sorting on length (using Data.Ord.comparing to build the sort body):
> sortBy (comparing length) s
[["a","b"],["a","aa","aaa"],["b","bb","bbb"]]
Ok. That looks reasonable. So let's just concat, and sortBy length then alpha:
> sortBy (comparing length) . nub . concat $ s
["a","b","aa","bb","aaa","bbb"]
This works if your input is sorted; otherwise you'll need a slightly different body for sortBy.
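If it isn't, one possibility (my suggestion; it relies on the Semigroup instances for Ordering and for functions, available in a reasonably recent base) is to compare on length first and fall back to the ordinary comparison:
> sortBy (comparing length <> compare) . nub . concat $ s
["a","b","aa","bb","aaa","bbb"]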
Thanks to everyone for their input, and sorry for the late reply. Turns out I was just approaching the problem in the wrong way. I was trying to do what yairchu showed, but I was using the built-in function length to do the merging, and length doesn't work on an infinite list for obvious reasons. Anyway, I solved my problem by sorting as I created the list, on the go, not on the end result. I wonder what other languages offer infinite lists? Such a weird but useful concept.
Here is an algorithm that lets you sort online:
it is not efficient, but it is lazy enough to let you go to different sort generations, even if you sort infinite lists. It is a nice gimmick, but not very usable. For example, sorting the infinite list [10,9..]:
*Main> take 10 $ sortingStream [10,9..] !! 0
[9,8,7,6,5,4,3,2,1,0]
*Main> take 10 $ sortingStream [10,9..] !! 1
[8,7,6,5,4,3,2,1,0,-1]
*Main> take 10 $ sortingStream [10,9..] !! 2
[7,6,5,4,3,2,1,0,-1,-2]
*Main> take 10 $ sortingStream [10,9..] !! 3
[6,5,4,3,2,1,0,-1,-2,-3]
*Main> take 10 $ sortingStream [10,9..] !! 4
[5,4,3,2,1,0,-1,-2,-3,-4]
*Main> take 10 $ sortingStream [10,9..] !! 1000
[-991,-992,-993,-994,-995,-996,-997,-998,-999,-1000]
As you can see, the sorting improves with each generation. The code:
produce :: ([a] -> [a]) -> [a] -> [[a]]
produce f xs = f xs : (produce f (f xs))
sortingStream :: (Ord a) => [a] -> [[a]]
sortingStream = produce ss
ss :: (Ord a) => [a] -> [a]
ss [] = []
ss [x] = [x]
ss [x,y] | x <= y = [x,y]
         | otherwise = [y,x]
ss (x:y:xs) | x <= y = x: (ss (y:xs))
            | otherwise = y:(ss (x:xs))
Whether it can be done depends very much on the nature of your input data. If you can 'stop looking' for lists of a certain length when you've seen a longer one and there are only a finite number of lists of each length, then you can go through the lengths in ascending order, sort those and concatenate the results. Something like this should work:
import Data.List (sort, nub)

listsUptoLength n xss = takeWhile (\xs -> length xs <= n) $ xss

listsUptoLength' n [] = []
listsUptoLength' n (xss:xsss) = case listsUptoLength n xss of
  [] -> []
  xss' -> xss' : listsUptoLength' n xsss

listsOfLength n xsss = concatMap (\xss -> (filter (\xs -> length xs == n) xss)) (listsUptoLength' n xsss)

sortInfinite xsss = concatMap (\n -> sort . nub $ (listsOfLength n xsss)) [0..]

f xs y = [xs ++ replicate n y | n <- [1..]]

test = [ map (\x -> [x]) ['a'..'e'], f "" 'a', f "" 'b', f "b" 'a', f "a" 'b' ] ++ [f start 'c' | start <- f "" 'a']
(The names could probably be chosen to be more illuminating :)
I'm guessing you're working with regular expressions, so I think something like this could be made to work; I repeat the request for more background!
