Recursion confusion in Haskell again - subsets with an inclusion test - algorithm

I'm testing a simple program to generate subsets with an inclusion test. For example, given
*Main Data.List> factorsets 7
[([2],2),([2,3],1),([3],1),([5],1),([7],1)]
calling chooseP 3 (factorsets 7), I would like to get (read from right to left, a la cons)
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([2],2)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
But my program is returning an extra [([7],1),([5],1),([3],1)] (and missing a [([7],1),([5],1),([2],2)]):
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([3],1)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
The inclusion test is: members' first part of the tuple must have a null intersection.
Once tested as working, the plan is to sum the internal products of each subset's snds, rather than accumulate them.
Since I've asked a similar question before, I imagine that an extra branch is generated since when the recursion splits at [2,3], the second branch runs over the same possibilities once it passes the skipped section. Any pointers on how to resolve that would be appreciated; and if you'd like to share ideas about how to enumerate and sum such product combinations more efficiently, that would be great, too.
Haskell code:
chooseP k xs = chooseP' xs [] 0 where
chooseP' [] product count = if count == k then [product] else []
chooseP' yys product count
| count == k = [product]
| null yys = []
| otherwise = f ++ g
where (y:ys) = yys
(factorsY,numY) = y
f = let zzs = dropWhile (\(fs,ns) -> not . and . map (null . intersect fs . fst) $ product) yys
in if null zzs
then chooseP' [] product count
else let (z:zs) = zzs in chooseP' zs (z:product) (count + 1)
g = if and . map (null . intersect factorsY . fst) $ product
then chooseP' ys product count
else chooseP' ys [] 0

Your code is complicated enough that I might recommend starting over. Here's how I would proceed.
Write a specification. Let it be as stupidly inefficient as necessary -- for example, the spec I choose below will build all combinations of k elements from the list, then filter out the bad ones. Even the filter will be stupidly slow.
sorted xs = sort xs == xs
unique xs = nub xs == xs
disjoint xs = and $ liftM2 go xs xs where
go x1 x2 = x1 == x2 || null (intersect x1 x2)
-- check that x is valid according to all the validation functions in fs
-- (there are other fun ways to spell this, but this is particularly
-- readable and clearly correct -- just what we want from a spec)
allFuns fs x = all ($x) fs
choosePSpec k = filter good . replicateM k where
good pairs = allFuns [unique, disjoint, sorted] (map fst pairs)
Just to make sure it's right, we can test it at the prompt:
*Main> mapM_ print $ choosePSpec 3 [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
[([2],2),([3],1),([5],1)]
[([2],2),([3],1),([7],1)]
[([2],2),([5],1),([7],1)]
[([2,3],1),([5],1),([7],1)]
[([3],1),([5],1),([7],1)]
Looks good.
Now that we have a spec, we can try to improve the speed one refactoring at a time, always checking that it matches the spec. The first thing I'd want to do is notice that we can ensure uniqueness and sortedness just by sorting the input and picking things "in an increasing way". To do this, we can define a function which chooses subsequences of a given length. It piggy-backs on the tails function, which you can think of as nondeterministically choosing a place to split its input list.
subseq 0 xs = [[]]
subseq n xs = do
x':xt <- tails xs
xs' <- subseq (n-1) xt
return (x':xs')
Here's an example of this function in action:
*Main> subseq 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
Now we can write a slightly faster chooseP by replacing replicateM with subseq. Recall that we're assuming the inputs are already sorted and unique, though.
choosePSlow k = filter good . subseq k where
good pairs = disjoint $ map fst pairs
We can sanity-check that it's working by running it on the particular input we have from above:
*Main> let i = [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
*Main> choosePSlow 3 i == choosePSpec 3 i
True
Or, better yet, we can stress-test it with QuickCheck. We'll need a tiny bit more code. The condition k < 5 is just because the spec is so hopelessly slow that bigger values of k take forever.
propSlowMatchesSpec :: NonNegative Int -> OrderedList ([Int], Int) -> Property
propSlowMatchesSpec (NonNegative k) (Ordered xs)
= k < 5 && unique (map fst xs)
==> choosePSlow k xs == choosePSpec k xs
*Main> quickCheck propSlowMatchesSpec
+++ OK, passed 100 tests.
There are several more opportunities to make things faster. For instance, the disjoint test could be sped up using choose 2 instead of liftM2; or we might be able to ensure disjointness during element selection and prune the search even earlier; etc. How you want to improve it from here I leave to you -- but the basic technique (start with stupid and slow, then make it smarter, testing as you go) should be helpful to you.

Related

Algorithm to find list given dot product and another list

I need to write a function findL that takes a list L1 of integers and a desired dot product n, and returns a list L2 of nonnegative integers such that L1 · L2 = n. (By "dot product" I mean the sum of the pairwise products; for example, [1,2] · [3,4] = 1·3+2·4 = 11.)
So, for example, findL(11, [1,2]) might return SOME [3,4]. If there's no possible list, I return NONE.
I'm using a functional language. (Specifically Standard ML, but the exact language isn't so important, I'm just trying to think of an FP algorithm.) What I have written so far:
Let's say I have findL(n, L1):
if L1 = [], I return NONE.
if L1 = [x] (list of length 1)
if (n >= 0 and x > 0 and n mod x = 0), return SOME [n div x]
else return NONE
If L1 has length greater than 1, I recurse on findL (n, L[1:]). If that returns a list L2, I return [1] concatenated to L2. If the recursive call returns NONE, I did another recursive call on findL (0, L[1:]) and prepended [n div x] to the result if it wasn't NONE. This works on many inputs but are failing on others.
I need to change part 3, but I'm not sure if I have the right idea. I would appreciate any tips!
Unless you need to say that empty lists in the input are always bad (even n = 0 with the list []), I'd recommend returning something different for an empty list based on whether you've reached 0 at the end (everything has been subtracted away) or not, then recurse when receiving any nonempty list rather than special-casing a one-element list.
As far as step three, you need to test every possible positive integer multiple of the first element of your input list until they exceed n, not just the first and last. The first non-None value you get is good enough, so you just prepend the multiplier (not the multiple) to the return list. If everything gives you Nones, you return None.
I don't know SML, but here's how I'd do it in Haskell:
import Data.Maybe (isJust, listToMaybe)
-- Find linear combinations of positive integers
solve :: Integer -> [Integer] -> Maybe [Integer]
-- If we've made it to the end with zero left, good!
solve 0 [] = Just []
-- Otherwise, this way isn't the way to go.
solve _ [] = Nothing
-- If one of the elements of the input list is zero, just multiply that element by one.
solve n (0:xs) = case solve n xs of
Nothing -> Nothing
Just ys -> Just (1:ys)
solve n (x:xs) = listToMaybe -- take first solution if it exists
. map (\ (m, Just ys) -> m:ys) -- put multiplier at front of list
. filter (isJust . snd) -- remove nonsolutions
. zip [1 ..] -- tuple in the multiplier
. map (\ m -> solve (n - m) xs) -- use each multiple
$ [x, x + x .. n] -- the multiples of x up to n
Here it is solving 11 with [1, 2] and 1 with [1, 2].

sort a list of numbers by their 'visual similarity'

consider a function, which rates the level of 'visual similarity' between two numbers: 666666 and 666166 would be very similar, unlike 666666 and 111111
type N = Int
type Rate = Int
similar :: N -> N -> Rate
similar a b = length . filter id . zipWith (==) a' $ b'
where a' = show a
b' = show b
similar 666666 666166
--> 5
-- high rate : very similar
similar 666666 111111
--> 0
-- low rate : not similar
There will be more sophisticated implementations for this, however this serves the purpose.
The intention is to find a function that sorts a given list of N's, so that each item is the most similar one to it's preceding item. Since the first item does not have a predecessor, there must be a given first N.
similarSort :: N -> [N] -> [N]
Let's look at some sample data: They don't need to have the same arity but it makes it easier to reason about it.
sample :: [N]
sample = [2234, 8881, 1222, 8888, 8822, 2221, 5428]
one could be tempted to implement the function like so:
similarSortWrong x xs = reverse . sortWith (similar x) $ xs
but this would lead to a wrong result:
similarSortWrong 2222 sample
--> [2221,1222,8822,2234,5428,8888,8881]
In the beginning it looks correct, but it's obvious that 8822 should rather be followed by 8881, since it's more similar that 2234.
So here's the implementation I came up with:
similarSort _ [] = []
similarSort x xs = x : similarSort a as
where (a:as) = reverse . sortWith (similar x) $ xs
similarSort 2222 sample
--> [2222,2221,2234,1222,8822,8888,8881]
It seems to work. but it also seems to do lot more more work than necessary. Every step the whole rest is sorted again, just to pick up the first element. Usually lazyness should allow this, but reverse might break this again. I'd be keen to hear, if someone know if there's a common abstraction for this problem.
It's relatively straightforward to implement the greedy algorithm you ask for. Let's start with some boilerplate; we'll use the these package for a zip-like that hands us the "unused" tail ends of zipped-together lists:
import Data.Align
import Data.These
sampleStart = "2222"
sampleNeighbors = ["2234", "8881", "1222", "8888", "8822", "2221", "5428"]
Instead of using numbers, I'll use lists of digits -- just so we don't have to litter the code with conversions between the form that's convenient for the user and the form that's convenient for the algorithm. You've been a bit fuzzy about how to rate the similarity of two digit strings, so let's make it as concrete as possible: any digits that differ cost 1, and if the digit strings vary in length we have to pay 1 for each extension to the right. Thus:
distance :: Eq a => [a] -> [a] -> Int
distance l r = sum $ alignWith elemDistance l r where
elemDistance (These l r) | l == r = 0
elemDistance _ = 1
A handy helper function will pick the smallest element of some list (by a user-specified measure) and return the rest of the list in some implementation-defined order.
minRestOn :: Ord b => (a -> b) -> [a] -> Maybe (a, [a])
minRestOn f [] = Nothing
minRestOn f (x:xs) = Just (go x [] xs) where
go min rest [] = (min, rest)
go min rest (x:xs) = if f x < f min
then go x (min:rest) xs
else go min (x:rest) xs
Now the greedy algorithm almost writes itself:
greedy :: Eq a => [a] -> [[a]] -> [[a]]
greedy here neighbors = here : case minRestOn (distance here) neighbors of
Nothing -> []
Just (min, rest) -> greedy min rest
We can try it out on your sample:
> greedy sampleStart sampleNeighbors
["2222","1222","2221","2234","5428","8888","8881","8822"]
Just eyeballing it, that seems to do okay. However, as with many greedy algorithms, this one only minimizes the local cost of each edge in the path. If you want to minimize the total cost of the path found, you need to use another algorithm. For example, we can pull in the astar package. For simplicity, I'm going to do everything in a very inefficient way, but it's not too hard to do it "right". We'll need a fair chunk more imports:
import Data.Graph.AStar
import Data.Hashable
import Data.List
import Data.Maybe
import qualified Data.HashSet as HS
Unlike before, where we only wanted the nearest neighbor, we'll now want all the neighbors. (Actually, we could probably implement the previous use of minRestOn using the following function and minimumOn or something. Give it a try if you're interested!)
neighbors :: (a, [a]) -> [(a, [a])]
neighbors (_, xs) = go [] xs where
go ls [] = []
go ls (r:rs) = (r, ls ++ rs) : go (r:ls) rs
We can now call the aStar search method with appropriate parameters. We'll use ([a], [[a]]) -- representing the current list of digits and the remaining lists that we can choose from -- as our node type. The arguments to aStar are then, in order: the function for finding neighboring nodes, the function for computing distance between neighboring nodes, the heuristic for how far we have left to go (we'll just say 1 for each unique element in the list), whether we've reached a goal node, and the initial node to start the search from. We'll call fromJust, but it should be okay: all nodes have at least one path to a goal node, just by choosing the remaining lists of digits in order.
optimal :: (Eq a, Ord a, Hashable a) => [a] -> [[a]] -> [[a]]
optimal here elsewhere = (here:) . map fst . fromJust $ aStar
(HS.fromList . neighbors)
(\(x, _) (y, _) -> distance x y)
(\(x, xs) -> HS.size (HS.fromList (x:xs)) - 1)
(\(_, xs) -> null xs)
(here, elsewhere)
Let's see it run in ghci:
> optimal sampleStart sampleNeighbors
["2222","1222","8822","8881","8888","5428","2221","2234"]
We can see that it's done better this time by adding a pathLength function that computes all the distances between neighbors in a result.
pathLength :: Eq a => [[a]] -> Int
pathLength xs = sum [distance x y | x:y:_ <- tails xs]
In ghci:
> pathLength (greedy sampleStart sampleNeighbors)
15
> pathLength (optimal sampleStart sampleNeighbors)
14
In this particular example, I think the greedy algorithm could have found the optimal path if it had made the "right" choices whenever there were ties for minimal next step; but I expect it is not too hard to cook up an example where the greedy algorithm is forced into bad early choices.

Haskell Recursion - finding largest difference between numbers in list

Here's the problem at hand: I need to find the largest difference between adjacent numbers in a list using recursion. Take the following list for example: [1,2,5,6,7,9]. The largest difference between two adjacent numbers is 3 (between 2 and 5).
I know that recursion may not be the best solution, but I'm trying to improve my ability to use recursion in Haskell.
Here's the current code I currently have:
largestDiff (x:y:xs) = if (length (y:xs) > 1) then max((x-y), largestDiff (y:xs)) else 0
Basically - the list will keep getting shorter until it reaches 1 (i.e. no more numbers can be compared, then it returns 0). As 0 passes up the call stack, the max function is then used to implement a 'King of the Hill' type algorithm. Finally - at the end of the call stack, the largest number should be returned.
Trouble is, I'm getting an error in my code that I can't work around:
Occurs check: cannot construct the infinite type:
t1 = (t0, t1) -> (t0, t1)
In the return type of a call of `largestDiff'
Probable cause: `largestDiff' is applied to too few arguments
In the expression: largestDiff (y : xs)
In the first argument of `max', namely
`((x - y), largestDiff (y : xs))'
Anyone have some words of wisdom to share?
Thanks for your time!
EDIT: Thanks everyone for your time - I ended up independently discovering a much simpler way after much trial and error.
largestDiff [] = error "List too small"
largestDiff [x] = error "List too small"
largestDiff [x,y] = abs(x-y)
largestDiff (x:y:xs) = max(abs(x-y)) (largestDiff (y:xs))
Thanks again, all!
So the reason why your code is throwing an error is because
max((x-y), largestDiff (y:xs))
In Haskell, you do not use parentheses around parameters and separate them by commas, the correct syntax is
max (x - y) (largestDiff (y:xs))
The syntax you used is getting parsed as
max ((x - y), largestDiff (y:xs))
Which looks like you're passing a tuple to max!
However, this does not solve the problem. I always got 0 back. Instead, I would recommend breaking up the problem into two functions. You want to calculate the maximum of the difference, so first write a function to calculate the differences and then a function to calculate the maximum of those:
diffs :: Num a => [a] -> [a]
diffs [] = [] -- No elements case
diffs [x] = [] -- One element case
diffs (x:y:xs) = y - x : diffs (y:xs) -- Two or more elements case
largestDiff :: (Ord a, Num a) => [a] -> a
largestDiff xs = maximum $ map abs $ diffs xs
Notice how I've pulled the recursion out into the simplest possible case. We didn't need to calculate the maximum as we traversed the list; it's possible, just more complex. Since Haskell has a handy built-in function for calculating the maximum of a list for us, we can also leverage that. Our recursive function is clean and simple, and it is then combined with maximum to implement the desired largestDiff. As an FYI, diffs is really just a function to compute the derivative of a list of numbers, it can be a very useful function for data processing.
EDIT: Needed Ord constraint on largestDiff and added in map abs before calculating maximum.
Here's my take at it.
First some helpers:
diff a b = abs(a-b)
pick a b = if a > b then a else b
Then the solution:
mdiff :: [Int] -> Int
mdiff [] = 0
mdiff [_] = 0
mdiff (a:b:xs) = pick (diff a b) (mdiff (b:xs))
You have to provide two closing clauses, because the sequence might have either even or odd number of elements.
Another solution to this problem, which circumvents your error, can be obtained
by just transforming lists and folding/reducing them.
import Data.List (foldl')
diffs :: (Num a) => [a] -> [a]
diffs x = zipWith (-) x (drop 1 x)
absMax :: (Ord a, Num a) => [a] -> a
absMax x = foldl' max (fromInteger 0) (map abs x)
Now I admit this is a bit dense for a beginner, so I will explain the above.
The function zipWith transforms two given lists by using a binary function,
which is (-) in this case.
The second list we pass to zipWith is drop 1 x, which is just another way of
describing the tail of a list, but where tail [] results in an error,
drop 1 [] just yields the empty list. So drop 1 is the "safer" choice.
So the first function calculates the adjacent differences.
The name of the second function suggests that it calculates the maximum absolute
value of a given list, which is only partly true, it results in "0" if passed an
empty list.
But how does this happen, reading from right to left, we see that map abs
transforms every list element to its absolute value, which is asserted by
the Num a constraint. Then the foldl'-function traverses the list and
accumulates the maximum of the previous accumulator and the current element of
the list traversal. Moreover I'd like to mention that foldl' is the "strict"
sister/brother of the foldl-function, where the latter is rarely of use,
because it tends to build up a bunch of unevaluated expressions called thunks.
So let's quit all this blah blah and see it in action ;-)
> let a = diffs [1..3] :: [Int]
>>> zipWith (-) [1,2,3] (drop 1 [1,2,3])
<=> zipWith (-) [1,2,3] [2,3]
<=> [1-2,2-3] -- zipWith stops at the end of the SHORTER list
<=> [-1,-1]
> b = absMax a
>>> foldl' max (fromInteger 0) (map abs [-1,-1])
-- fromInteger 0 is in this case is just 0 - interesting stuff only happens
-- for other numerical types
<=> foldl' max 0 (map abs [-1,-1])
<=> foldl' max 0 [1,1]
<=> foldl' max (max 0 1) [1]
<=> foldl' max 1 [1]
<=> foldl' max (max 1 1) []
<=> foldl' max 1 [] -- foldl' _ acc [] returns just the accumulator
<=> 1

Finding unique (as in only occurring once) element haskell

I need a function which takes a list and return unique element if it exists or [] if it doesn't. If many unique elements exists it should return the first one (without wasting time to find others).
Additionally I know that all elements in the list come from (small and known) set A.
For example this function does the job for Ints:
unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
where first [] = []
first (x:xs) = x
ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
ghci> [1]
This is however not good enough because it involves sorting (n log n) while it could be done in linear time (because A is small).
Additionally it requires the type of list elements to be Ord while all which should be needed is Eq. It would also be nice if amount of comparisons was as small as possible (ie if we traverse a list and encounter element el twice we don't test subsequent elements for equality with el)
This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.
The question is: how to do it correctly and efficiently in Haskell ?
Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.
import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative
Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).
counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
where
insert t x = IntTrie.modify' (fromEnum x) (+1) t
We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).
And then look for ones that are unique.
uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
where
cts = counts xs
This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.
uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).
There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.
Also, note that to avoid useless comparisons you'd need a way of checking to see if an element has been encountered before, and the only way to do that would be to have a list of elements known to have multiple occurrences, and the only way to check if the current element is in that list is... to compare it for equality with each.
If you want this to work faster than O(something really horrible) you need that Ord constraint.
Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:
unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
| x `notElem` candidates = unique candidates results xs
| x `elem` results = unique (delete x candidates) (delete x results) xs
| otherwise = unique candidates (x:results) xs
The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.
If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.
Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.
Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.
First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.
Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.
If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:
firstUnique (x:xs)
| elem x xs = firstUnique (filter (/= x) xs)
| otherwise = Just x
firstUnique [] = Nothing
Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.
Edit:
The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...
However, an implementation that provides an early out in case of set exhaustion:
firstUnique = f [] [<small/known set of possible elements>] where
f [] [] _ = Nothing -- early out
f uniques noshows (x:xs)
| elem x uniques = f (delete x uniques) noshows xs
| elem x noshows = f (x:uniques) (delete x noshows) xs
| otherwise = f uniques noshows xs
f [] _ [] = Nothing
f (u:_) _ [] = Just u
Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...
As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.
If we are able to compare elements, an obvious O(n log n) solution to compute the count of elements first and then find the first one with count equal to 1:
import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)
count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m
add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m
uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
where
cs = foldl' add Map.empty xs
Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k unique elements then the size of our map will be at most k, so the overall complexity will be just O(n log k).
However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:
import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)
count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)
add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)
uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
-- Count all elements into a hash table:
ht <- HT.newSized (length xs)
forM_ xs (add ht)
-- Find the first one with count 1
first (\x -> liftM (== 1) (count ht x)) xs
-- Monadic variant of find which exists once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
where
f [] = return Nothing
f (x:xs') = do
b <- p x
if b then return (Just x)
else f xs'
Notes:
If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs but in the case of many distinct elements the hash table will be have to resized several times.
Here is a version that does the trick:
unique :: Eq a => [a] -> [a]
unique = select . collect []
where
collect acc [] = acc
collect acc (x : xs) = collect (insert x acc) xs
insert x [] = [[x]]
insert x (ys#(y : _) : yss)
| x == y = (x : ys) : yss
| otherwise = ys : insert x yss
select [] = []
select ([x] : _) = [x]
select ((_ : _) : xss) = select xss
So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).
The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.
Something like this look pretty good.
unique = fst . foldl' (\(a, b) c -> if (c `elem` b)
then (a, b)
else if (c `elem` a)
then (delete c a, c:b)
else (c:a, b)) ([],[])
The first element of the resulted tuple of the fold, contain what you are expecting, a list containing unique element. The second element of the tuple is the memory of the process remembered if an element has already been discarded or not.
About space performance.
As your problem is design, all the element of the list should be traversed at least one time, before a result can be display. And the internal algorithm must keep trace of discarded value in addition to the good one, but discarded value will appears only one time. Then in the worst case the required amount of memory is equal to the size of the inputted list. This sound goods as you said that expected input are small.
About time performance.
As the expected input are small and not sorted by default, trying to sort the list into the algorithm is useless, or before to apply it is useless. In fact statically we can almost said, that the extra operation to place an element at its ordered place (into the sub list a and b of the tuple (a,b)) will cost the same amount of time than to check if this element appear into the list or not.
Below a nicer and more explicit version of the foldl' one.
import Data.List (foldl', delete, elem)
unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], [])
where
algorithm (result0, memory0) current =
if (current `elem` memory0)
then (result0, memory0)
else if (current`elem` result0)
then (delete current result0, memory)
else (result, memory0)
where
result = current : result0
memory = current : memory0
Into the nested if ... then ... else ... instruction the list result is traversed twice in the worst case, this can be avoid using the following helper function.
unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], [])
where
algorithm (result, memory) current =
if (current `elem` memory)
then (result, memory)
else helper current result memory []
where
helper current [] [] acc = ([current], [])
helper current [] memory acc = (acc, memory)
helper current (r:rs) memory acc
| current == r = (acc ++ rs, current:memory)
| otherwise = helper current rs memory (r:acc)
But the helper can be rewrite using fold as follow, which is definitely nicer.
helper current [] _ = ([current],[])
helper current memory result =
foldl' (\(r, m) x -> if x==current
then (r, current:m)
else (current:r, m)) ([], memory) $ result

In Haskell, how can you sort a list of infinite lists of strings?

So basically, if I have a (finite or infinite) list of (finite or infinite) lists of strings, is it possible to sort the list by length first and then by lexicographic order, excluding duplicates? A sample input/output would be:
Input:
[["a", "b",...], ["a", "aa", "aaa"], ["b", "bb", "bbb",...], ...]
Output:
["a", "b", "aa", "bb", "aaa", "bbb", ...]
I know that the input list is not a valid haskell expression but suppose that there is an input like that. I tried using merge algorithm but it tends to hang on the inputs that I give it. Can somebody explain and show a decent sorting function that can do this? If there isn't any function like that, can you explain why?
In case somebody didn't understand what I meant by the sorting order, I meant that shortest length strings are sorted first AND if one or more strings are of same length then they are sorted using < operator.
Thanks!
Ultimately, you can't sort an infinite list, because items at the tail of the list could percolate all the way to the front of the result, so you can't finish sorting an infinite list until you've seen the last item, but your list is infinite, so you'll never get there.
The only way that you could even try to sort an infinite list would require constraints on the inhabitants of the list. If the values of the list items comes from a well-founded set and the contents of the list are unique then you could at least make some progress in returning elements the initial elements of the list. For instance if the list was of distinct natural numbers, you could return the first 0 you see, then the first 1, etc. but you couldn't make any headway in the result until you saw 2, no matter how far down the list you went. Ultimately, if you ever skipped an element in the set because it wasn't present in the source, you'd cease to produce new output elements until you had the entire input in hand.
You can do the same thing with strings, because they are well founded, but that is only even remotely viable if you plan on returning all possible strings.
In short, if you need this, you're going about solving the problem you have in the wrong way. This isn't a tractable path to any solution you will want to use.
As yairchu noted, merging a finite number of sorted infinite lists works fine though.
In general it is impossible to sort infinite lists. Because the smallest item could be at infinite position and we must find it before we output it.
Merging infinite sorted lists is possible.
In general, merging an infinite list of sorted lists is impossible. For same reason that sorting them is.
Merging an infinite list of sorted lists, which are sorted by heads (forall i j. i < j => head (lists !! i) <= head (lists !! j)), is possible.
So I'm guessing that what you really want is to merge a sorted infinite list of sorted lists. It's even a task that makes some sense. There's even some existing code that uses it, implemented for monadic lists there - kinda ugly syntax-wise etc. So here's a version for plain lists:
mergeOnSortedHeads :: Ord b => (a -> b) -> [[a]] -> [a]
mergeOnSortedHeads _ [] = []
mergeOnSortedHeads f ([]:xs) = mergeOnSortedHeads f xs
mergeOnSortedHeads f ((x:xs):ys) =
x : mergeOnSortedHeads f (bury xs ys)
where
bury [] ks = ks
bury js [] = [js]
bury js ([]:ks) = bury js ks
bury jj#(j:js) ll#(kk#(k:ks):ls)
| f j <= f k = jj : ll
| otherwise = kk : bury jj ls
ghci> take 20 $ mergeOnSortedHeads id $ [[0,4,6], [2,3,9], [3,5..], [8]] ++ map repeat [12..]
[0,2,3,3,4,5,6,7,8,9,9,11,12,12,12,12,12,12,12,12]
btw: what do you need this for?
Well, I'm going to ignore your request for sorting infinite data.
To sort by length of the sublists, then by lexicographic order, we can do this pretty easily. Oh, and you want duplicates removed.
We'll start with a sample:
> s
[["a","b"],["a","aa","aaa"],["b","bb","bbb"]]
And then build the program incrementally.
First sorting on length (using Data.Ord.comparing to build the sort body):
> sortBy (comparing length) s
[["a","b"],["a","aa","aaa"],["b","bb","bbb"]]
Ok. That looks reasonable. So let's just concat, and sortBy length then alpha:
> sortBy (comparing length) . nub . concat $ s
["a","b","aa","bb","aaa","bbb"]
If your input is sorted. Otherwise you'll need a sligtly different body to sortBy.
Thanks to everyone for their inputs and sorry for the late reply. Turns out I was just approaching the problem in a wrong way. I was trying to do what Yairchu showed but I was using the built in function length to do the merging but length doesnt work on an infinite list for obvious reasons. Anyways, I solved my problem by sorting as I created the list on the go, not on the end result. I wonder what other languages offer infinite lists? Such a weird but useful concept.
Here is an algorithm that let you online sort:
it is not efficient, but it is lazy enough to let you goto different sort generations, even if you sort infinite lists. It is a nice gimmick, but not very usable. For example sorting the infinite list [10,9..]:
*Main> take 10 $ sortingStream [10,9..] !! 0
[9,8,7,6,5,4,3,2,1,0]
*Main> take 10 $ sortingStream [10,9..] !! 1
[8,7,6,5,4,3,2,1,0,-1]
*Main> take 10 $ sortingStream [10,9..] !! 2
[7,6,5,4,3,2,1,0,-1,-2]
*Main> take 10 $ sortingStream [10,9..] !! 3
[6,5,4,3,2,1,0,-1,-2,-3]
*Main> take 10 $ sortingStream [10,9..] !! 4
[5,4,3,2,1,0,-1,-2,-3,-4]
*Main> take 10 $ sortingStream [10,9..] !! 1000
[-991,-992,-993,-994,-995,-996,-997,-998,-999,-1000]
As you can see the sorting improves each generation. The code:
produce :: ([a] -> [a]) -> [a] -> [[a]]
produce f xs = f xs : (produce f (f xs))
sortingStream :: (Ord a) => [a] -> [[a]]
sortingStream = produce ss
ss :: (Ord a) => [a] -> [a]
ss [] = []
ss [x] = [x]
ss [x,y] | x <= y = [x,y]
| otherwise = [y,x]
ss (x:y:xs) | x <= y = x: (ss (y:xs))
| otherwise = y:(ss (x:xs))
Whether it can be done depends very much on the nature of your input data. If you can 'stop looking' for lists of a certain length when you've seen a longer one and there are only a finite number of lists of each length, then you can go through the lengths in ascending order, sort those and concatenate the results. Something like this should work:
listsUptoLength n xss = takeWhile (\xs -> length xs <= n) $ xss
listsUptoLength' n [] = []
listsUptoLength' n (xss:xsss) = case listsUptoLength n xss of
[] -> []
xss' -> xss' : listsUptoLength' n xsss
listsOfLength n xsss = concatMap (\xss -> (filter (\xs -> length xs == n) xss)) (listsUptoLength' n xsss)
sortInfinite xsss = concatMap (\n -> sort . nub $ (listsOfLength n xsss)) [0..]
f xs y = [xs ++ replicate n y | n <- [1..]]
test = [ map (\x -> [x]) ['a'..'e'], f "" 'a', f "" 'b', f "b" 'a', f "a" 'b' ] ++ [f start 'c' | start <- f "" 'a']
(The names could probably be chosen to be more illuminating :)
I'm guessing you're working with regular expressions, so I think something like this could be made to work; I repeat the request for more background!

Resources