How to write function for N-ary tree traversal in Haskell - algorithm

I need to traverse N-ary tree and to the each node add number when I visited in in preorder. I have n-ary tree defined like this:
data NT a = N a [NT a] deriving Show
Example:
If I have following tree:
let ntree = N "eric" [N "lea" [N "kristy" [],N "pedro" [] ,N "rafael" []],N "anna" [],N "bety" []]
I want to transform it to
let ntree = N (1,"eric") [N (2,"lea") [N (3,"kristy") [],N (4,"pedro") [] ,N (5,"rafael") []],N (6,"anna") [],N (7,"bety") []]
"Preordedness" isnt that important.
I want to see how to write a function that passes values between levels, like how to pass number down to successor list and how to pass updated number to parent and go with that number to other branches.
So far I has been able to write functions like this:
traverse :: NT String -> String
traverse (N val []) =" "++val++" "
traverse (N val list) =val++" " ++ (concat $ map traverse list)
which outputs
"eric lea kristy pedro rafael anna bety "
EDIT: Question is:
How can I write a function
numberNodes :: NT a -> NT (a,Int)
that numbers nodes according to the preorder traversal of the tree?
Hard part for me to understand is passing auxilliary data around, could you please elaborate on that ?
In this concrete case it is one Int that means "time" or order in which I traverse this tree.

First Attempt: Hard Work
For the case of n-ary trees, there are three things going on: numbering elements, numbering trees, and numbering lists of trees. It would help to treat them separately. Types first:
aNumber :: a -- thing to number
-> Int -- number to start from
-> ( (a, Int) -- numbered thing
, Int -- next available number afterwards
)
ntNumber :: NT a -- thing to number
-> Int -- number to start from
-> ( NT (a, Int) -- numbered thing
, Int -- next available number afterwards
)
ntsNumber :: [NT a] -- thing to number
-> Int -- number to start from
-> ( [NT (a, Int)] -- numbered thing
, Int -- next available number afterwards
)
Notice that all three types share the same pattern. When you see that there is a pattern that you are following, apparently by coincidence, you know you have an opportunity to learn something. But let's press on for now and learn later.
Numbering an element is easy: copy the starting number into the output and return its successor as the next available.
aNumber a i = ((a, i), i + 1)
For the other two, the pattern (there's that word again) is
split the input into its top-level components
number each component in turn, threading the numbers through
It's easy to do the first with pattern matching (inspecting the data visually) and the second with where clauses (grabbing the two parts of the output).
For trees, a top level split gives us two components: an element and a list. In the where clause, we call the appropriate numbering functions as directed by those types. In each case, the "thing" output tells us what to put in place of the "thing" input. Meanwhile, we thread the numbers through, so the starting number for the whole is the starting number for the first component, the "next" number for the first component starts the second, and the "next" number from the second is the "next" number for the whole.
ntNumber (N a ants) i0 = (N ai aints, i2) where
(ai, i1) = aNumber a i0
(aints, i2) = ntsNumber ants i1
For lists, we have two possibilities. An empty list has no components, so we return it directly without using any more numbers. A "cons" has two components, we do exactly as we did before, using the appropriate numbering functions as directed by the type.
ntsNumber [] i = ([], i)
ntsNumber (ant : ants) i0 = (aint : aints, i2) where
(aint, i1) = ntNumber ant i0
(aints, i2) = ntsNumber ants i1
Let's give it a go.
> let ntree = N "eric" [N "lea" [N "kristy" [],N "pedro" [] ,N "rafael" []],N "anna" [],N "bety" []]
> ntNumber ntree 0
(N ("eric",0) [N ("lea",1) [N ("kristy",2) [],N ("pedro",3) [],N ("rafael",4) []],N ("anna",5) [],N ("bety",6) []],7)
So we're there. But are we happy? Well, I'm not. I have the annoying sensation that I wrote pretty much the same type three times and pretty much the same program twice. And if I wanted to do more element-numbering for differently organised data (e.g., your binary trees), I'd have to write the same thing again again. Repetitive patterns in Haskell code are always missed opportunities: it's important to develop a sense of self-criticism and ask whether there's a neater way.
Second Attempt: Numbering and Threading
Two of the repetitive patterns we saw, above, are
1. the similarity of the types,
2. the similarity of the way the numbers get threaded.
If you match up the types to see what's in common, you'll notice they're all
input -> Int -> (output, Int)
for different inputs and outputs. Let's give the largest common component a name.
type Numbering output = Int -> (output, Int)
Now our three types are
aNumber :: a -> Numbering (a, Int)
ntNumber :: NT a -> Numbering (NT (a, Int))
ntsNumber :: [NT a] -> Numbering [NT (a, Int)]
You often see such types in Haskell:
input -> DoingStuffToGet output
Now, to deal with the threading, we can build some helpful tools to work with and combine Numbering operations. To see which tools we need, look at how we combine the outputs after we've numbered the components. The "thing" parts of the outputs are always built by applying some functions which don't get numbered (data constructors, usually) to some "thing" outputs from numberings.
To deal with the functions, we can build a gadget that looks a lot like our [] case, where no actual numbering was needed.
steady :: thing -> Numbering thing
steady x i = (x, i)
Don't be put off by the way the type makes it look as if steady has only one argument: remember that Numbering thing abbreviates a function type, so there really is another -> in there. We get
steady [] :: Numbering [a]
steady [] i = ([], i)
just like in the first line of ntsNumber.
But what about the other constructors, N and (:)? Ask ghci.
> :t steady N
steady N :: Numbering (a -> [NT a] -> NT a)
> :t steady (:)
steady (:) :: Numbering (a -> [a] -> [a])
We get numbering operations with functions as outputs, and we want to generate the arguments to those function by more numbering operations, producing one big overall numbering operation with the numbers threaded through. One step of that process is to feed a numbering-generated function one numbering-generated input. I'll define that as an infix operator.
($$) :: Numbering (a -> b) -> Numbering a -> Numbering b
infixl 2 $$
Compare with the type of the explicit application operator, $
> :t ($)
($) :: (a -> b) -> a -> b
This $$ operator is "application for numberings". If we can get it right, our code becomes
ntNumber :: NT a -> Numbering (NT (a, Int))
ntNumber (N a ants) i = (steady N $$ aNumber a $$ ntsNumber ants) i
ntsNumber :: [NT a] -> Numbering [NT (a, Int)]
ntsNumber [] i = steady [] i
ntsNumber (ant : ants) i = (steady (:) $$ ntNumber ant $$ ntsNumber ants) i
with aNumber as it was (for the moment). This code just does the data reconstruction, plugging together the constructors and the numbering processes for the components. We had better give the definition of $$ and make sure it gets the threading right.
($$) :: Numbering (a -> b) -> Numbering a -> Numbering b
(fn $$ an) i0 = (f a, i2) where
(f, i1) = fn i0
(a, i2) = an i1
Here, our old threading pattern gets done once. Each of fn and an is a function, expecting a starting number, and the whole of fn $$ sn is a function, which gets the starting number i0. We thread the numbers through, collecting first the function, then the argument. We then do the actual application and hand back the final "next" number.
Now, notice that in every line of code, the i input is fed in as the argument to a numbering process. We can simplify this code by just talking about the processes, not the numbers.
ntNumber :: NT a -> Numbering (NT (a, Int))
ntNumber (N a ants) = steady N $$ aNumber a $$ ntsNumber ants
ntsNumber :: [NT a] -> Numbering [NT (a, Int)]
ntsNumber [] = steady []
ntsNumber (ant : ants) = steady (:) $$ ntNumber ant $$ ntsNumber ants
One way to read this code is to filter out all the Numbering, steady and $$ uses.
ntNumber :: NT a -> ......... (NT (a, Int))
ntNumber (N a ants) = ...... N .. (aNumber a) .. (ntsNumber ants)
ntsNumber :: [NT a] -> ......... [NT (a, Int)]
ntsNumber [] = ...... []
ntsNumber (ant : ants) = ...... (:) .. (ntNumber ant) .. (ntsNumber ants)
and you see it just looks like a preorder traversal, reconstructing the original data structure after processing the elements. We're doing the right thing with the values, provided steady and $$ are correctly combining the processes.
We could try to do the same for aNumber
aNumber :: a -> Numbering a
aNumber a = steady (,) $$ steady a $$ ????
but the ???? is where we actually need the number. We could build a numbering process that fits in that hole: a numbering process that issues the next number.
next :: Numbering Int
next i = (i, i + 1)
That's the essence of numbering, the "thing" output is the number to be used now (which is the starting number), and the "next" number output is the one after. We may write
aNumber a = steady (,) $$ steady a $$ next
which simplifies to
aNumber a = steady ((,) a) $$ next
In our filtered view, that's
aNumber a = ...... ((,) a) .. next
What we've done is to bottle the idea of a "numbering process" and we've built the right tools to do ordinary functional programming with those processes. The threading pattern turns into the definitions of steady and $$.
Numbering is not the only thing that works this way. Try this...
> :info Applicative
class Functor f => Applicative (f :: * -> *) where
pure :: a -> f a
(<*>) :: f (a -> b) -> f a -> f b
...and you also get some more stuff. I just want to draw attention to the types of pure and <*>. They're a lot like steady and $$, but they are not just for Numbering. Applicative is the type class for every kind of process which works that way. I'm not saying "learn Applicative now!", just suggesting a direction of travel.
Third Attempt: Type-Directed Numbering
So far, our solution is directed towards one particular data structure, NT a, with [NT a] showing up as an auxiliary notion because it's used in NT a. We can make the whole thing a bit more plug-and-play if we focus on one layer of the type at a time. We defined numbering a list of trees in terms of numbering trees. In general, we know how to number a list of stuff if we know how to number each item of stuff.
If we know how to number an a to get b, we should be able to number a list of a to get a list of b. We can abstract over "how to process each item".
listNumber :: (a -> Numbering b) -> [a] -> Numbering [b]
listNumber na [] = steady []
listNumber na (a : as) = steady (:) $$ na a $$ listNumber na as
and now our old list-of-trees-numbering function becomes
ntsNumber :: [NT a] -> Numbering [NT (a, Int)]
ntsNumber = listNumber ntNumber
which is hardly worth naming. We can just write
ntNumber :: NT a -> Numbering (NT (a, Int))
ntNumber (N a ants) = steady N $$ aNumber a $$ listNumber ntNumber ants
We can play the same game for the trees themselves. If you know how to number stuff, you know how to number a tree of stuff.
ntNumber' :: (a -> Numbering b) -> NT a -> Numbering (NT b)
ntNumber' na (N a ants) = steady N $$ na a $$ listNumber (ntNumber' na) ants
Now we can do things like this
myTree :: NT [String]
myTree = N ["a", "b", "c"] [N ["d", "e"] [], N ["f"] []]
> ntNumber' (listNumber aNumber) myTree 0
(N [("a",0),("b",1),("c",2)] [N [("d",3),("e",4)] [],N [("f",5)] []],6)
Here, the node data is now itself a list of things, but we've been able to number those things individually. Our equipment is more adaptable because each component aligns with one layer of the type.
Now, try this:
> :t traverse
traverse :: (Applicative f, Traversable t) => (a -> f b) -> t a -> f (t b)
It's an awful lot like the thing we just did, where f is Numbering and t is sometimes lists and sometimes trees.
The Traversable class captures what it means to be a type-former that lets you thread some sort of process through the stored elements. Again, the pattern you're using is very common and has been anticipated. Learning to use traverse saves a lot of work.
Eventually...
...you'll learn that a thing to do the job of Numbering already exists in the library: it's called State Int and it belongs to the Monad class, which means it must also be in the Applicative class. To get hold of it,
import Control.Monad.State
and the operation which kicks off a stateful process with its initial state, like our feeding-in of 0, is this thing:
> :t evalState
evalState :: State s a -> s -> a
Our next operation becomes
next' :: State Int Int
next' = get <* modify (1+)
where get is the process that accesses the state, modify makes a process that changes the state, and <* means "but also do".
If you start you file with the language extension pragma
{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
you can declare your datatype like this
data NT a = N a [NT a] deriving (Show, Functor, Foldable, Traversable)
and Haskell will write traverse for you.
Your program then becomes one line...
evalState (traverse (\ a -> pure ((,) a) <*> get <* modify (1+)) ntree) 0
-- ^ how to process one element ^^^^^^^^^^^^^^^
-- ^ how to process an entire tree of elements ^^^^^^^^^
-- ^ processing your particular tree ^^^^^^^^^^^^^^^^^^^^^^^^^^^
-- ^ kicking off the process with a starting number of 0 ^^^^^^^^^^^^^^^^
...but the journey to that one line involves a lot of "bottling the pattern" steps, which takes some (hopefully rewarding) learning.

I'll update this answer as soon as I get some progress.
Right now I reduced problem from n-ary tree to binary tree.
data T a = Leaf a | N (T a) a (T a) deriving Show
numberNodes:: T a -> T (a,Int)
numberNodes tree = snd $ numberNodes2 tree 0
numberNodes2:: T a -> Int -> (Int, T (a,Int))
numberNodes2 (Leaf a) time = (time,Leaf (a,time))
numberNodes2 (N left nodeVal right) time = (rightTime, N leftTree (nodeVal,time) rightTree )
where (leftTime,leftTree) = numberNodes2 left (time+1)
(rightTime,rightTree) = numberNodes2 right (leftTime+1)
Function numberNodes creates from this tree:
let bt = N (N (Leaf "anna" ) "leo" (Leaf "laura")) "eric" (N (Leaf "john") "joe" (Leaf "eddie"))
following tree:
N (N (Leaf ("anna",2)) ("leo",1) (Leaf ("laura",3))) ("eric",0) (N (Leaf ("john",5)) ("joe",4) (Leaf ("eddie",6)))
And now just rewrite it for n-ary tree...( which I don't know how to do, any hints? )

This answer by #pigworker is excellent, and I learned lots from it.
However, I believe we can use mapAccumL from Data.Traversable to achieve a very similar behaviour:
{-# LANGUAGE DeriveTraversable #-}
import Data.Traversable
import Data.Tuple
-- original data type from the question
data NT a = N a [NT a]
deriving (Show, Functor, Foldable, Traversable)
-- additional type from #pigworker's answer
type Numbering output = Int -> (output, Int)
-- compare this to signature of ntNumber
-- swap added to match the signature
ntNumberSimple :: (NT a) -> Numbering (NT (a, Int))
ntNumberSimple t n = swap $ mapAccumL func n t
where
func i x = (i+1, (x, i))
I believe that mapAccumL is using the very same State monad under the hood, but at the very least it's completely hidden from the caller.

Related

Iterate State Monad and Collect Results in Sequence with Good Performance

I implemented the following function:
iterateState :: Int -> (a -> State s a) -> (a -> State s [a])
iterateState 0 f a = return []
iterateState n f a = do
b <- f a
xs <- iterateState (n - 1) f b
return $ b : xs
My primary use case is for a = Double. It works, but it is very slow. It allocates 528MB of heap space to produce a list of 1M Double values and spends most of its time doing garbage collection.
I have experimented with implementations that work on the type s -> (a, s) directly as well as with various strictness annotations. I was able to reduce the heap allocation somewhat, but not even close to what one would expect from a reasonable implementation. I suspect that the resulting ([a], s) being a combination of something to be consumed lazily ([a]) and something whose WHNF forces the entire computation (s) makes optimization difficult for GHC.
Assuming that the iterative nature of lists would be unsuitable for this situation, I turned to the vector package. To my delight, it already contains
iterateNM :: (Monad m, Unbox a) => Int -> (a -> m a) -> a -> m (Vector a)
Unfortunately, this is only slightly faster than my list implementation, still allocating 328MB of heap space. I assumed that this is because it uses unstreamM, whose description reads
Load monadic stream bundle into a newly allocated vector. This function goes through a list, so prefer using unstream, unless you need to be in a monad.
Looking at its behavior for the list monad, it is understandable that there is no efficient implementation for general monads. Luckily, I only need the state monad, and I found another function that almost fits the signature of the state monad.
unfoldrExactN :: Unbox a => Int -> (b -> (a, b)) -> b -> Vector a
This function is blazingly fast and performs no excess heap allocation beyond the 8MB needed to hold the resulting unboxed vector of 1M Double values. Unfortunately, it does not return the final state at the end of the computation, so it cannot be wrapped in the State type.
I looked at the implementation of unfoldrExactN to see if I could adjust it to expose the final state at the end of the computation. Unfortunately, this seems to be difficult, as the stream constructed by
unfoldrExactN :: Monad m => Int -> (s -> (a, s)) -> s -> Stream m a
which is eventually expanded into a vector by unstream has already forgotten the state type s.
I imagine I could circumvent the entire Stream infrastructure and implement iterateState directly on mutable vectors in the ST monad (similarly to how unstream expands a stream into a vector). However, I would lose all the benefits of stream fusion, as well as turning a computation that is easily expressed as a pure function into imperative low-level mush just for performance reasons. This is particularly frustrating while knowing that the existing unfoldrExactN already calculates all the values I want, but I have no access to them.
Is there a better way?
Can this function be implemented in a purely functional way with reasonable performance and no excess heap allocations? Preferably in a way that ties into the vector package and its stream fusion infrastructure.
The following program has 12MB max residency on my computer when compiled with optimizations:
import Data.Vector.Unboxed
import Data.Vector.Unboxed.Mutable
iterateNState :: Unbox a => Int -> (a -> s -> (s, a)) -> (a -> s -> (s, Vector a))
iterateNState n f a0 s0 = createT (unsafeNew n >>= go 0 a0 s0) where
go i a s arr
| i >= n = pure (s, arr)
| otherwise = do
unsafeWrite arr i a
case f a s of
(s', a') -> go (i+1) a' s' arr
main = id
. print
. Data.Vector.Unboxed.sum
. snd
$ iterateNState 1000000 (\a s -> (s+1, a+s :: Int)) 0 0
(It continues to have a nice low residency even when the final two 0s are read from input dynamically.)

Is there such a thing as maximumWith?

Specifically I'm searching for a function 'maximumWith',
maximumWith :: (Foldable f, Ord b) => (a -> b) -> f a -> a
Which behaves in the following way:
maximumWith length [[1, 2], [0, 1, 3]] == [0, 1, 3]
maximumWith null [[(+), (*)], []] == []
maximumWith (const True) x == head x
My use case is picking the longest word in a list.
For this I'd like something akin to maximumWith length.
I'd thought such a thing existed, since sortWith etc. exist.
Let me collect all the notes in the comments together...
Let's look at sort. There are 4 functions in the family:
sortBy is the actual implementation.
sort = sortBy compare uses Ord overloading.
sortWith = sortBy . comparing is the analogue of your desired maximumWith. However, this function has an issue. The ranking of an element is given by applying the given mapping function to it. However, the ranking is not memoized, so if an element needs to compared multiple times, the ranking will be recomputed. You can only use it guilt-free if the ranking function is very cheap. Such functions include selectors (e.g. fst), and newtype constructors. YMMV on simple arithmetic and data constructors. Between this inefficiency, the simplicity of the definition, and its location in GHC.Exts, it's easy to deduce that it's not used that often.
sortOn fixes the inefficiency by decorating each element with its image under the ranking function in a pair, sorting by the ranks, and then erasing them.
The first two have analogues in maximum: maximumBy and maximum. sortWith has no analogy; you may as well write out maximumBy (comparing _) every time. There is also no maximumOn, even though such a thing would be more efficient. The easiest way to define a maximumOn is probably just to copy sortOn:
maximumOn :: (Functor f, Foldable f, Ord r) => (a -> r) -> f a -> a
maximumOn rank = snd . maximumBy (comparing fst) . fmap annotate
where annotate e = let r = rank e in r `seq` (r, e)
There's a bit of interesting code in maximumBy that keeps this from optimizing properly on lists. It also works to use
maximumOn :: (Foldable f, Ord r) => (a -> r) -> f a -> a
maximumOn rank = snd . fromJust . foldl' max' Nothing
where max' Nothing x = let r = rank x in r `seq` Just (r, x)
max' old#(Just (ro, xo)) xn = let rn = rank xn
in case ro `compare` rn of
LT -> Just (rn, xo)
_ -> old
These pragmas may be useful:
{-# SPECIALIZE maximumOn :: Ord r => (a -> r) -> [a] -> a #-}
{-# SPECIALIZE maximumOn :: (a -> Int) -> [a] -> a #-}
HTNW has explained how to do what you asked, but I figured I should mention that for the specific application you mentioned, there's a way that's more efficient in certain cases (assuming the words are represented by Strings). Suppose you want
longest :: [[a]] -> [a]
If you ask for maximumOn length [replicate (10^9) (), []], then you'll end up calculating the length of a very long list unnecessarily. There are several ways to work around this problem, but here's how I'd do it:
data MS a = MS
{ _longest :: [a]
, _longest_suffix :: [a]
, _longest_bound :: !Int }
We will ensure that longest is the first of the longest strings seen thus far, and that longest_bound + length longest_suffix = length longest.
step :: MS a -> [a] -> MS a
step (MS longest longest_suffix longest_bound) xs =
go longest_bound longest_suffix xs'
where
-- the new list is not longer
go n suffo [] = MS longest suffo n
-- the new list is longer
go n [] suffn = MS xs suffn n
-- don't know yet
go !n (_ : suffo) (_ : suffn) =
go (n + 1) suffo suffn
xs' = drop longest_bound xs
longest :: [[a]] -> [a]
longest = _longest . foldl' step (MS [] [] 0)
Now if the second to longest list has q elements, we'll walk at most q conses into each list. This is the best possible complexity. Of course, it's only significantly better than the maximumOn solution when the longest list is much longer than the second to longest.

Haskell - how to avoid messing pure with IO

I am implementing some algorithm on haskell. This algorithm requires generating some data.
I have a function of an algorithm which takes generation function as a parameter. For example, algorithm is just multiplying input data by n:
algo :: a -> ??? -> [a]
algo n dgf = map (\x -> x * n) $ dgf
dgf is used to generate data. How to write function header correctly, as dgf can be any function with any number of parameters?
Another variant is accepting not the generation function but already generated data.
algo :: a -> [b] -> [a]
algo n d = (\x -> n*x) d
So, now let's imagine I'm generation data with stdGen, which uses IO. How can I make function more generic, so that it could accept both IO instance and plain values like just [1,2,3]. This also relates to variant with function, as it can also produce IO.
All in all, which solution is better - having a generation function or a pre-generated data?
Thanks in advance.
One option is to take a stream rather than a list. If generating the values involves performing IO, and there may be many many values, this is often the best approach. There are several packages that offer streams of some sort, but I'll use the streaming package in this example.
import qualified Streaming.Prelude as S
import Streaming
algo :: Monad m => a -> Stream (Of a) m r -> Stream (Of a) m r
algo a = S.map (a +)
You can read Stream (Of a) m r as "a way to use operations in m to produce successive values of type a and finally a result of type r". This algo function doesn't commit to any particular way of generating the data; they can be created purely:
algo a (S.each [these, are, my, elements])
or within IO,
algo a $ S.takeWhile (> 3) (S.readLn :: Stream (Of Int) IO ())
or using a randomness monad, or whatever you like.
For contrast, I'm going to take the opposite approach as dfeuer's answer.
Just use lists.
Consider your first example:
algo :: a -> ??? -> [a]
algo n dgf = map (\x -> x * n) $ dgf
You ask "How to write function header correctly, as dgf can be any function with any number of parameters?"
Well, one way is to use uncurrying.
Normally, Haskell functions are curried. If we have a function like
add :: Int -> Int -> Int
add x y = x + y
And we want a function that adds two to its input we can just use add 2.
>>> map (add 2) [1..10]
[3,4,5,6,7,8,9,10,11,12]
Because add is not actually a function that takes two arguments,
it's a function of one argument that returns a function of one argument.
We could have added parentheses to the argument of add above to make this more clear:
add :: Int -> (Int -> Int)
In Haskell, all functions are functions of one argument.
However, we can also go the other way - uncurry a function
that returns a function to get a function that takes a pair:
>>> :t uncurry
uncurry :: (a -> b -> c) -> (a, b) -> c
>>> :t uncurry add
uncurry add :: (Int, Int) -> Int
This can also be useful, say if we want to find the sum of each pair in a list:
>>> map (uncurry add) [ (1,2), (3,4), (5,6), (7,8), (9,10) ]
[3,7,11,15,19]
In general, we can uncurry any function of type a0-> a1 -> ... -> aN -> b
into a function (a0, a1, ..., aN) -> b, though there might not be
a cute library function to do it for us.
With that in mind, we could implement algo by passing it an uncurried
function and a tuple of values:
algo :: Num a => a -> (t -> [a]) -> t -> [a]
algo n f t = map (\x -> x * n) $ f t
And then use anonymous functions to uncurry our argument functions:
>>> algo 2 (\(lo,hi) -> enumFromTo lo hi) (5, 10)
[10,12,14,16,18,20]
>>> algo 3 (\(a,b,c,d) -> zipWith (+) [a..b] [c..d]) (1, 5, 10, 14)
[33,39,45,51,57]
Now we could do it this way, but we don't need to. As implemented above,
algo is only using f and t once. So why not pass it the list directly?
algo' :: Num a => a -> [a] -> [a]
algo' n ns = map (\x -> x * n) ns
It calculates the same results:
>>> algo' 2 $ (\(lo,hi) -> enumFromTo lo hi) (5, 10)
[10,12,14,16,18,20]
>>> algo' 2 $ enumFromTo 5 10
[10,12,14,16,18,20]
>>> algo' 3 $ (\(a,b,c,d) -> zipWith (+) [a..b] [c..d]) (1, 5, 10, 14)
[33,39,45,51,57]
>>> algo' 3 $ zipWith (+) [1..5] [10..14]
[33,39,45,51,57]
Furthermore, since haskell is non-strict, the argument to algo' isn't evaluated
until it's actually used, so we don't have to worry about "wasting" time computing
arguments that won't actually be used:
algo'' :: Num a => a -> [a] -> [a]
algo'' n ns = [n,n,n,n]
algo'' doesn't use the list passed to it, so it's never forced, so whatever
computation is used to calculate it never runs:
>>> let isPrime n = n > 2 && null [ i | i <- [2..n-1], n `rem` i == 0 ]
>>> :set +s
>>> isPrime 10000019
True
(6.18 secs, 2,000,067,648 bytes)
>>> algo'' 5 (filter isPrime [1..999999999999999])
[5,5,5,5]
(0.01 secs, 68,936 bytes)
Now to the second part of your question - what if your data is being generated within some monad?
Rather than convince algo to operate on monadic values, you could take the stream
based approach as dfeuer explains. Or you could just use a list.
Just because you're in a monad, doesn't mean that your values suddenly become strict.
For example, want a infinite list of random numbers? No problem.
newRandoms :: Num a -> IO [a]
newRandoms = unfoldr (\g -> Just (random g)) <$> newStdGen
Now I can just pass those to some algorithm:
>>> rints <- newRandoms :: IO [Int]
(0.00 secs, 60,624 bytes)
>>> algo'' 5 rints
[5,5,5,5]
(0.00 secs, 68,920 bytes)
For a small program which is just reading input from a file or two, there's no problem
with just using readFile and lazy I/O to get a list to operate on.
For example
>>> let grep pat lines = [ line | line <- lines, pat `isInfixOf` line ]
>>> :set +s
>>> dict <- lines <$> readFile "/usr/share/dict/words"
(0.01 secs, 81,504 bytes)
>>> grep "poop" dict
["apoop","epoophoron","nincompoop","nincompoopery","nincompoophood","nincompoopish","poop","pooped","poophyte","poophytic","whisterpoop"]
(0.72 secs, 423,650,152 bytes)

sort a list of numbers by their 'visual similarity'

consider a function, which rates the level of 'visual similarity' between two numbers: 666666 and 666166 would be very similar, unlike 666666 and 111111
type N = Int
type Rate = Int
similar :: N -> N -> Rate
similar a b = length . filter id . zipWith (==) a' $ b'
where a' = show a
b' = show b
similar 666666 666166
--> 5
-- high rate : very similar
similar 666666 111111
--> 0
-- low rate : not similar
There will be more sophisticated implementations for this, however this serves the purpose.
The intention is to find a function that sorts a given list of N's, so that each item is the most similar one to it's preceding item. Since the first item does not have a predecessor, there must be a given first N.
similarSort :: N -> [N] -> [N]
Let's look at some sample data: They don't need to have the same arity but it makes it easier to reason about it.
sample :: [N]
sample = [2234, 8881, 1222, 8888, 8822, 2221, 5428]
one could be tempted to implement the function like so:
similarSortWrong x xs = reverse . sortWith (similar x) $ xs
but this would lead to a wrong result:
similarSortWrong 2222 sample
--> [2221,1222,8822,2234,5428,8888,8881]
In the beginning it looks correct, but it's obvious that 8822 should rather be followed by 8881, since it's more similar that 2234.
So here's the implementation I came up with:
similarSort _ [] = []
similarSort x xs = x : similarSort a as
where (a:as) = reverse . sortWith (similar x) $ xs
similarSort 2222 sample
--> [2222,2221,2234,1222,8822,8888,8881]
It seems to work. but it also seems to do lot more more work than necessary. Every step the whole rest is sorted again, just to pick up the first element. Usually lazyness should allow this, but reverse might break this again. I'd be keen to hear, if someone know if there's a common abstraction for this problem.
It's relatively straightforward to implement the greedy algorithm you ask for. Let's start with some boilerplate; we'll use the these package for a zip-like that hands us the "unused" tail ends of zipped-together lists:
import Data.Align
import Data.These
sampleStart = "2222"
sampleNeighbors = ["2234", "8881", "1222", "8888", "8822", "2221", "5428"]
Instead of using numbers, I'll use lists of digits -- just so we don't have to litter the code with conversions between the form that's convenient for the user and the form that's convenient for the algorithm. You've been a bit fuzzy about how to rate the similarity of two digit strings, so let's make it as concrete as possible: any digits that differ cost 1, and if the digit strings vary in length we have to pay 1 for each extension to the right. Thus:
distance :: Eq a => [a] -> [a] -> Int
distance l r = sum $ alignWith elemDistance l r where
elemDistance (These l r) | l == r = 0
elemDistance _ = 1
A handy helper function will pick the smallest element of some list (by a user-specified measure) and return the rest of the list in some implementation-defined order.
minRestOn :: Ord b => (a -> b) -> [a] -> Maybe (a, [a])
minRestOn f [] = Nothing
minRestOn f (x:xs) = Just (go x [] xs) where
go min rest [] = (min, rest)
go min rest (x:xs) = if f x < f min
then go x (min:rest) xs
else go min (x:rest) xs
Now the greedy algorithm almost writes itself:
greedy :: Eq a => [a] -> [[a]] -> [[a]]
greedy here neighbors = here : case minRestOn (distance here) neighbors of
Nothing -> []
Just (min, rest) -> greedy min rest
We can try it out on your sample:
> greedy sampleStart sampleNeighbors
["2222","1222","2221","2234","5428","8888","8881","8822"]
Just eyeballing it, that seems to do okay. However, as with many greedy algorithms, this one only minimizes the local cost of each edge in the path. If you want to minimize the total cost of the path found, you need to use another algorithm. For example, we can pull in the astar package. For simplicity, I'm going to do everything in a very inefficient way, but it's not too hard to do it "right". We'll need a fair chunk more imports:
import Data.Graph.AStar
import Data.Hashable
import Data.List
import Data.Maybe
import qualified Data.HashSet as HS
Unlike before, where we only wanted the nearest neighbor, we'll now want all the neighbors. (Actually, we could probably implement the previous use of minRestOn using the following function and minimumOn or something. Give it a try if you're interested!)
neighbors :: (a, [a]) -> [(a, [a])]
neighbors (_, xs) = go [] xs where
go ls [] = []
go ls (r:rs) = (r, ls ++ rs) : go (r:ls) rs
We can now call the aStar search method with appropriate parameters. We'll use ([a], [[a]]) -- representing the current list of digits and the remaining lists that we can choose from -- as our node type. The arguments to aStar are then, in order: the function for finding neighboring nodes, the function for computing distance between neighboring nodes, the heuristic for how far we have left to go (we'll just say 1 for each unique element in the list), whether we've reached a goal node, and the initial node to start the search from. We'll call fromJust, but it should be okay: all nodes have at least one path to a goal node, just by choosing the remaining lists of digits in order.
optimal :: (Eq a, Ord a, Hashable a) => [a] -> [[a]] -> [[a]]
optimal here elsewhere = (here:) . map fst . fromJust $ aStar
(HS.fromList . neighbors)
(\(x, _) (y, _) -> distance x y)
(\(x, xs) -> HS.size (HS.fromList (x:xs)) - 1)
(\(_, xs) -> null xs)
(here, elsewhere)
Let's see it run in ghci:
> optimal sampleStart sampleNeighbors
["2222","1222","8822","8881","8888","5428","2221","2234"]
We can see that it's done better this time by adding a pathLength function that computes all the distances between neighbors in a result.
pathLength :: Eq a => [[a]] -> Int
pathLength xs = sum [distance x y | x:y:_ <- tails xs]
In ghci:
> pathLength (greedy sampleStart sampleNeighbors)
15
> pathLength (optimal sampleStart sampleNeighbors)
14
In this particular example, I think the greedy algorithm could have found the optimal path if it had made the "right" choices whenever there were ties for minimal next step; but I expect it is not too hard to cook up an example where the greedy algorithm is forced into bad early choices.

How to fix space leak caused by laziness when your algorithm relies on laziness

I have some algorithm which generates a search tree:
data SearchTree a = Solution a | Contradiction | Search [ SearchTree a ]
deriving (Show, Functor)
The algorithm generates this tree lazily. I also defined a simple evaluator, which is really just depth first search.
simpleEval :: MonadPlus m => SearchTree a -> m a
simpleEval (Solution a) = return a
simpleEval Contradiction = mzero
simpleEval (Search ps) = foldr mplus mzero $ map simpleEval ps
I noticed that many of the solutions that my algorithm produces look something like the following search tree:
nest :: Int -> SearchTree a -> SearchTree a
nest 0 = id
nest n = nest (n-1) . Search . (:[])
tree0 = Search ts where
ts = cycle $ t0 : replicate 100 t1 ++ [t2]
t0 = nest 100 $ Solution 'a'
t1 = nest 1000 $ Contradiction
t2 = nest 4 $ Solution 'b'
Namely, they have a lot of very deep branches with no solutions, a few deep branches with a solution, and very few shallow branches with a solution. On this basis, I decided I wanted another evaluator, one which will 'give up' on branches that are too deep. Call it cutoffEval. cutoffEval 5 tree0 should find only bs because it there are infinitely many branches of depth less than 5 to consider, and they only contain bs. I implemented it like so:
cutoff :: (MonadPlus m) => Int -> SearchTree a -> (m a, [SearchTree a])
cutoff cu = go cu where
plus ~(m0, l0) ~(m1, l1) = (mplus m0 m1, l0 ++ l1)
zero = (mzero, [])
go 0 x = (mzero, [x])
go _ Contradiction = zero
go _ (Solution a) = (return a, [])
go d (Search ps) = foldr plus zero $ map (go $ d-1) ps
cutoffEval :: MonadPlus m => Int -> SearchTree a -> m a
cutoffEval cu = go where
go t = case cutoff cu t of (r,ts) -> foldr mplus mzero $ r : map go ts
But this function produces a huge space leak, as compared to simpleEval:
putStrLn $ take 4000 $ simpleEval tree0 -- 2MB residency
putStrLn $ take 4000 $ cutoffEval 10 tree0 -- 600MB residency
Profiling reveals that almost all allocation occurs in cutoff.go; and the majority of allocation is due something mysterious called main:Tree.sat_s5jg and the (,) constructor. This seems to me due to the irrefutable patterns, the tuple constructors are being built up as thunks instead of being forced by plus; and normally the solution to a space leak is to make your function stricter, but here removing the irrefutable patterns causes cutoff to hang, so I cannot do it.
I tested this with GHC 7.6, 7.8 and 7.10. The problem was found in each one.
So my questions are: Can cutoffEval be written to run in constant space like simpleEval? And more generally, how do I fix a space leak, if I can't make my implementation stricter because the algorithm depends on it?
It seems to me that the reason for the memory leak is actually a bug in the implementation. Your cutoff function mixes together cutting off too deep branches with evaluating the upper part. And then in cutoffEval, you go deeper to the bottom, cut branches, and continue to explore them recursively. Which is essentially breadth-first search, going by cu levels in each pass. This means that the whole tree will eventually be constructed and retained in memory until the end! (Unlike the case of a depth-first search, where visited subtrees can be reclaimed by the GC.)
If you want to just cut off the branches that are too deep, getting the first part of the result of cutoff is what you want.
In any case I'd suggest to separate the evaluator and the cut-off part (see below). In such a case you can just use the original evaluator on the cut-off version of the tree.
One additional remark, from the MonadPlus constraint you use just the monoidal part - mzero and mplus. It'd be cleaner and more generic to use just Monoid. There are more monoids than monads (for example Sum to just count solutoins, or Last to find the last solution).
simpleEval :: (Monoid m) => (a -> m) -> SearchTree a -> m
simpleEval f = go
where
go (Solution a) = f a
go Contradiction = mempty
go (Search ps) = mconcat $ map go ps
cutoff :: Int -> SearchTree a -> SearchTree a
cutoff cu = go cu
where
go 0 _ = Contradiction -- too deep branches are just failures
go d (Search ps) = Search $ map (go (d - 1)) ps
go _ x = x
cutoffEval :: (Monoid m) => Int -> (a -> m) -> SearchTree a -> m
cutoffEval cu f = simpleEval f . cutoff cu

Resources