I've just read about binary search trees in the "Learn You a Haskell" book, and I'm wondering whether it is effective to search for more than one element using such a tree. For example, suppose I have a bunch of objects where every object has some index, and they are stored in a tree like this:
      5
    /   \
   3     7
  / \   / \
 1   4 6   8
If I need to find an element by index 8, I need to do only three steps, 5 -> 7 -> 8, instead of iterating over the whole list until the end. But what if I need to find several objects, say 1, 4, 6, 8? It seems like I'd need to repeat the same action for each element: 5 -> 3 -> 1, 5 -> 3 -> 4, 5 -> 7 -> 6 and 5 -> 7 -> 8.
So my question is: does it still make sense to use a binary search tree for finding more than one element? Could it be better than checking each element against a condition (which is only O(n) in the worst case)?
Also, what kind of data structure is better to use if I need to check more than one attribute? E.g. in the example above, I was looking only at the id attribute, but what if I also need to search by name, or color, etc.?
You can share some of the work. See members, which takes in a list of values and outputs a list of exactly those values of the input list that are in the tree. Note: The order of the input list is not preserved in the output list.
EDIT: I'm actually not sure if you can get better performance (from a theoretical standpoint) with members over doing map member. I think that if the input list is sorted, then splitting the list in three (lts, eqs, gts) could be done easily.
data BinTree a
    = Branch (BinTree a) a (BinTree a)
    | Leaf
    deriving (Show, Eq, Ord)

empty :: BinTree a
empty = Leaf

singleton :: a -> BinTree a
singleton x = Branch Leaf x Leaf

add :: (Ord a) => a -> BinTree a -> BinTree a
add x Leaf = singleton x
add x tree@(Branch left y right) = case compare x y of
    EQ -> tree
    LT -> Branch (add x left) y right
    GT -> Branch left y (add x right)

member :: (Ord a) => a -> BinTree a -> Bool
member x Leaf = False
member x (Branch left y right) = case compare x y of
    EQ -> True
    LT -> member x left
    GT -> member x right

members :: (Ord a) => [a] -> BinTree a -> [a]
members xs Leaf = []
members xs (Branch left y right) = eqs ++ members lts left ++ members gts right
  where
    -- tag every query with its ordering relative to the node value
    comps = map (\x -> (compare x y, x)) xs
    grab ordering = map snd . filter ((ordering ==) . fst)
    eqs = grab EQ comps
    lts = grab LT comps
    gts = grab GT comps
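The sorted-input idea from the EDIT above could look roughly like the following. This is only a sketch: the name membersSorted is made up here, it assumes the query list is sorted in ascending order, and it makes no claim about being asymptotically better than map member. The fromList helper is also an assumption, added only to keep the block self-contained.

```haskell
data BinTree a = Branch (BinTree a) a (BinTree a) | Leaf

-- Plain unbalanced insert, same as `add` above.
add :: Ord a => a -> BinTree a -> BinTree a
add x Leaf = Branch Leaf x Leaf
add x tree@(Branch left y right) = case compare x y of
    EQ -> tree
    LT -> Branch (add x left) y right
    GT -> Branch left y (add x right)

-- Hypothetical helper: build a tree by inserting left to right.
fromList :: Ord a => [a] -> BinTree a
fromList = foldl (flip add) Leaf

-- Assumes xs is sorted ascending. Because of that, the queries that
-- belong to the left subtree form a prefix of xs, so `span` is enough
-- to split them off; no tagging and filtering is needed.
membersSorted :: Ord a => [a] -> BinTree a -> [a]
membersSorted [] _    = []
membersSorted _  Leaf = []
membersSorted xs (Branch left y right) =
    membersSorted lts left ++ eqs ++ membersSorted gts right
  where
    (lts, rest) = span (< y) xs
    (eqs, gts)  = span (== y) rest
```

Unlike members, this variant returns the hits in sorted order, and it stops descending as soon as no queries are left for a subtree.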
A quite acceptable solution when searching for multiple elements is to search for them one at a time with the most efficient algorithm (which is O(log n) per lookup in your case). However, it can be quite advantageous to step through the entire tree and pool all the elements that match a certain condition; it really depends on where and how often you search inside your code. If you only search at one point in your code, it would make sense to collect all the elements in the tree in one shot instead of searching for them one by one. If you opt for that solution, then you could feasibly use other data structures, such as a list.
If you need to check for multiple attributes I suggest replacing "id" with a tuple containing all the different possible identifiers (id, color, ...). You can then unpack the tuple and compare whichever identifiers you want.
Assuming your binary tree is balanced, if you have a constant number k of search items, then k searches with a total time of O(k * log(n)) is still better than a single O(n) scan, where at each element you still have to do k comparisons, making it O(k * n). Even if the list of search items is sorted, and you can binary search in O(log(k)) time to see if your current item is a match, you're still at O(n * log(k)), which is worse than the tree unless k is Theta(n).
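To make the two strategies concrete, here is a small sketch that uses Data.Set from the containers package as the balanced tree (a stand-in for the hand-written tree, not something the question used). The names lookupAll and scanAll are invented for this illustration.

```haskell
import qualified Data.Set as Set

-- k lookups in a balanced tree of n elements: O(k * log n) in total.
lookupAll :: Ord a => [a] -> Set.Set a -> [a]
lookupAll queries tree = filter (`Set.member` tree) queries

-- One pass over all n items, checking each against a set built from
-- the k queries: O(n * log k) in total.
scanAll :: Ord a => [a] -> [a] -> [a]
scanAll queries items = filter (`Set.member` wanted) items
  where
    wanted = Set.fromList queries
```

Both return the same elements (possibly in a different order); the difference is only whether the work scales with k or with n.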
No.
A single search is O(log n). 4 searches is O(4 log n). A linear search, which would pick up all items, is O(n). The tree structure of a btree means finding more than one datum requires a walk (which is actually worse than a list walk).
Related
I have a function that takes an integer and returns a list of integers.
How do I efficiently map this function to an initial integer, then, for each item of the resulting list that has not been previously mapped, apply the same function, essentially generating an infinite list?
E.g.
f :: Int -> [Int]
f 0 = [1,2]++(f 1)++(f 2)
Additionally, I need to be able to index the resulting list up to 10E10. How would this be optimised? Memoization?
You want a breadth-first search. The basic idiom goes like this:
bfs :: (a -> [a]) -> [a] -> [a]
bfs f xs = xs ++ bfs f (concatMap f xs)
Notice how we keep the current "state" in the argument xs, output it and then recursively call with a new state which is f applied to each element of the input state.
If you want to filter out elements you have already seen, you need to also pass along some extra state keeping track of which elements you've seen, e.g. a Data.Set, and adjust the algorithm accordingly. I'll leave that bit to you because I'm an irritating pedagogue.
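For readers who want to check their attempt, one possible shape of that adjustment is sketched below. It is not the answerer's code: the name bfsUnique is made up, the seen set comes from Data.Set in the containers package, and the queue is a plain list, so appending to it is linear. It is meant to show the idea rather than to be fast enough for indexing up to 10E10.

```haskell
import qualified Data.Set as Set

-- Breadth-first expansion that emits every element at most once.
-- `seen` holds everything emitted so far; the list is the work queue.
bfsUnique :: Ord a => (a -> [a]) -> [a] -> [a]
bfsUnique f = go Set.empty
  where
    go _ [] = []
    go seen (x : queue)
      | x `Set.member` seen = go seen queue
      | otherwise           = x : go (Set.insert x seen) (queue ++ f x)
```

The result is produced lazily, so take n works on it even when the expansion never ends; unlike bfs above, it also terminates when the reachable set is finite.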
Consider a function which rates the level of 'visual similarity' between two numbers: 666666 and 666166 would be very similar, unlike 666666 and 111111.
type N = Int
type Rate = Int
similar :: N -> N -> Rate
similar a b = length . filter id . zipWith (==) a' $ b'
  where a' = show a
        b' = show b
similar 666666 666166
--> 5
-- high rate : very similar
similar 666666 111111
--> 0
-- low rate : not similar
There will be more sophisticated implementations for this, however this serves the purpose.
The intention is to find a function that sorts a given list of N's, so that each item is the one most similar to its preceding item. Since the first item does not have a predecessor, there must be a given first N.
similarSort :: N -> [N] -> [N]
Let's look at some sample data. The numbers don't need to have the same number of digits, but it makes it easier to reason about.
sample :: [N]
sample = [2234, 8881, 1222, 8888, 8822, 2221, 5428]
One could be tempted to implement the function like so:
similarSortWrong x xs = reverse . sortWith (similar x) $ xs
but this would lead to a wrong result:
similarSortWrong 2222 sample
--> [2221,1222,8822,2234,5428,8888,8881]
In the beginning it looks correct, but it's obvious that 8822 should rather be followed by 8881, since it's more similar than 2234.
So here's the implementation I came up with:
similarSort _ [] = []
similarSort x xs = x : similarSort a as
  where (a:as) = reverse . sortWith (similar x) $ xs
similarSort 2222 sample
--> [2222,2221,2234,1222,8822,8888,8881]
It seems to work, but it also seems to do a lot more work than necessary. At every step the whole rest is sorted again, just to pick out the first element. Usually laziness should allow this, but reverse might break this again. I'd be keen to hear if someone knows of a common abstraction for this problem.
It's relatively straightforward to implement the greedy algorithm you ask for. Let's start with some boilerplate; we'll use the these package for a zip-like that hands us the "unused" tail ends of zipped-together lists:
import Data.Align
import Data.These
sampleStart = "2222"
sampleNeighbors = ["2234", "8881", "1222", "8888", "8822", "2221", "5428"]
Instead of using numbers, I'll use lists of digits -- just so we don't have to litter the code with conversions between the form that's convenient for the user and the form that's convenient for the algorithm. You've been a bit fuzzy about how to rate the similarity of two digit strings, so let's make it as concrete as possible: any digits that differ cost 1, and if the digit strings vary in length we have to pay 1 for each extension to the right. Thus:
distance :: Eq a => [a] -> [a] -> Int
distance l r = sum $ alignWith elemDistance l r where
    elemDistance (These l r) | l == r = 0
    elemDistance _ = 1
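If pulling in an extra package is not an option, the same cost can be computed with the Prelude alone. The following is a sketch, not part of the original answer; distance' is a made-up name, and it is intended to agree with the rule just described (1 per differing position, 1 per extra element of the longer list).

```haskell
-- Mismatches over the common prefix length, plus the length difference.
distance' :: Eq a => [a] -> [a] -> Int
distance' l r = mismatches + extra
  where
    mismatches = length (filter id (zipWith (/=) l r))
    extra      = abs (length l - length r)
```

For example, distance' "2222" "2234" counts the two differing trailing digits, and distance' "22" "2234" counts the two digits that only the longer string has.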
A handy helper function will pick the smallest element of some list (by a user-specified measure) and return the rest of the list in some implementation-defined order.
minRestOn :: Ord b => (a -> b) -> [a] -> Maybe (a, [a])
minRestOn f [] = Nothing
minRestOn f (x:xs) = Just (go x [] xs) where
    go min rest [] = (min, rest)
    go min rest (x:xs) = if f x < f min
        then go x (min:rest) xs
        else go min (x:rest) xs
Now the greedy algorithm almost writes itself:
greedy :: Eq a => [a] -> [[a]] -> [[a]]
greedy here neighbors = here : case minRestOn (distance here) neighbors of
    Nothing -> []
    Just (min, rest) -> greedy min rest
We can try it out on your sample:
> greedy sampleStart sampleNeighbors
["2222","1222","2221","2234","5428","8888","8881","8822"]
Just eyeballing it, that seems to do okay. However, as with many greedy algorithms, this one only minimizes the local cost of each edge in the path. If you want to minimize the total cost of the path found, you need to use another algorithm. For example, we can pull in the astar package. For simplicity, I'm going to do everything in a very inefficient way, but it's not too hard to do it "right". We'll need a fair chunk more imports:
import Data.Graph.AStar
import Data.Hashable
import Data.List
import Data.Maybe
import qualified Data.HashSet as HS
Unlike before, where we only wanted the nearest neighbor, we'll now want all the neighbors. (Actually, we could probably implement the previous use of minRestOn using the following function and minimumOn or something. Give it a try if you're interested!)
neighbors :: (a, [a]) -> [(a, [a])]
neighbors (_, xs) = go [] xs where
    go ls [] = []
    go ls (r:rs) = (r, ls ++ rs) : go (r:ls) rs
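As for the exercise suggested above, one way it might go is sketched here. The names picks and minRestOn' are invented for this sketch; picks is the same traversal as neighbors without the ignored first component, and since base has no minimumOn, the sketch uses minimumBy with comparing instead.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Every way of picking one element, paired with the remaining elements
-- (the remainder comes back in an unspecified order, as with neighbors).
picks :: [a] -> [(a, [a])]
picks = go []
  where
    go _  []       = []
    go ls (r : rs) = (r, ls ++ rs) : go (r : ls) rs

-- Choose the pick whose element is smallest under the measure f.
minRestOn' :: Ord b => (a -> b) -> [a] -> Maybe (a, [a])
minRestOn' _ [] = Nothing
minRestOn' f xs = Just (minimumBy (comparing (f . fst)) (picks xs))
```

This is quadratic in the list length because every candidate carries its own copy of the rest, so the hand-written minRestOn remains the better choice for long lists.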
We can now call the aStar search method with appropriate parameters. We'll use ([a], [[a]]) -- representing the current list of digits and the remaining lists that we can choose from -- as our node type. The arguments to aStar are then, in order: the function for finding neighboring nodes, the function for computing distance between neighboring nodes, the heuristic for how far we have left to go (we'll just say 1 for each unique element in the list), whether we've reached a goal node, and the initial node to start the search from. We'll call fromJust, but it should be okay: all nodes have at least one path to a goal node, just by choosing the remaining lists of digits in order.
optimal :: (Eq a, Ord a, Hashable a) => [a] -> [[a]] -> [[a]]
optimal here elsewhere = (here:) . map fst . fromJust $ aStar
    (HS.fromList . neighbors)
    (\(x, _) (y, _) -> distance x y)
    (\(x, xs) -> HS.size (HS.fromList (x:xs)) - 1)
    (\(_, xs) -> null xs)
    (here, elsewhere)
Let's see it run in ghci:
> optimal sampleStart sampleNeighbors
["2222","1222","8822","8881","8888","5428","2221","2234"]
We can see that it's done better this time by adding a pathLength function that computes all the distances between neighbors in a result.
pathLength :: Eq a => [[a]] -> Int
pathLength xs = sum [distance x y | x:y:_ <- tails xs]
In ghci:
> pathLength (greedy sampleStart sampleNeighbors)
15
> pathLength (optimal sampleStart sampleNeighbors)
14
In this particular example, I think the greedy algorithm could have found the optimal path if it had made the "right" choices whenever there were ties for minimal next step; but I expect it is not too hard to cook up an example where the greedy algorithm is forced into bad early choices.
While reading a snippet from Learn You a Haskell for Great Good I found the following situation:
treeInsert :: (Ord a) => a -> Tree a -> Tree a
treeInsert x EmptyTree = singleton x
treeInsert x (Node a left right)
    | x == a = Node x left right
    | x < a = Node a (treeInsert x left) right
    | x > a = Node a left (treeInsert x right)
Wouldn't it be better for performance if we just reused the given Tree when x == a?
treeInsert :: (Ord a) => a -> Tree a -> Tree a
treeInsert x EmptyTree = singleton x
treeInsert x all@(Node a left right)
    | x == a = all
    | x < a = Node a (treeInsert x left) right
    | otherwise = Node a left (treeInsert x right)
In real life coding, what should I do? Are there any drawbacks when returning the same thing?
Let's look at the core! (Without optimisations here)
$ ghc-7.8.2 -ddump-simpl wtmpf-file13495.hs
The relevant difference is that the first version (without all@(...)) has
case GHC.Classes.> @ a_aUH $dOrd_aUV eta_B2 a1_aBQ
of _ [Occ=Dead] {
  GHC.Types.False ->
    Control.Exception.Base.patError
      @ (TreeInsert.Tree a_aUH)
      "wtmpf-file13495.hs:(9,1)-(13,45)|function treeInsert"#;
  GHC.Types.True ->
    TreeInsert.Node
      @ a_aUH
      a1_aBQ
      left_aBR
      (TreeInsert.treeInsert @ a_aUH $dOrd_aUV eta_B2 right_aBS)
where reusing the node with that as-pattern does just
TreeInsert.Node
  @ a_aUI
  a1_aBR
  left_aBS
  (TreeInsert.treeInsert @ a_aUI $dOrd_aUW eta_B2 right_aBT);
This is an avoided check that may well make a significant performance difference.
However, this difference has actually nothing to do with the as-pattern. It's just because your first snippet uses a x > a guard, which is not trivial. The second uses otherwise, which is optimised away.
If you change the first snippet to
treeInsert :: (Ord a) => a -> Tree a -> Tree a
treeInsert x EmptyTree = singleton x
treeInsert x (Node a left right)
    | x == a = Node x left right
    | x < a = Node a (treeInsert x left) right
    | otherwise = Node a left (treeInsert x right)
then the difference boils down to
GHC.Types.True -> TreeInsert.Node @ a_aUH a1_aBQ left_aBR right_aBS
vs
GHC.Types.True -> wild_Xa
Which is indeed just the difference of Node x left right vs all.
...without optimisations, that is. The versions diverge further when I turn on -O2. But I can't really make out how the performance would differ, there.
In real life coding, what should I do? Are there any drawbacks when returning the same thing?
a == b does not guarantee that f a == f b for all functions f. So, you may have to return a new object even if they compare equal.
In other words, it may not be feasible to change Node x left right to Node a left right or all when a == x, regardless of performance gains.
For example, you may have types which carry meta data. When you compare them for equality, you may only care about the values and ignore the meta data. But if you replace them just because they compare equal, then you will lose the meta data.
newtype ValMeta a b = ValMeta (a, b) -- value, along with meta data
    deriving (Show)
instance Eq a => Eq (ValMeta a b) where
    -- equality only compares values, ignores meta data
    ValMeta (a, b) == ValMeta (a', b') = a == a'
The point is Eq type-class only says that you may compare values for equality. It does not guarantee anything beyond that.
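A small check with this type shows the effect. The two values and the name equalButDistinguishable are made up for this sketch; show plays the role of the function f.

```haskell
newtype ValMeta a b = ValMeta (a, b) -- value, along with meta data
    deriving (Show)

instance Eq a => Eq (ValMeta a b) where
    -- equality only compares values, ignores meta data
    ValMeta (a, _) == ValMeta (a', _) = a == a'

-- Two entries with the same value but different meta data.
old, new :: ValMeta Int String
old = ValMeta (1, "inserted first")
new = ValMeta (1, "inserted later")

-- The values compare equal, yet `show` can tell them apart.
equalButDistinguishable :: Bool
equalButDistinguishable = old == new && show old /= show new
```

With treeInsert, returning the existing node on x == a would keep old in the tree, while rebuilding the node with x would store new; which of the two is wanted depends on the application.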
A real-world example where a == b does not guarantee that f a == f b is when you maintain a Set of unique values within a self-balancing tree. A self-balancing tree (such as Red-Black tree) has some guarantees about the structure of tree but the actual depth and structure depends on the order that the data were added to or removed from the set.
Now when you compare 2 sets for equality, you want to compare that values within the set are equal, not that the underlying trees have the same exact structure. But if you have a function such as depth which exposes the depth of underlying tree maintaining the set then you cannot guarantee that the depths are equal even if the sets compare equal.
Here is a video of the great Philip Wadler realizing live and on-stage that many useful relations do not preserve equality (starting at 42min).
Edit: Example from ghci where a == b does not imply f a == f b:
> import Data.Set
> let a = fromList [1, 2, 3, 4, 5, 10, 9, 8, 7, 6]
> let b = fromList [1..10]
> let f = showTree
> a == b
True
> f a == f b
False
Another real-world example is hash-table. Two hash-tables are equal if and only if their key-value pairs tie out. However, the capacity of a hash-table, i.e. the number of keys you may add before having to re-allocate and rehash, depends on the order of inserts/deletes.
So if you have a function which returns the capacity of hash table, it may return different values for hash-tables a and b even though a == b.
My two cents... perhaps not even about the original question:
Instead of writing guards with x < a and x == a, I would match compare x a against LT, EQ and GT, e.g.:
treeInsert x all@(Node a left right) =
    case compare x a of
        EQ -> ...
        LT -> ...
        GT -> ...
I would do this especially if x and a can be complex data structures, since a test like x < a could be expensive.
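Filled in, that outline could look like the sketch below. The Tree type is assumed to be the one from the book (EmptyTree and Node a left right), and the as-pattern is named node here only to avoid shadowing Prelude's all.

```haskell
-- Assumed to match the book's definition.
data Tree a = EmptyTree | Node a (Tree a) (Tree a)
    deriving (Show, Eq)

treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x EmptyTree = Node x EmptyTree EmptyTree
treeInsert x node@(Node a left right) =
    -- a single comparison decides among all three cases
    case compare x a of
        EQ -> node
        LT -> Node a (treeInsert x left) right
        GT -> Node a left (treeInsert x right)
```

Besides doing one comparison per node instead of up to three, the case expression is checked for exhaustiveness by the compiler, so no fall-through pattern error can be generated.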
This answer seems to be wrong. I'll just leave it here, for reference...
With your second function you avoid creating a new node, because the compiler cannot really understand equality (== is just some function.) If you change the first version to
-- version C
treeInsert :: (Ord a) => a -> Tree a -> Tree a
treeInsert x EmptyTree = singleton x
treeInsert x (Node a left right)
    | x == a = Node a left right -- Difference here! Changed x to a.
    | x < a = Node a (treeInsert x left) right
    | x > a = Node a left (treeInsert x right)
the compiler will probably be able to do common subexpression elimination, because the optimizer will be able to see that Node a left right is the same as Node a left right.
On the other hand, I doubt that the compiler can deduce from a == x that Node a left right is the same as Node x left right.
So, I'm pretty sure that under -O2, version B and version C are the same, but version A is probably slower because it does an extra instantiation in the a == x case.
Well, if the first case had used a instead of x as follows, then there's at least the chance that GHC would eliminate the allocation of a new node through common subexpression elimination.
treeInsert x (Node a left right)
    | x == a = Node a left right
However, this is all but irrelevant in any non-trivial use case, because the path down the tree to the node is going to be duplicated even when the element already exists. And this path is going to be significantly longer than a single node unless your use case is trivial.
In the world of ML, the fairly idiomatic way to avoid this is to throw a KeyAlreadyExists exception, and then catch that exception at the top-level insertion function and return the original tree. This would cause the stack to be unwound instead of allocating any of the Nodes on the heap.
A direct implementation of the ML idiom is basically a no-no in Haskell, for good reasons. If avoiding this duplication matters, the simplest and possibly best thing to do is to check if the tree contains the key before you insert it.
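A minimal sketch of check-before-insert, assuming the book's Tree type; insertIfAbsent is a made-up name, and treeElem and treeInsert are written out only so that the block stands on its own.

```haskell
-- Assumed to match the book's definition.
data Tree a = EmptyTree | Node a (Tree a) (Tree a)
    deriving (Show, Eq)

treeElem :: Ord a => a -> Tree a -> Bool
treeElem _ EmptyTree = False
treeElem x (Node a left right) = case compare x a of
    EQ -> True
    LT -> treeElem x left
    GT -> treeElem x right

-- Ordinary insert: rebuilds the path from the root to the insertion point.
treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x EmptyTree = Node x EmptyTree EmptyTree
treeInsert x (Node a left right) = case compare x a of
    EQ -> Node x left right
    LT -> Node a (treeInsert x left) right
    GT -> Node a left (treeInsert x right)

-- Look first; when the key is present, hand back the original tree
-- untouched, so nothing on the path is reallocated.
insertIfAbsent :: Ord a => a -> Tree a -> Tree a
insertIfAbsent x tree
    | treeElem x tree = tree
    | otherwise       = treeInsert x tree
```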
The downside of this approach, compared to a direct Haskell insert or the ML idiom, is that it involves two traversals of the path instead of one. Now, here is a non-duplicating, single-pass insert you can implement in Haskell:
treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x original_tree = result_tree
  where
    (result_tree, new_tree) = loop x original_tree
    loop x EmptyTree = (new_tree, singleton x)
    loop x (Node a left right) =
      case compare x a of
        LT -> let (res, new_left) = loop x left
               in (res, Node a new_left right)
        EQ -> (original_tree, error "unreachable")
        GT -> let (res, new_right) = loop x right
               in (res, Node a left new_right)
However, older versions of GHC (roughly 7-10 years ago) don't handle this sort of recursion through lazy pairs of results very efficiently, and in my experience check-before-insert is likely to perform better. I'd be slightly surprised if this observation has really changed in the context of more recent GHC versions.
One can certainly imagine a function that directly constructs (but does not return) a new path for the tree, and decides to return the new path or the original path once it's known whether the element exists already. (The new path would immediately become garbage if it is not returned.) This conforms to the basic principles of the GHC runtime, but isn't really expressible in the source language.
Of course, any completely non-duplicating insertion function on a lazy data structure is going to have different strictness properties than a simple, duplicating insert. So no matter the implementation technique, they are different functions if laziness matters.
But of course, whether or not the path is duplicated may not matter that much. The cases where it would matter the most would be when you are using the tree persistently, because in linear use cases the old path would become garbage immediately after each insertion. And of course, this only matters when you are inserting a significant number of duplicates.
We have a definition of binary tree:
type 'a tree =
| Node of 'a tree * 'a * 'a tree
| Null;;
And also a helpful function for traversing the tree:
let rec fold_tree f a t =
  match t with
  | Null -> a
  | Node (l, x, r) -> f x (fold_tree f a l) (fold_tree f a r);;
And here is a "magic" function which, when given a binary tree, returns a list in which we have lists of elements on particular levels, for example, when given a tree:
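(figure not reproduced: a binary tree whose levels, from top to bottom, are 1 / 2 3 / 4 5 6 7 / 8 9; source: ernet.in)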
the function returns [[1];[2;3];[4;5;6;7];[8;9]].
let levels tree =
  let aux x fl fp =
    fun l ->
      match l with
      | [] -> [x] :: (fl (fp []))
      | h :: t -> (x :: h) :: (fl (fp t))
  in fold_tree aux (fun x -> x) tree [];;
And apparently it works, but I can't wrap my mind around it. Could anyone explain in simple terms what is going on? Why does this function work?
How do you combine two layer lists of two subtrees and get a layer list of a bigger tree? Suppose you have this tree
  a
 / \
x   y
where x and y are arbitrary trees, and they have their layer lists as [[x00,x01,...],[x10,x11,...],...] and [[y00,y01,...],[y10,y11,...],...] respectively.
The layer list of the new tree will be [[a],[x00,x01,...]++[y00,y01,...],[x10,x11,...]++[y10,y11,...],...]. How does this function build it?
Let's look at this definition
let rec fold_tree f a t = ...
and see what kind of arguments we are passing to fold_tree in our definition of levels.
... in fold_tree aux (fun x -> x) tree []
So the first argument, aux, is some kind of long and complicated function. We will return to it later.
The second argument is also a function — the identity function. This means that fold_tree will also return a function, because fold_tree always returns the same type of value as its second argument. We will argue that the function fold_tree applied to this set of arguments takes a list of layers, and adds layers of a given tree to it.
The third argument is our tree.
Wait, what's the fourth argument? Isn't fold_tree only supposed to get three? Yes, but since it returns a function (see above), that function gets applied to that fourth argument, the empty list.
So let's return to aux. This aux function accepts three arguments. One is the element of the tree, and two others are the results of the folds of the subtrees, that is, whatever fold_tree returns. In our case, these two things are functions again.
So aux gets a tree element and two functions, and returns yet another function. Which function is that? It takes a list of layers, and adds layers of a given tree to it. How does it do that? It prepends the root of the tree to the first element (which is the top layer) of the list, and then adds the layers of the right subtree to the tail of the list (which is all the layers below the top) by calling the right function on it, and then adds the layers of the left subtree to the result by calling the left function on it. Or, if the incoming list is empty, it just builds the layers list afresh by applying the above step to the empty list.
What is the standard way of inserting an element at a specific position in a list in OCaml? Only recursion is allowed. No assignment operation is permitted.
My goal is to compress a graph in ocaml by removing vertexes with in_degree=out_degree=1. For this reason I need to remove the adjacent edges to make a single edge. Now the edges are in a list [(6,7);(1,2);(2,3);(5,4)]. So I need to remove those edges from the list and add a single edge.
So the above list will now look like [(6,7);(1,3);(5,4)]. Here we see that (1,2);(2,3) is removed and (1,3) is inserted in the second position. I have devised an algorithm for this. But to do this I need to know how I can remove the edges (1,2);(2,3) from positions 2 and 3 and insert (1,3) in position 2, without any explicit variable and in a recursive manner.
OCaml lists are immutable, so there's no such thing as removing or inserting elements in place.
What you can do is create a new list by reusing certain parts of the old list. For example, to create the list (1, 3)::xs' from (1, 2)::(2, 3)::xs', you reuse xs' and build the new list with the cons constructor.
And pattern matching is very handy to use:
let rec transform xs =
match xs with
| [] | [_] -> xs
| (x, y1)::(y2, z)::xs' when y1 = y2 -> (x, z)::transform xs'
| (x, y1)::(y2, z)::xs' -> (x, y1)::transform ((y2, z)::xs')
You can do something like this:
let rec compress l = match l with
[] -> []
| x :: [] -> [x]
| x1 :: x2 :: xs ->
if snd x1 = fst x2 then
(fst x1, snd x2) :: compress xs
else x1 :: compress (x2 :: xs)
You are using the wrong data structure to store your edges, and your question doesn't indicate that you can't choose a different one. As other posters already said: lists are immutable, so repeated deletion of elements deep within them is a relatively costly (O(n)) operation.
I also don't understand why you have to reinsert the new edge at position 2. A graph is defined by G=(V,E) where V and E are sets of vertices and edges. Their order therefore doesn't matter. This definition of graphs also already tells you a better data structure for your edges: sets.
In OCaml, sets are represented by balanced binary trees, so the complexity of insertion and deletion of members is O(log n). For deletion this is definitely better than with lists (O(n)). On the other hand, it is more costly to add members to a set than it is to prepend elements to a list using the cons operation.
An alternative data structure would be a hashtable, where insertion and deletion can be done in O(1) time. Let the keys in the hashtable be your edges and, since you don't use the values, just use a constant like unit or 0.