Merging Binary search tree - algorithm

I'm looking for a efficient way of merging two BST, knowing that the elements in the first tree are all lower than those in the second one.
I saw some merging methods but without that feature, i think this should simplify the algorithm.
Any idea ?

If the trees are not balanced, or the result shouldn't be balanced that's quite easy:
without loss of generality - let the first BST be smaller (in size) than the second.
1. Find the highest element in the first BST - this is done by following the right son while it is still available. Let the value be x, and the node be v
2. Remove this item (v) from the first tree
3. Create a new Root with value x, let this new root be r
4. set r.left = tree1.root, r.right = tree2.root
(*) If the first BST is bigger in size than the second, just repeat the process for finding v as the smallest node in the second tree.
(*) Complexity is O(min{|T1|,|T2|}) worst case (finding highest element if the tree is very inbalanced), and O(log(min{|T1|,|T2|})) average case - if the tree is relatively balanced.

Related

Why is the number of sub-trees gained from a range tree query is O(log(n))?

I'm trying to figure out this data structure, but I don't understand how can we
tell there are O(log(n)) subtrees that represents the answer to a query?
Here is a picture for illustration:
Thanks!
If we make the assumption that the above is a purely functional binary tree [wiki], so where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds, than we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. So we can, without any bounds return a tree in O(1), given that tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
the cutted version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node has a value is smaller than the threshold, we return a "cutted" version of the right subtree; and
if the node has a value greater than the threshold, we return an inode that has the same right subtree, and as left subchild the cutted version of the left subchild.
So in pseudo-code it looks like:
def treelarger(some_node, min):
if some_tree is None:
return None
if some_node.value > min:
return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
else:
return treelarger(some_node.right, min)
This algorithm thus runs in O(h) with h the height of the tree, since for each case (except the first one), we recurse to one (not both) of the children, and it ends in case we have a node without children (or at least does not has a subtree in the direction we need to cut the subtree).
We thus do not make a complete copy of the tree. We reuse a lot of nodes in the old tree. We only construct a new "surface" but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct, at most, O(h) new nodes. We can optimize the above such that, given the cutted version of one of the subtrees is the same, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:
def treesmaller(some_node, max):
if some_tree is None:
return None
if some_node.value < min:
return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
else:
return treesmaller(some_node.left, max)
so roughly speaking there are two differences:
we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.
Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
Hope this helps!

An efficient pseudo-code that check if a given BST is a valid AVL tree

I need to write an algorithm (in pseudo-code) that chech if a given BST tree is a valid AVL tree. In doing that I need to give to each node a rank (in AVL trees rank means the height of the node) so the outcome will be a valid AVL tree.
I thought about a simple algorithm that calculates in each step the height of a node and the height of its two sons (if the sons is null then the height is -1), and then checks if the difference between the heights is 1,1 or 1,2 or 2,1. If not then it's not an AVL tree. If yes than we do the same for node.left and node.right.
My problem with that algorithm is that the time complexity is very huge and could go even to O(n^2). Is there a more efficient algorithm?
Another algorithm that I want to find is when given a valid AVL tree and the rank of each node (rank=height), I need to find a series of inserts that build the same tree. I thought about doing it by sorted order of the keys, but the outcome isn't right.
AVL-Tree check
You actually got the right idea. But you missed out on using the recursive definition of the height of a tree
height(node) = 1 + max(height(node.left), height(node.right))
which is why you got an enormous complexity (though O(2^n) is too pessimistic). Instead of directly calculating the height of a node in a top-bottom-approach, you could go the other way and calculate the height of the individual nodes bottom-top:
valid_avl(node):
if node is null then
return -1, True
left_height, left_valid = valid_avl(node.left)
right_height, right_valid = valid_avl(node.right)
if not left_valid or not right_valid or abs(left_height - right_height) > 1 then
return -1, False
else
return 1 + max(left_height, right_height), True
You might want to split the this function into two functions and avoid the use of tuples, depending on the language you use. Note that the above is albeit looking similar to python just pseudocode!!!
Rebuilding the tree
This ones actually fairly simple. Get all values in the tree in level-order and insert them exactly in this order. This way the tree never gets rebalanced and each node is automatically placed in the correct position.

Create a binary search tree with a better complexity

You are given a number which is the root of a binary search tree. Then you are given an array of N elements which you have to insert into the binary search tree. The time complexity is N^2 if the array is in the sorted order. I need to get the same tree structure in a much better complexity (say NlogN). I tried it a lot but wasn't able to solve it. Can somebody help?
I assume that all numbers are distinct (if it's not the case, you can use a pair (number, index) instead).
Let's assume that we want to insert we want to insert an element X. If it's the smallest/the largest element so far, its clear where it goes.
Let's a = max y: y in tree and y < X and b = min y: y in tree and y > X. I claim that:
One of them is an ancestor of the other.
Either a doesn't have the right child or b doesn't have the left child.
Proof:
Let it not be the case. Let l = lca(a, b). As a is in its left subtree and b is in it's right subtree, a < l < b. Contradiction.
Let a be an ancestor of b. If b has a left child c. Than a < c < b. Contradiction (the other case is handled similarly).
So the solution goes like this:
Let's a keep a set of elements that are already in a tree (I mean an efficient set with lower_bound operation like std::set in C++ or TreeSet in Java).
Let's find a and b as described above upon every insertion (in O(log N) time using the set's lower_bound operation). Exactly one of them doesn't have an appropriate child. That's where the new element goes.
The total time complexity is clearly O(N log N).
If you look up a word in a dictionary, you open the dictionary about halfway and look at the page. That then tells you if the search word is in the first or second half of the dictionary. Repeat, eliminating half the remaining words on each pass, and you soon narrow it down to a single word. 4 billion word dictionaries will take about 32 passes.
A binary search tree uses the same principle. Except as well as looking up, you can also insert. Insertion is O(log N), unless the tree becomes degenerate.
To prevent the tree going degenerate, you use a system of "red" and "black" nodes (the colours are just conventional), and you don't allow long runs of
either colour. The full explanation is in my book, Basic Algorithms
http://www.lulu.com/spotlight/bgy1mm
An implementation is here
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.c
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.h
But you will need some explanation if you want to learn about red black
trees from it.

How to efficiently check whether it's height balanced for a massively skewed binary search tree?

I was reading this answer on how to check if a BST is height balanced, and really hooked by the bonus question:
Suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?
What would be a good strategy here?
I am thinking to do a level order traversal and track the depth, if a leaf is found and current node depth is bigger than the leaf node depth + 2, then it's not balanced. But how to combine this with height checking?
Edit: below is the implementation in the linked answer
IsHeightBalanced(tree)
return (tree is empty) or
(IsHeightBalanced(tree.left) and
IsHeightBalanced(tree.right) and
abs(Height(tree.left) - Height(tree.right)) <= 1)
To review briefly: a tree is defined as being either null or a root node with pointers .left to a left child and .right to a right child, where each child is in turn a tree, the root node appears in neither child, and no node appears in both children. The depth of a node is the number of pointers that must be followed to reach it from the root node. The height of a tree is -1 if it's null or else the maximum depth of a node that appears in it. A leaf is a node whose children are null.
First let me note the two distinct definitions of "balanced" proposed by answerers of the linked question.
EL-balanced A tree is EL-balanced if and only if, for every node v, |height(v.left) - height(v.right)| <= 1.
This is the balance condition for AVL trees.
DF-balanced A tree is DF-balanced if and only if, for every pair of leaves v, w, we have |depth(v) - depth(w)| <= 1. As DF points out, DF-balance for a node implies DF-balance for all of its descendants.
DF-balance is used for no algorithm known to me, though the balance condition for binary heaps is very similar, requiring additionally that the deeper leaves be as far left as possible.
I'm going to outline three approaches to testing balance.
Size bounds for balanced trees
Expand the recursive function to have an extra parameter, maxDepth. For each recursive call, pass maxDepth - 1, so that maxDepth roughly tracks how much stack space is left. If maxDepth reaches 0, report the tree as unbalanced (e.g., by returning "infinity" for the height), since no balanced tree that fits in main memory could possibly be that tall.
This approach relies on an a priori size bound on main memory, which is available in practice if not in all theoretical models, and the fact that no subtrees are shared. (PROTIP: unless you're very careful, your subtrees will be shared at some point during development.) We also need height bounds on balanced trees of at most a given size.
EL-balanced Via mutual induction, we prove a lower bound, L(h), on the number of nodes belonging to an EL-balanced tree of a given height h.
The base cases are
L(-1) = 0
L(0) = 1,
more or less by definition. The inductive case is trickier. An EL-balanced tree of height h > 0 is a node with an EL-balanced child of height h - 1 and another EL-balanced child of height either h - 1 or h - 2. This means that
L(h) = 1 + L(h - 1) + min(L(h - 2), L(h - 1)).
Add 1 to both sides and rearrange.
L(h) + 1 = L(h - 1) + 1 + min(L(h - 2) + 1, L(h - 1) + 1).
A little while later (spoiler), we find that
L(h) <= phi^(h + 2)/sqrt(5),
where phi = (1 + sqrt(5))/2 ~ 1.618.
maxDepth then should be set to the floor of the base-phi logarithm of the maximum number of nodes, plus a small constant that depends on fenceposty things.
DF-balanced Rather than write out an induction proof, I'm going to appeal to your intuition that the worst case is a complete binary tree with one extra leaf on the bottom. Then the proper setting for maxDepth is the base-2 logarithm of the maximum number of nodes, plus a small constant.
Iterative deepening depth-first search
This is the theoretician's version of the answer above. Because, for some reason, we don't know how much RAM our computer has (and with logarithmic space usage, it's not as though we need a tight bound), we again include the maxDepth parameter, but this time, we use it to truncate the tree implicitly below the specified depth. If the height of the tree comes back below the bound, then we know that the algorithm ran successfully. Alternatively, if the truncated tree is unbalanced, then so is the whole tree. The problem case is when the truncated tree is balanced but with height equal to maxDepth. Then we increase maxDepth and retry.
The simplest retry strategy is to increase maxDepth by 1 every time. Since balanced trees with n nodes have height O(log n), the running time is O(n log n). In fact, for DF-balanced trees, the running time is also O(n), since, except for the last couple traversals, the size of the truncated tree increases by a factor of 2 each time, leading to a geometric series.
Another strategy, doubling maxDepth each time, gives an O(n) running time for EL-balanced trees, since the largest tree of height h, with 2^(h + 1) - 1 nodes, is much smaller than the smallest tree of height 2h, with approximately (phi^2)^h nodes. The downside of doubling is that we may use twice as much stack space. With increase-by-1, however, in the family of minimum-size EL-balanced trees we constructed implicitly in defining L(h), the number of nodes at depth h - k in the tree of height h is polynomial of degree k. Accordingly, the last few scans will incur some superlinear term.
Temporarily mutating pointers
If there are parent pointers, then it's easy to traverse depth-first in place, because the parent pointers can be used to derive the relevant information on the stack in an efficient manner. If we don't have parent pointers but can mutate the tree temporarily, then, for descent into a child, we can cannibalize the pointer to that child to store temporarily the node's parent. The problem is determining on the way up whether we came from a left or a right child. If we can sneak a bit (say because pointers are 2-byte aligned, or because there's a spare bit in the balance factor, or because we're copying the tree for stop-and-copy garbage collection and can determine which arena we're in), then that's one way. Another test assumes that the tree is a binary search tree. It turns out that we don't need additional assumptions, however: Explain Morris inorder tree traversal without using stacks or recursion .
The one fly in the ointment is that this approach only works, as far as I know, on DF-balance, since there's no space on the stack to put the partial results for EL-balance.

What is the best way to implement this type of binary tree?

Let's say I have a binary tree of height h, that have elements of x1, x2, ... xn.
xi element is initially at the ith leftmost leaf. The tree should support the following methods in O(h) time
add(i, j, k) where 1 <= i <= j =< n. This operation adds value k to the values of all leftmost nodes which are between i and j. For example, add(2,5,3) operation increments all leftmost nodes which are between 2th and 5th nodes by 3.
get(i): return the value of ith leftmost leaf.
What should be stored at the internal nodes for that properties?
Note: I am not looking for an exact answer but any hint on how to approach the problem would be great.
As I understand the question, the position of the xi'th element never changes, and the tree isn't a search tree, the search is based solely on the position of the node.
You can store an offset in the non leaves vertices, indicating the value change of its descendant.
add(i,j,k) will start from the root, and deepen in the tree, and increase a node's value by k if and only if, all its descendants are within the range [i,j]. If it had increased the value, no further deepening will occur.
Note1: In a single add() operation, you might need to add more then one number.
Note2: You actually need to add at most O(logn) = O(logh) values [convince yourself why. Hint: binary presentation of number up to n requires O(logn) bits], which later gives you [again, make sure you understand why] the O(logh) needed complexity.
get(i) is then trivial: summarize the values from the root to the i'th leaf, and return this sum.
Since it seems homework, I will not post a pseudo-code, this guidelines should get you started with this assigment.

Resources