How to prove the average height of a binary search tree is O(log n)?

I thought of a proof but was unable to sketch it out on paper.
I thought the recurrence relation for the height of a binary search tree would be
T(n) = T(k) + T(n-k-1) + 1
where k is the number of elements in the left subtree of the root and n-k-1 are in the right subtree (n = total nodes).
(Correct me if the above is wrong.)
Now, since we have to calculate the average case, I figured there would be half of the cases possible...
(This is the point where I start messing things up, so please correct me here.)
My claim: I would take approximately half of all possible cases.
For example, at the root the split (left, right) can be:
(0, N-1) or
(1, N-2) or
...
(N-1, 0)
where N is the total number of nodes.
Now I am considering half of the above cases for the average-case calculation...
(I don't know whether I am doing this right; a comment on it would be much appreciated.)
So I get:
T(n) = T(n/2) + T(n/2) + 1
T(n) = 2T(n/2) + 1
Now when I apply the master method to the obtained recurrence, I get O(n).
How should I proceed? (My expectation was to get log n instead of n, but that didn't work out.)
So please suggest how I should proceed further.
(Is my approach even correct from the start? Please tell me about that too.)
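For reference, the master-method step written out (my own check; this is the standard case 1 of the theorem):

\[
T(n) = 2\,T(n/2) + 1:\quad a = 2,\ b = 2,\ f(n) = \Theta(1) = O\!\left(n^{\log_2 2 - \varepsilon}\right)\ \text{for}\ \varepsilon = 1,
\]
\[
\text{so}\quad T(n) = \Theta\!\left(n^{\log_2 2}\right) = \Theta(n).
\]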

From "Algorithms" by Robert Sedgewick and Kevin Wayne
Definition. A binary search tree (BST) is a binary tree where each node has a
Comparable key (and an associated value) and satisfies the restriction that the key
in any node is larger than the keys in all nodes in that node’s left subtree and smaller than the keys in all nodes in that node’s right subtree.
Proposition C. Search hits in a BST built from N random keys require ~ 2 ln N
(about 1.39 lg N) compares, on the average.
Proof: The number of compares used for a search hit ending at a given node is
1 plus the depth. Adding the depths of all nodes, we get a quantity known as the
internal path length of the tree. Thus, the desired quantity is 1 plus the average internal path length of the BST, which we can analyze with the same argument that
we used for Proposition K in Section 2.3: Let C_N be the total internal path length
of a BST built from inserting N randomly ordered distinct keys, so that the average
cost of a search hit is 1 + C_N / N. We have C_0 = C_1 = 0, and for N > 1 we can write a
recurrence relationship that directly mirrors the recursive BST structure:
C_N = N - 1 + (C_0 + C_{N-1})/N + (C_1 + C_{N-2})/N + ... + (C_{N-1} + C_0)/N
The N - 1 term takes into account that the root contributes 1 to the path length
of each of the other N - 1 nodes in the tree; the rest of the expression accounts
for the subtrees, which are equally likely to be any of the N sizes. After rearranging
terms, this recurrence is nearly identical to the one that we solved in Section 2.3
for quicksort, and we can derive the approximation C_N ~ 2N ln N.
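For completeness, here is a sketch of how that recurrence telescopes to the stated approximation (my paraphrase of the standard quicksort-style manipulation, not quoted from the book; H_N denotes the N-th harmonic number):

\begin{align*}
C_N &= N - 1 + \frac{2}{N}\sum_{k=0}^{N-1} C_k && \text{(pair the symmetric terms)}\\
N C_N - (N-1)\,C_{N-1} &= 2(N-1) + 2\,C_{N-1} && \text{(multiply by $N$, subtract the case $N-1$)}\\
\frac{C_N}{N+1} &= \frac{C_{N-1}}{N} + \frac{2(N-1)}{N(N+1)} && \text{(divide by $N(N+1)$)}\\
\frac{C_N}{N+1} &\approx 2 H_N \approx 2\ln N && \text{(telescope)}\\
C_N &\sim 2N \ln N
\end{align*}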
I also recommend you check out this MIT lecture: Binary Search Trees, BST Sort.
Also check chapter 3.2 of the Algorithms book; it explains binary search trees in depth.

Related

Recurrence: T(n/4) + T(n/2) + n^2 with tree methods

I'm trying to solve this exercise with the tree method, but I have doubts about two parts:
1) In the T(?) column, is it correct to use (n^2/2^i) instead of (n/2^i)? I'm asking because this is the part causing my error;
2) Is the last multiplication correct (between the number of nodes and the time)? After finding the value of i, do I create a series that runs from 0 to the result of the multiplication? And should the variable of the series be 2^i (the number of nodes)?
The column for the number of nodes is misleading.
Each node has a cost of (n/k)^2, where k is the denominator of that node. With the structure you are using, the nodes in each level have a variety of denominators. For example, your level 2 should contain the nodes [(n/16), (n/8)], [(n/8), (n/4)].
The cost of a level is the sum of the costs of the nodes in that level. Since each node has a different cost, you cannot multiply the number of nodes by a single value to find the cost of a level; you have to add them up individually.
The total cost is the sum of the costs of all levels. The result of this calculation may or may not be a logarithm; it depends on the cost of each level and the number of levels.
Hint: Pascal's Triangle
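To make the hint concrete, here is a small C++ check (my own illustration; the cutoff when a level's cost drops below 1 stands in for the base case). Level i of the recursion tree for T(n) = T(n/4) + T(n/2) + n^2 contains C(i, j) nodes of cost (n/(4^j * 2^(i-j)))^2 for j = 0..i, and the binomial theorem collapses the level cost to n^2 * (5/16)^i, a geometric series:

#include <cstdio>
#include <cmath>

int main(void) {
    double n = 1e6;                  // example input size (assumption)
    double total = 0.0;
    for (int i = 0; ; ++i) {
        // Level cost: n^2 * sum_j C(i,j) (1/16)^j (1/4)^(i-j) = n^2 * (5/16)^i.
        double level = n * n * pow(5.0 / 16.0, i);
        if (level < 1.0) break;      // subproblems have hit the base case
        total += level;
    }
    // Geometric series: total -> n^2 / (1 - 5/16) = (16/11) * n^2, so T(n) = Theta(n^2).
    printf("total / n^2 = %.4f (16/11 = %.4f)\n", total / (n * n), 16.0 / 11.0);
    return 0;
}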

Create a binary search tree with a better complexity

You are given a number which is the root of a binary search tree. Then you are given an array of N elements which you have to insert into the binary search tree. The time complexity is N^2 if the array is in sorted order. I need to get the same tree structure with a much better complexity (say N log N). I tried a lot but wasn't able to solve it. Can somebody help?
I assume that all numbers are distinct (if it's not the case, you can use a pair (number, index) instead).
Let's assume that we want to insert an element X. If it's the smallest/largest element so far, it's clear where it goes.
Let a = max{y : y in tree, y < X} and b = min{y : y in tree, y > X}. I claim that:
One of them is an ancestor of the other.
Either a doesn't have a right child or b doesn't have a left child.
Proof:
Suppose neither is an ancestor of the other, and let l = lca(a, b). As a is in l's left subtree and b is in l's right subtree, a < l < b. But l is in the tree and lies strictly between a and b, contradicting the choice of a and b.
Now let a be an ancestor of b, and suppose b has a left child c. Then a < c < b, which is again a contradiction (the other case is handled similarly).
So the solution goes like this:
Let's keep a set of the elements that are already in the tree (I mean an efficient set with a lower_bound operation, like std::set in C++ or TreeSet in Java).
Upon every insertion, find a and b as described above (in O(log N) time using the set's lower_bound operation). Exactly one of them lacks the appropriate child; that's where the new element goes.
The total time complexity is clearly O(N log N).
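A minimal C++ sketch of this solution (the names Node and insertFast are mine; I assume distinct keys and that the first inserted element is the given root):

#include <cstdio>
#include <iterator>
#include <map>
#include <set>

struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
    Node(int k) : key(k) {}
};

std::set<int> keys;           // ordered set of keys already in the tree
std::map<int, Node*> byKey;   // key -> node, so a and b can be reached in O(log N)

Node* insertFast(Node* root, int x) {
    Node* n = new Node(x);
    if (root == nullptr) {                        // the given root
        keys.insert(x); byKey[x] = n;
        return n;
    }
    auto it = keys.lower_bound(x);                // first key > x (keys are distinct)
    Node* b = (it != keys.end())   ? byKey[*it] : nullptr;
    Node* a = (it != keys.begin()) ? byKey[*std::prev(it)] : nullptr;
    // Exactly one of "a lacks a right child" / "b lacks a left child" holds.
    if (a != nullptr && a->right == nullptr) a->right = n;
    else                                     b->left  = n;
    keys.insert(x); byKey[x] = n;
    return root;
}

int main() {
    int root = 4, arr[] = {1, 2, 3, 5, 6, 7};     // sorted input: worst case for naive insertion
    Node* t = insertFast(nullptr, root);
    for (int x : arr) insertFast(t, x);
    printf("%d %d %d\n", t->key, t->left->key, t->right->key);  // prints: 4 1 5
    return 0;
}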
If you look up a word in a dictionary, you open the dictionary about halfway and look at the page. That tells you whether the search word is in the first or second half of the dictionary. Repeat, eliminating half the remaining words on each pass, and you soon narrow it down to a single word. A 4-billion-word dictionary would take about 32 passes.
A binary search tree uses the same principle, except that as well as looking up, you can also insert. Insertion is O(log N), unless the tree becomes degenerate.
To prevent the tree from going degenerate, you use a system of "red" and "black" nodes (the colours are just conventional): no red node may have a red child, and every path from the root down must pass through the same number of black nodes. The full explanation is in my book, Basic Algorithms:
http://www.lulu.com/spotlight/bgy1mm
An implementation is here:
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.c
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.h
But you will need some explanation if you want to learn about red-black trees from it.

Finding the 3rd largest element in an array of size 2^k + 1 in n + 2k - 3 comparisons

"Find the 3rd largest element in array of size (2^k +1) in n+2k-3 comparisons."
This was a question I had in an Algorithms course final exam, which I didn't get all the points for. I'm stil not sure what is the correct answer after a thorough internet search.
I realize it is an extended version of the same problem with the second largest, but the tight comparison bound that was requested made the question to be tricky.
I also found a mathematical explanation to find the K-th element here, however it was too complicated for me to understand.
Denote the array size by n = 2^k + 1.
In the exam itself my answer was something like this:
We'll use a tournament tree. First, we leave out an arbitrary element.
Then build the tree on the remaining 2^k elements, so there are k levels in the tree (log(2^k)).
Finding the winner takes n - 2 comparisons.
Find the largest element among the ones that lost to the winner. (k - 1 comparisons)
Find the largest element among the ones that lost to the loser of the final. (k - 2 comparisons)
Compare these two and the element we left out at the beginning. (2 comparisons)
The largest of the 3 is the 3rd largest of the array.
Total comparisons: n - 2 + k - 1 + k - 2 + 2 = n + 2k - 3.
I got 10 points out of 25.
I made 2 mistakes. The major one: if the desired element is in the winner's subtree, my answer will be incorrect. Also, the correct answer is supposed to be the second largest of the 3 I compared at the end.
Another algorithm I found is as follows:
- Build a tournament tree and find the winner. (n - 2)
- Find the second largest by comparing all the losers to the winner (also by a tournament tree). (k - 1)
- The answer lies among the largest of the losers to the second largest, and the losers to the one who lost in the final in the first tree. (log(k+1) + k - 1 - 1)
- This solution assumes that the element we left out is not the largest in the array. If it is, I'm not sure how it behaves.
Also, I probably didn't count the number of comparisons correctly.
I'd be happy to find a better-explained algorithm.
I'd also be keen to know if there is a more generalized one for the L-th largest (K was taken..).
Thanks in advance,
Itay
Construct a tournament tree on n - 1 = 2^k of the elements, chosen arbitrarily. (n - 2 comparisons)
Replace the leaf holding the maximum of the chosen elements with the element that was not chosen, and rebuild the tournament tree along that leaf's path. (k comparisons)
Take the maximum of the elements that lost to the new maximum, as in the algorithm for the second largest. (k - 1 comparisons)
I'll leave the correctness proof as an exercise.
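A hedged C++ sketch of this answer (the heap-style array layout, the helper names, and the choice to leave out the last array element are my own; I assume n = 2^k + 1 distinct values). Internal node i of t holds the index of the element that won that match, so the total is (n - 2) + k + (k - 1) = n + 2k - 3 comparisons:

#include <cstdio>
#include <vector>
using std::vector;

static int comparisons = 0;

// Returns whichever of the two a-indices holds the larger value (one comparison).
static int fight(const vector<int>& a, int i, int j) {
    ++comparisons;
    return a[i] > a[j] ? i : j;
}

// a has n = 2^k + 1 distinct values; returns the 3rd largest.
int thirdLargest(const vector<int>& a) {
    int m = (int)a.size() - 1;               // m = 2^k leaves; a[m] is left out
    vector<int> t(2 * m);                    // t[i] = a-index winning the match at node i
    vector<int> leafOf(a.size());            // a-index -> its leaf slot in t
    for (int i = 0; i < m; ++i) { t[m + i] = i; leafOf[i] = m + i; }
    for (int i = m - 1; i >= 1; --i)         // build: m - 1 = n - 2 comparisons
        t[i] = fight(a, t[2 * i], t[2 * i + 1]);
    int w1 = t[1];                           // max of the chosen m elements
    int p = leafOf[w1];
    t[p] = m;  leafOf[m] = p;                // replace w1's leaf with the left-out a[m]
    for (p /= 2; p >= 1; p /= 2)             // rebuild that path: k comparisons
        t[p] = fight(a, t[2 * p], t[2 * p + 1]);
    int w2 = t[1];                           // max of the n - 1 elements now in the tree
    int best = -1;
    for (int q = leafOf[w2]; q > 1; q /= 2) {// the k elements that lost to w2 are the
        int loser = t[q ^ 1];                // winners of the siblings along its path
        best = (best < 0) ? loser : fight(a, best, loser);  // k - 1 comparisons
    }
    return a[best];                          // 3rd largest overall
}

int main() {
    vector<int> a = {5, 1, 9, 4, 8, 2, 7, 3, 6};   // n = 9 = 2^3 + 1, so k = 3
    printf("3rd largest = %d, comparisons = %d (n + 2k - 3 = %d)\n",
           thirdLargest(a), comparisons, 9 + 2 * 3 - 3);   // prints 7, 12, 12
    return 0;
}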

Binary Search Tree Construction of height h

Given the first n natural numbers as the keys for a BST, how can I determine the root nodes of all possible trees that have height h?
I already came up with a brute-force method where I constructed all possible trees with n nodes and then selected the trees of height h, but it has a time complexity of nearly O(n!). Can someone please suggest a more efficient method?
Problem statement. Given natural numbers n and h, determine exactly all elements root in 1..n such that root is the root of a binary search tree on 1..n of height h.
Solution. We can construct a degenerate binary search tree on 1..n starting from any number in 1..n by splitting it up at root. This changes the lower bounds from the old solution to h-1, while the upper bounds remain the same, rendering the full bounds as follows:
h-1 <= max(root-1, n-root) <= 2^h - 1
Old solution (correct only for full binary trees). A full binary tree with height h has at least 2h+1 nodes, and at most 2^(h+1)-1 nodes. It's easy to see that these bounds are tight, not only for binary trees, but also for binary search trees. In particular, they apply to the left and right subtrees of your root. Since this is a binary search tree on 1..n, you will have that left contains exactly the elements 1..(root-1), and right contains exactly the elements (root+1)..n.
This means that the following is both a necessary and sufficient condition: The larger of the subtrees left and right must satisfy the inequalities
2*(h-1) + 1 <= nodes(subtree) <= 2^h - 1
In other words, the possible values of root are exactly all values in 1..n satisfying
2*(h-1) + 1 <= max(root-1, n-root) <= 2^h - 1
Update. I blindly looked at an inequality I found on Wikipedia without realizing that it applies only to full binary trees.
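A quick sketch that applies the final inequality above verbatim (my own illustration; n and h are example parameters, and height counts edges as in the answer):

#include <algorithm>
#include <cstdio>

int main() {
    int n = 10, h = 3;   // example parameters (assumptions)
    // root can be the root of a BST on 1..n of height h exactly when
    // h-1 <= max(root-1, n-root) <= 2^h - 1 (the answer's condition).
    for (int root = 1; root <= n; ++root) {
        int larger = std::max(root - 1, n - root);
        if (h - 1 <= larger && larger <= (1 << h) - 1)
            printf("%d ", root);
    }
    printf("\n");
    return 0;
}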

How to efficiently check whether a massively skewed binary search tree is height-balanced?

I was reading this answer on how to check whether a BST is height-balanced, and was really hooked by the bonus question:
Suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?
What would be a good strategy here?
I am thinking of doing a level-order traversal and tracking the depth: if a leaf is found and the current node's depth is bigger than the leaf node's depth + 2, then it's not balanced. But how do I combine this with height checking?
Edit: below is the implementation in the linked answer
IsHeightBalanced(tree)
    return (tree is empty) or
           (IsHeightBalanced(tree.left) and
            IsHeightBalanced(tree.right) and
            abs(Height(tree.left) - Height(tree.right)) <= 1)
To review briefly: a tree is defined as being either null or a root node with pointers .left to a left child and .right to a right child, where each child is in turn a tree, the root node appears in neither child, and no node appears in both children. The depth of a node is the number of pointers that must be followed to reach it from the root node. The height of a tree is -1 if it's null or else the maximum depth of a node that appears in it. A leaf is a node whose children are null.
First let me note the two distinct definitions of "balanced" proposed by answerers of the linked question.
EL-balanced A tree is EL-balanced if and only if, for every node v, |height(v.left) - height(v.right)| <= 1.
This is the balance condition for AVL trees.
DF-balanced A tree is DF-balanced if and only if, for every pair of leaves v, w, we have |depth(v) - depth(w)| <= 1. As DF points out, DF-balance for a node implies DF-balance for all of its descendants.
DF-balance is used for no algorithm known to me, though the balance condition for binary heaps is very similar, requiring additionally that the deeper leaves be as far left as possible.
I'm going to outline three approaches to testing balance.
Size bounds for balanced trees
Expand the recursive function to have an extra parameter, maxDepth. For each recursive call, pass maxDepth - 1, so that maxDepth roughly tracks how much stack space is left. If maxDepth reaches 0, report the tree as unbalanced (e.g., by returning "infinity" for the height), since no balanced tree that fits in main memory could possibly be that tall.
This approach relies on an a priori size bound on main memory, which is available in practice if not in all theoretical models, and the fact that no subtrees are shared. (PROTIP: unless you're very careful, your subtrees will be shared at some point during development.) We also need height bounds on balanced trees of at most a given size.
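Here is how this first approach might look in C++ (a sketch under my own names; each node is visited once, with Height folded into the same recursion, and INT_MAX serves as the "infinity" that flags either imbalance or excessive depth):

#include <climits>
#include <cstdlib>

struct Tree { Tree *left, *right; };

// Returns the height of t (in edges, -1 for null), or INT_MAX if t is either
// EL-unbalanced or deeper than maxDepth. Stack usage is bounded by maxDepth,
// so a massively skewed tree cannot blow the stack.
int heightChecked(const Tree* t, int maxDepth) {
    if (t == nullptr) return -1;
    if (maxDepth == 0) return INT_MAX;   // no balanced tree that fits in memory is this tall
    int hl = heightChecked(t->left, maxDepth - 1);
    int hr = heightChecked(t->right, maxDepth - 1);
    if (hl == INT_MAX || hr == INT_MAX || abs(hl - hr) > 1) return INT_MAX;
    return 1 + (hl > hr ? hl : hr);
}

// maxDepth: the base-phi log of the maximum node count plus a small constant,
// as derived below.
bool isHeightBalanced(const Tree* t, int maxDepth) {
    return heightChecked(t, maxDepth) != INT_MAX;
}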
EL-balanced Via mutual induction, we prove a lower bound, L(h), on the number of nodes belonging to an EL-balanced tree of a given height h.
The base cases are
L(-1) = 0
L(0) = 1,
more or less by definition. The inductive case is trickier. An EL-balanced tree of height h > 0 is a node with an EL-balanced child of height h - 1 and another EL-balanced child of height either h - 1 or h - 2. This means that
L(h) = 1 + L(h - 1) + min(L(h - 2), L(h - 1)).
Add 1 to both sides and rearrange.
L(h) + 1 = L(h - 1) + 1 + min(L(h - 2) + 1, L(h - 1) + 1).
A little while later (spoiler), we find that
L(h) <= phi^(h + 2)/sqrt(5),
where phi = (1 + sqrt(5))/2 ~ 1.618.
maxDepth then should be set to the floor of the base-phi logarithm of the maximum number of nodes, plus a small constant that depends on fenceposty things.
DF-balanced Rather than write out an induction proof, I'm going to appeal to your intuition that the worst case is a complete binary tree with one extra leaf on the bottom. Then the proper setting for maxDepth is the base-2 logarithm of the maximum number of nodes, plus a small constant.
Iterative deepening depth-first search
This is the theoretician's version of the answer above. Because, for some reason, we don't know how much RAM our computer has (and with logarithmic space usage, it's not as though we need a tight bound), we again include the maxDepth parameter, but this time, we use it to truncate the tree implicitly below the specified depth. If the height of the tree comes back below the bound, then we know that the algorithm ran successfully. Alternatively, if the truncated tree is unbalanced, then so is the whole tree. The problem case is when the truncated tree is balanced but with height equal to maxDepth. Then we increase maxDepth and retry.
The simplest retry strategy is to increase maxDepth by 1 every time. Since balanced trees with n nodes have height O(log n), the running time is O(n log n). In fact, for DF-balanced trees, the running time is also O(n), since, except for the last couple traversals, the size of the truncated tree increases by a factor of 2 each time, leading to a geometric series.
Another strategy, doubling maxDepth each time, gives an O(n) running time for EL-balanced trees, since the largest tree of height h, with 2^(h + 1) - 1 nodes, is much smaller than the smallest tree of height 2h, with approximately (phi^2)^h nodes. The downside of doubling is that we may use twice as much stack space. With increase-by-1, however, in the family of minimum-size EL-balanced trees we constructed implicitly in defining L(h), the number of nodes at depth h - k in the tree of height h is polynomial of degree k. Accordingly, the last few scans will incur some superlinear term.
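The doubling strategy might be sketched like this (my own formulation, reusing the Tree struct from the earlier sketch; check truncates the tree at maxDepth by treating nodes at that depth as leaves, and reports an imbalance only when it must be real, since truncation can only under-report the deeper side):

#include <climits>

struct Tree { Tree *left, *right; };
struct Result { int height; bool truncated; };   // height of the truncated tree

Result check(const Tree* t, int maxDepth) {
    if (t == nullptr) return {-1, false};
    if (maxDepth == 0) return {0, true};         // cut off: pretend t is a leaf
    Result l = check(t->left, maxDepth - 1);
    Result r = check(t->right, maxDepth - 1);
    if (l.height == INT_MAX || r.height == INT_MAX ||
        l.height - r.height > 1 || r.height - l.height > 1)
        return {INT_MAX, false};                 // a real imbalance: truncation only
                                                 // shrinks the deeper side
    return {1 + (l.height > r.height ? l.height : r.height),
            l.truncated || r.truncated};
}

bool isHeightBalanced(const Tree* t) {
    for (int maxDepth = 1; ; maxDepth *= 2) {    // doubling: O(n) total for EL-balanced trees
        Result res = check(t, maxDepth);
        if (res.height == INT_MAX) return false; // truncated tree already unbalanced
        if (!res.truncated) return true;         // the whole tree fit under the bound
    }
}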
Temporarily mutating pointers
If there are parent pointers, then it's easy to traverse depth-first in place, because the parent pointers can be used to recover, efficiently, the information that would otherwise live on the stack. If we don't have parent pointers but can mutate the tree temporarily, then, on descent into a child, we can cannibalize the pointer to that child to store the node's parent temporarily. The problem is determining on the way up whether we came from a left or a right child. If we can sneak in a bit (say because pointers are 2-byte aligned, or because there's a spare bit in the balance factor, or because we're copying the tree for stop-and-copy garbage collection and can determine which arena we're in), then that's one way. Another test assumes that the tree is a binary search tree. It turns out that we don't need additional assumptions, however: see Explain Morris inorder tree traversal without using stacks or recursion.
The one fly in the ointment is that this approach only works, as far as I know, on DF-balance, since there's no space on the stack to put the partial results for EL-balance.
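For reference, here is the Morris-style traversal that the linked explanation describes (standard technique, sketched from memory): each node is reached via a temporary "thread" from its inorder predecessor, so no stack is needed and all pointers are restored by the time the traversal finishes:

#include <cstdio>

struct Tree { int key; Tree *left, *right; };

// Morris inorder traversal: O(1) extra space, no recursion, tree restored afterwards.
void morrisInorder(Tree* t) {
    Tree* cur = t;
    while (cur != nullptr) {
        if (cur->left == nullptr) {
            printf("%d ", cur->key);               // visit, then go right
            cur = cur->right;
        } else {
            Tree* pred = cur->left;                // inorder predecessor: rightmost
            while (pred->right != nullptr && pred->right != cur)
                pred = pred->right;                // node in the left subtree
            if (pred->right == nullptr) {          // first arrival: thread back to cur
                pred->right = cur;
                cur = cur->left;
            } else {                               // second arrival: unthread and visit
                pred->right = nullptr;
                printf("%d ", cur->key);
                cur = cur->right;
            }
        }
    }
}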
