Determine a self-balanced binary tree height formula knowing its number of nodes - algorithm

I have been working on determining the height of a self-balancing binary tree from its number of nodes (N), and I came up with the formula:
height = ceiling[log2(N+1)], where ceiling[x] is the smallest integer not less than x.
The thing is, I can't find this formula on the internet, and it seems pretty accurate.
Is there any case of a self-balancing binary tree for which this formula would fail?
What would be the general formula to determine the height of the tree, then?

There is a formula in Wikipedia's Self-balancing binary search tree article:
h >= ceiling(log2(n+1)) - 1 >= floor(log2(n))
The minimum height is floor(log2(n)). It's worth noting, however, that the simplest algorithms for BST item insertion may yield a tree of height n in rather common situations. For example, when the items are inserted in sorted key order, the tree degenerates into a linked list with n nodes.
So your formula is not far off from the "good approximation" formula (just off by 1), but there can be a pretty wide range between n and log2(n) to take into account.
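As a quick sanity check, the identity behind that lower bound can be verified numerically (a small Python sketch, not a proof):

```python
import math

# For every n >= 1, ceil(log2(n+1)) - 1 equals floor(log2(n)),
# the minimum possible height of a binary tree with n nodes.
for n in range(1, 10_000):
    assert math.ceil(math.log2(n + 1)) - 1 == math.floor(math.log2(n))
print("identity holds for n = 1 .. 9999")
```

So the question's ceiling[log2(N+1)] overshoots the minimum height by exactly 1 for every N, which matches the "just off by 1" observation.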


Time Efficiency of Binary Search Tree

For the time efficiency of inserting into a binary search tree,
I know that the best/average case of insertion is O(log n), whereas the worst case is O(n).
What I'm wondering is whether there is any way to ensure that we always get the best/average case when inserting, besides implementing an AVL tree (balanced BST)?
Thanks!
There is no guaranteed O(log n) complexity without balancing a binary search tree. While searching/inserting/deleting, you have to navigate through the tree in order to position yourself at the right place and perform the operation. The key question is: what is the number of steps needed to get to the right position? If the BST is balanced, you can expect on average 2^(i-1) nodes at level i. This further means that if the tree has k levels (k is called the height of the tree), the expected number of nodes in the tree is 1 + 2 + 4 + ... + 2^(k-1) = 2^k - 1 = n, which gives k ≈ log n, and that is the average number of steps needed to navigate from the root to a leaf.
Having said that, there are various implementations of balanced BSTs. You mentioned AVL; the other very popular one is the red-black tree, which is used e.g. in C++ for implementing std::map or in Java for implementing TreeMap.
The worst case, O(n), can happen when you don't balance the BST and your tree degenerates into a linked list. It is clear that in order to reach the end of the list (which is the worst case), you have to iterate through the whole list, and this requires n steps.
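To see the degenerate case concretely, here is a minimal sketch (the dict-based node layout is illustrative, not from any particular library) comparing sorted-order and shuffled-order insertion into a plain, unbalanced BST:

```python
import random

def insert(node, key):
    # Plain BST insert with no rebalancing.
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def height(node):
    # Height in edges; an empty tree has height -1.
    if node is None:
        return -1
    return 1 + max(height(node["left"]), height(node["right"]))

n = 255
sorted_root = None
for k in range(n):            # sorted key order: degenerates into a chain
    sorted_root = insert(sorted_root, k)

random.seed(42)
keys = list(range(n))
random.shuffle(keys)
random_root = None
for k in keys:                # random key order: expected O(log n) height
    random_root = insert(random_root, k)

print(height(sorted_root))    # 254, the linked-list worst case
print(height(random_root))    # far smaller, on the order of log2(255)
```

The same insertion routine produces a chain of height n - 1 for sorted input, but a shallow tree for shuffled input.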

Balanced Binary Search Trees on the basis of size of left and right child subtrees

I have two questions:
What is the difference between a nearly balanced BST and a nearly complete binary tree? Even if the definition of the former were clear we could differentiate them, but I am not able to find a relevant article.
Today, in my class I was taught the condition for being balanced as:
max( size(root.left) , size(root.right) ) <= 3*n/4 ------------ (eqn 1).
Hence, H(n) = height for a tree of n nodes following the above property <= 1 + H(3*n/4).
Continuing the recursive steps, we get a bound of log n.
My question is: is this a specific type of BST? For example, in the case of AVL trees, as I remember, the condition is that the difference in the heights of the left and right children is at most 1. Or is this a more general result, and can equation 1 as stated earlier be reduced to prove the result for AVL trees as well? I.e., does any balanced BST result in a difference of heights of siblings of at most 1?
In case it's different from AVL, how do we manage the insertion and delete operations in this new kind of tree?
EDIT: Also, can you explain why 3*n/4 in particular?
My thought: it is because we can then surely say that H(n) <= 1 + H(3*n/4). If instead we took something like 3n/5, which is less than 3n/4, then H(3n/5) won't necessarily be less than H(2n/5), as the ratio of 3n/5 to 2n/5 is less than 2, and as we know, a factor of 2 in the number of nodes increases the height by 1.
So we can't surely write H(n) <= 1 + H(3n/5); it may be H(2n/5) in place of H(3n/5) as well. Am I right?
A nearly complete BST is a BST where all levels are filled, except possibly the last one. Definitions are kind of messed up here (some call this property perfect). Please refer to Wikipedia for this.
Being balanced is a less strict criterion, i.e. all (nearly) complete BSTs are balanced, but not all balanced BSTs are complete. That Wikipedia article has a definition for that too. In my world, a BST is balanced if it leads to O(log n) operation cost.
For example, one can say a BST is balanced if each subtree has at most epsilon * n nodes, where epsilon < 1 (for example epsilon = 3/4, or even epsilon = 0.99 -- which is practically not balanced at all).
The reason is that the height of such a BST is roughly log_{1/epsilon} n = log_2 n / (-log_2 epsilon) = O(log n), but 1 / (-log_2 0.99) ≈ 69 is a huge constant. You can try to prove this with the usual ratio of epsilon = 1/2, where both subtrees have roughly the same size.
I don't know of a common BST which uses this 3/4. Common BSTs are for example Red-Black-Trees, Splay-Trees or - if you are on a hard disk - a whole family of B-Trees. For your example, you can probably implement the operations by augmenting each node with two integers representing the number of nodes in the left and the right subtree respectively. When inserting or deleting something, you update the numbers as you walk from the root to the leaf (or up), and if the condition is violated, you do a rotation.
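To see how the 3n/4 condition forces logarithmic height, one can simply iterate the recurrence H(n) <= 1 + H(3n/4) numerically (a sketch; `height_bound` is a hypothetical helper name, not a standard function):

```python
import math

def height_bound(n):
    # Count how many times n can shrink by a factor of 3/4 before hitting 1;
    # this bounds the height under max(size_left, size_right) <= 3n/4.
    steps = 0
    while n > 1:
        n = (3 * n) // 4
        steps += 1
    return steps

n = 1_000_000
print(height_bound(n))                           # at most log_{4/3}(n) steps
print(math.ceil(math.log(n) / math.log(4 / 3)))  # 49
```

For a million nodes the bound is a few dozen levels, i.e. O(log n) with a constant of 1 / log2(4/3) ≈ 2.41 relative to log2 n.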

What exactly does log n height mean?

I came to know that the height of random BSTs/red-black trees and some other trees is O(log n).
I wonder how this can be. Let's say I have a tree like this
The height of the tree is essentially the depth of the tree, which in this case will be 4 (not counting the root level). But how can people say that the height can be represented by the O(log n) notation?
I'm very new to algorithms, and this point is confusing me a lot. Where am I missing the point?
In algorithm complexity the variable n typically refers to the total number of items in a collection or involved in some calculation. In this case, n is the total number of nodes in the tree. So, in the picture you posted n=31. If the height of the tree is O(log n) that means that the height of the tree is proportional to the log of n. Since this is a binary tree, you'd use log base 2.
⌊log₂(31)⌋ = 4
Therefore, the height of the tree should be about 4, which is exactly the case in your example.
As I explained in a comment, a binary tree can have multiple cases:
In the degenerate case, a binary tree is simply a chain, and its height is O(n).
In the best case (for most search algorithms), a complete binary tree has the property that for any node, the heights of its subtrees are the same. In this case the height will be the floor of log(n) (base 2, or base k for k branches). You can prove this by induction on the size of the tree (structural induction on the constructors).
In the general case you will have a mix of these: a tree where any node may have subtrees of different heights.
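The two extreme cases can be checked with a few lines (a sketch using bare (left, right) tuples as nodes; the helper names are made up):

```python
def height(t):
    # Height in edges; None is the empty tree with height -1.
    if t is None:
        return -1
    left, right = t
    return 1 + max(height(left), height(right))

def chain(n):
    # Degenerate case: n nodes in a single path, height n - 1.
    t = None
    for _ in range(n):
        t = (t, None)
    return t

def complete(levels):
    # Complete case: 2^levels - 1 nodes, height levels - 1.
    if levels == 0:
        return None
    sub = complete(levels - 1)
    return (sub, sub)

print(height(chain(31)))     # 30
print(height(complete(5)))   # 4 = floor(log2(31)), matching the 31-node example
```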

Is O(logn) always a tree?

We always see that operations on a (binary search) tree have O(log n) worst-case running time because the tree height is log n. I wonder: if we are told that an algorithm has a running time that is a function of log n, e.g. m + n log n, can we conclude it must involve an (augmented) tree?
EDIT:
Thanks to your comments, I now realize divide and conquer and binary trees are very similar visually/conceptually. I had never made a connection between the two. But I can think of a case where an O(log n) algorithm is not a divide-and-conquer algorithm: it involves a tree, but one with none of the properties of a BST/AVL/red-black tree.
That's the disjoint-set data structure with Find/Union operations, whose running time is O(N + M log N), with N being the number of elements and M the number of Find operations.
Please let me know if I'm missing something, but I cannot see how divide and conquer comes into play here. I just see in this (disjoint-set) case that there is a tree with no BST property and a running time that is a function of log N. So my question is about why/why not I can make a generalization from this case.
What you have is exactly backwards. O(lg N) generally means some sort of divide-and-conquer algorithm, and one common way of implementing divide and conquer is a binary tree. While binary trees are a substantial subset of all divide-and-conquer algorithms, they are a subset nonetheless.
In some cases, you can transform other divide-and-conquer algorithms fairly directly into binary trees (e.g. comments on another answer have already made an attempt at claiming a binary search is similar). Just for another obvious example, however, a multiway tree (e.g. a B-tree, B+ tree or B* tree), while clearly a tree, is just as clearly not a binary tree.
Again, if you want to badly enough, you can stretch the point that a multiway tree can be represented as sort of a warped version of a binary tree. If you want to, you can probably stretch all the exceptions to the point of saying that all of them are (at least something like) binary trees. At least to me, however, all that does is make "binary tree" synonymous with "divide and conquer". In other words, all you accomplish is warping the vocabulary and essentially obliterating a term that's both distinct and useful.
No, you can also binary search a sorted array (for instance). But don't take my word for it: http://en.wikipedia.org/wiki/Binary_search_algorithm
As a counter-example:

def count_doublings(a):
    y = 0
    x = 1
    while x < len(a):   # executes about log2(len(a)) times
        y += 1
        x *= 2
    return y

The run time is O(log(n)), but no tree here!
The answer is no. Binary search of a sorted array is O(log(n)).
Algorithms taking logarithmic time are commonly found in operations on binary trees.
Examples of O(logn):
Finding an item in a sorted array with a binary search or a balanced search tree.
Look up a value in a sorted input array by bisection.
As O(log(n)) is only an upper bound, all O(1) algorithms, like function (a, b) { return a + b; }, satisfy the condition as well.
But I have to agree all Theta(log(n)) algorithms kinda look like tree algorithms or at least can be abstracted to a tree.
Short Answer:
Just because an algorithm has log(n) as part of its analysis does not mean that a tree is involved. For example, the following is a very simple loop that is O(log(n)):

for (int i = 1; i < n; i = i * 2)
    printf("hello\n");

As you can see, no tree was involved. John also provides a good example of how binary search can be done on a sorted array. Both take O(log(n)) time, and there are other code examples that could be created or referenced. So don't make assumptions based on the asymptotic time complexity; look at the code to know for sure.
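For reference, the binary search mentioned above can be sketched in a few lines; it is O(log(n)) purely because the search interval halves each step, with no tree structure allocated anywhere:

```python
def binary_search(a, target):
    # Classic iterative binary search over a sorted list.
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        elif a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -1
```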
More On Trees:
Just because an algorithm involves "trees" doesn't imply O(logn) either. You need to know the tree type and how the operation affects the tree.
Some Examples:
Example 1)
Inserting or searching the following unbalanced tree would be O(n).
Example 2)
Inserting into or searching the following balanced trees would both be O(log(n)).
Balanced Binary Tree:
Balanced Tree of Degree 3:
Additional Comments
If the trees you are using don't have a way to "balance", then there is a good chance that your operations will be O(n) time, not O(log n). If you use trees that are self-balancing, then inserts normally take more time, as the balancing of the tree normally occurs during the insert phase.

Fast sampling and update of weighted items (data structure like red-black trees?)

What is the appropriate data structure for this task?
I have a set of N items. N is large.
Each item has a positive weight value associated with it.
I would like to do the following, quickly:
inner loop:
Sample an item, according to its weight.
[process...]
Update the weight of K items, where K << N.
When I say sample by weight, this is different than uniform sampling. An item's likelihood is in proportion to its weight. So if there are two items, and one has weight .8 and one has weight .2, then they have likelihood 80% and 20% respectively.
The number of items N remains fixed. Weights are in a bounded range, say [0, 1].
Weights do not always sum to one.
A naive approach takes O(n) time steps to sample.
Is there an O(log(n)) algorithm?
I believe that red-black trees are inappropriate, since they treat each item as having equal weight.
Actually, you can use (modified) RB-trees for this. Moreover, a modification of any balanced tree (not necessarily binary) will do.
The trick is to store additional information in each node - in your case, it could be the total weight of the subtree rooted at the node, or something like that.
When you update (ie. insert/delete) the tree, you follow the algorithm for your favorite tree. As you change the structure, you just recalculate the sums of the nodes (which is an O(1) operation for eg. rotations and B-tree splits and joins). When you change the weight of an item, you update the sums of the node's ancestors.
When you sample, you run a modified version of search. You get the sum of all weights in the tree (i.e. the sum stored at the root) and generate a positive random number lower than this. Then you run the search algorithm, where you go to the left child iff the number (which is the quantile you search for) is less than the left child's sum. If you go to the right child, you subtract the left sum from the quantile.
This description is a little chaotic, but I hope it helps.
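Since N is fixed in this problem, the augmentation described above can be sketched without any rebalancing machinery, using an implicit array-based sum tree (a hypothetical minimal version; the `SumTree` name is made up, and both operations are O(log N)):

```python
import random

class SumTree:
    def __init__(self, weights):
        self.size = 1
        while self.size < len(weights):
            self.size *= 2                  # pad leaf count to a power of two
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = w
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, w):
        # O(log N): rewrite one leaf, then refresh the sums on its root path.
        i += self.size
        self.tree[i] = w
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self):
        # O(log N): draw a quantile, then descend, going left iff the
        # quantile is below the left subtree's sum.
        q = random.random() * self.tree[1]
        i = 1
        while i < self.size:
            left = 2 * i
            if q < self.tree[left]:
                i = left
            else:
                q -= self.tree[left]
                i = left + 1
        return i - self.size

random.seed(1)
st = SumTree([0.8, 0.2])
draws = [st.sample() for _ in range(10_000)]
print(draws.count(0) / len(draws))          # close to 0.8
```

Changing K weights is then K calls to `update`, i.e. O(K log N) total, which fits the K << N setting of the question.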
This is a problem I had to solve for some Monte Carlo simulations. You can see my current 'binary tree' at the link below. I have tried to make it fairly STL-like. My tree has a fixed capacity, which you could get around with a red-black tree approach, which I have talked about trying. It is essentially an implementation of the ideas described by jpacelek.
The method is very robust in coping with inexact floating point numbers, because it almost always sums and compares quantities of the same magnitude, because they are at the same level in the tree.
http://mopssuite.svn.sourceforge.net/viewvc/mopssuite/utils/trunk/include/binary_tree.hpp?view=markup
I like jpalecek's answer. I would add that the simplest way to get your random number would be (1) Generate u from a uniform(0,1) distribution. (2) Multiply u by the sum of all the weights in the tree.
Since N is fixed, you can solve this by using an array, say v, where v[i+1] = v[i] + weight[i], v[1] = weight[0], v[0] = 0, and sampling it by binary search (which is O(log N)) for the lower bound of a random number uniformly distributed between 0 and the sum of the weights.
A naive update of K items is O(K*N); a more thoughtful one is O(N).
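That array-based approach might look like this (a sketch; the example weights are arbitrary and deliberately chosen to be exactly representable as floats):

```python
import bisect
import random
from itertools import accumulate

weights = [0.25, 0.5, 0.125, 0.625]          # need not sum to 1
prefix = [0.0] + list(accumulate(weights))   # v[i+1] = v[i] + weight[i]

def sample():
    # O(log N): binary-search the prefix sums for a uniform draw.
    r = random.random() * prefix[-1]
    return bisect.bisect_right(prefix, r) - 1

def update(changes):
    # Apply K weight changes, then rebuild all prefix sums in one O(N) pass.
    for i, w in changes:
        weights[i] = w
    prefix[1:] = accumulate(weights)

print(prefix)        # [0.0, 0.25, 0.75, 0.875, 1.5]
```

The trade-off versus the tree answers above: O(log N) sampling for free via `bisect`, but updates cost O(N) instead of O(K log N).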
Spoiling yet another interview question by the circle of smug :)
