Why is it important that a binary tree be balanced?
Imagine a tree that looks like this:
A
 \
  B
   \
    C
     \
      D
       \
        E
This is a valid binary tree, but now most operations are O(n) instead of O(lg n).
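To make the failure mode concrete, here is a minimal sketch in Python of a naive, non-balancing BST insert (the Node layout is assumed for illustration); feeding it sorted keys reproduces exactly the chain above.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # Plain BST insert: no rotations, no rebalancing.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

root = None
for key in "ABCDE":       # sorted input: the worst case for a plain BST
    root = insert(root, key)
# Every node now hangs off a right child; the height is n - 1,
# so search, insert and delete all degrade to O(n).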
The balance of a binary tree is governed by a property called skewness. The more skewed a tree is, the higher the cost of accessing an element of the tree. Consider this tree:
  1
 / \
2   3
 \   \
  7   4
       \
        5
         \
          6
The above is also a binary tree, but it is right-skewed. It has 7 elements, so an ideal binary tree would need ⌈log₂ 7⌉ = 3 lookups, but here the deepest node (6) sits five levels down, so the worst case is 5 lookups. In this example the skew costs only a constant couple of extra levels, but consider a tree with thousands of nodes: the extra depth from skew can be far more considerable. So it is important to keep the binary tree balanced.
Then again, skewness is a topic of debate: probabilistic analysis shows that the expected height of a random binary search tree with n elements is about 4.3 ln n. So it is really a matter of weighing the cost of balancing against the cost of skewness.
One more interesting thing: computer scientists have even found an advantage in skewness and proposed a skewed data structure called the skew heap.
To ensure O(log n) search time, you need to roughly halve the number of remaining nodes at each branch. For example, if you have a linear tree that never branches on the way from the root to the leaf, then the search time will be linear, as in a linked list.
An extremely unbalanced tree, for example one where all nodes are linked to the left, means you still search through every single node before finding the last one, which defeats the point of a tree entirely and offers no benefit over a linked list. Balancing the tree makes for better search times, O(log n) as opposed to O(n).
As we know, most operations on a binary search tree take time proportional to the height of the tree, so it is desirable to keep the height small. This keeps the search time down to O(log n).
Beyond that, most of the available tree-balancing techniques aim to keep the tree perfectly full, or at least close to perfectly balanced.
At the end of the day, you want simplicity and guarantees from your tree, so go for a proven self-balancing binary tree such as a red-black tree or an AVL tree.
Related
Are there any advantages, or specific cases, where we should prefer using a plain binary search tree rather than an AVL tree?
If you do not care about the time complexity of lookup/insert/remove operations, then a plain BST is good enough. It's easier to implement and requires less space. However, in the worst case its performance is O(n): imagine adding only increasing or decreasing elements to your BST.
On the other hand, if you do care about performance, then you may use an AVL tree, because it is a self-balancing BST: its height is guaranteed to be ~log(n), where n is the number of nodes in the tree. That's why lookup/insert/remove operations are logarithmic. However, an AVL tree requires more space (each node needs to hold its height), and additional logic to re-balance the tree if that property gets violated.
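As an illustrative sketch of that extra bookkeeping (the names are mine, not any particular library's API): each AVL node stores its height, and the balance factor derived from it must stay within {-1, 0, +1}, or the tree rotates to restore the invariant.

class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # the extra field a plain BST node does not need

def height(node):
    return node.height if node else 0

def balance_factor(node):
    # AVL invariant: this value must stay in {-1, 0, +1};
    # anything outside that range triggers a rebalancing rotation.
    return height(node.left) - height(node.right)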
The best case running time for binary search is O(log(n)), if the binary tree is balanced. The worst case would be, if the binary tree is so unbalanced, that it basically represents a linked list. In that case the running time of a binary search would be O(n).
However, what if the tree is only slightly unbalanced, as is the case for this tree:
Best case would still be O(log n) if I am not mistaken. But what would be the worst case?
Typically, when we say something like "the cost of looking up an element in a balanced binary search tree is O(log n)," what we mean is "in the worst case, we have to do O(log n) work in the course of performing a search on a balanced binary search tree." And since we're talking about big-O notation here, the previous statement is meant to be taken about balanced trees in general rather than a specific concrete tree.
If you have a specific BST in mind, you can work out the maximum number of comparisons required to find any element. Just find the deepest node in the tree, then imagine searching for a value that's bigger than that value but smaller than the next value in the tree. That will cause you to walk all the way down the tree as deeply as possible, making the maximum number of comparisons possible (specifically, h + 1 of them, where h is the height of the tree).
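For a concrete tree, that maximum is easy to compute mechanically. A small sketch, assuming nodes expose left/right attributes:

def tree_height(node):
    # Height counted in edges: an empty tree is -1, a single node is 0.
    if node is None:
        return -1
    return 1 + max(tree_height(node.left), tree_height(node.right))

def max_comparisons(root):
    # Walking down as deep as possible costs one comparison per level.
    return tree_height(root) + 1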
To be able to talk about the big-O cost of performing lookups in a tree, you'd need to talk about a family of trees with different numbers of nodes. You could imagine "kinda balanced" trees whose depth is Θ(√n), for example, where lookups would take time O(√n). However, it's uncommon to encounter trees like that in practice, since generally you'd either (1) have a totally imbalanced tree or (2) use some sort of balanced tree that would prevent the height from getting that high.
In a sorted array of n values, the run time of binary search is O(log n) in the worst case. In the best case, the element you are searching for is in the exact middle, and the search finishes in constant time. In the average case, too, the run time is O(log n).
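A minimal sketch of that binary search over a sorted Python list; the interval halves on every probe, which is exactly where the O(log n) worst case comes from.

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid    # best case: the very first probe hits the middle
        elif arr[mid] < target:
            lo = mid + 1  # discard the left half
        else:
            hi = mid - 1  # discard the right half
    return -1             # not present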
In what situation does searching for a term using a binary search tree require time linear in the size of the term vocabulary (say M)? How can we ensure a worst-case time complexity of O(log M)?
A complete binary tree is one in which every level, except possibly the last, is completely filled. The worst-case search performance is the height of the tree, which in this case would be O(lg M), assuming M vocabulary terms in the tree.
One way to ensure this performance would be to use a self-balancing tree, e.g. a red-black tree.
Since binary search is a divide-and-conquer algorithm, we can ensure O(log M) if the tree is balanced, with an equal number of terms under the subtrees of each node. O(log M) basically means that time goes up linearly while M goes up exponentially: if it takes 1 second to search a balanced binary tree of 10 nodes, it'd take 2 seconds to search an equally balanced tree with 100 nodes, 3 seconds for 1,000 nodes, and so on.
But if the binary search tree is extremely unbalanced, to the point where it looks a lot like a linked list, we would have to go through every node, requiring a time complexity that is linear in M.
I wanted to understand how a red-black tree works. I understood the algorithm and how to fix the properties after insert and delete operations, but something isn't clear to me. Why is a red-black tree more balanced than a plain binary search tree? I want to understand the intuition: why do rotations and fixing the tree properties make a red-black tree more balanced?
Thanks.
Suppose you create a plain binary tree by inserting the following items in order: 1, 2, 3, 4, 5, 6, 7, 8, 9. Each new item will always be the largest item in the tree, and so it is inserted as the right-most possible node. Your "tree" would look like this:
1
 \
  2
   \
    3
     .
      .
       .
        9
The rotations performed in a red-black tree (or any type of balanced binary tree) ensure that neither the left nor right subtree of any node is significantly deeper than the other (typically, the difference in height is 0 or 1, but any constant factor would do.) This way, operations whose running time depends on the height h of the tree are always O(lg n), since the rotations maintain the property that h = O(lg n), whereas in the worst case shown above h = O(n).
For a red-black tree in particular, the node coloring is simply a bookkeeping trick that helps in proving that the rotations always maintain h = O(lg n). Different types of balanced binary trees (AVL trees, 2-3 trees, etc.) use different bookkeeping techniques for maintaining the same property.
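For intuition, here is a hedged sketch of the rotation primitive itself (the node layout is assumed, not taken from any specific red-black implementation). A left rotation moves one level of height from the right subtree over to the left:

def rotate_left(x):
    #   x                  y
    #    \                / \
    #     y      ==>     x   c
    #    / \              \
    #   b   c              b
    y = x.right
    x.right = y.left    # subtree b changes parent from y to x
    y.left = x
    return y            # y replaces x as the root of this subtree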
Why is a red-black tree more balanced than a binary search tree?
Because a red-black tree guarantees O(log N) performance for insertion, deletion, and lookups for any order of operations.
Why do rotations and fixing tree properties make a red-black tree more balanced?
Apart from the general properties that any binary search tree must obey, a red-black tree also obeys the following properties:
No node has two red links connected to it.
Every path from root to null link has the same number of black links.
Red links lean left.
Now we want to prove the following proposition :
Proposition. Height of tree is ≤ 2 lg N in the worst case.
Proof.
Since every path from the root to any null link has the same number of black links, and two red links never appear in a row, the height is at most twice the number of black links on any such path, which gives a maximum height of at most 2 lg N in the worst case.
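The counting behind that proof can be made explicit. Writing bh for the number of black links on any root-to-null path, a red-black tree with N nodes satisfies, in LaTeX:

% black links alone form a perfectly balanced tree of height bh
N \ge 2^{\mathrm{bh}} - 1
  \;\Rightarrow\;
\mathrm{bh} \le \lg(N + 1),
\qquad
% no two red links in a row, so at most one red per black on any path
h \le 2\,\mathrm{bh}
  \;\Rightarrow\;
h \le 2 \lg(N + 1).

The first inequality holds because every root-to-null path crosses the same bh black links, so the tree contains a complete binary tree of height bh; the second because red links can at most double the length of any path.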
Although this is quite late: I was recently studying red-black trees and was struggling with the intuition behind why some magical rotation and coloring balances the tree, thinking about the same question as the OP:
why do rotations and fixing tree properties make a red-black tree more balanced
After a few days of "research", I had a eureka moment and decided to write it up in detail. I won't copy-paste it here, as some of the formatting would not survive, so anyone who is interested can check it out on GitHub. I tried to explain it with a lot of images and simulations. Hope it helps someone someday who happens to trip into this thread searching for the same question :)
I have a question about time complexity of tree operations.
It's said (Data Structures, Horowitz et al.) that the time complexity for insertion, deletion, search, and finding min/max, successor, and predecessor nodes in a BST is O(h), while for AVL trees it is O(log n).
I don't exactly understand what the difference is. With h = ⌊log n⌋ + 1 in mind, why do we say O(h) in one place and O(log n) in another?
h is the height of the tree. It is always Ω(log n) [never asymptotically smaller than log n]. It can be very close to log n in a complete tree (there you really get h = log n + 1), but in a tree that has decayed to a chain (where each node has only one child) it is O(n).
For balanced trees, h = O(log n) (and in fact Θ(log n)), so any O(h) algorithm on them is actually O(log n).
The idea of self-balancing search trees (AVL being one of them) is to prevent the cases where the tree decays into a chain (or anything close to it); the balancing invariants guarantee a height of O(log n).
EDIT:
To understand this issue better, consider the following two trees (and forgive me for being a terrible ASCII artist):
    tree 1                  tree 2

            7
           /
          6
         /
        5                      4
       /                     /   \
      4                     2     6
     /                     / \   / \
    3                     1   3 5   7
   /
  2
 /
1
Both are valid binary search trees, and in both, searching for an element (say 1) will be O(h). But in the first, O(h) is actually O(n), while in the second it is O(log n).
O(h) means the complexity depends linearly on the tree height. If the tree is balanced, this bound becomes O(log n) (with n the number of elements). But that is not true for all trees: imagine a very unbalanced binary tree where each node has only a left child. Such a tree is effectively a list, and the number of elements equals the height of the tree, so the complexity of the operations described above becomes O(n) instead of O(log n).
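A short sketch makes the O(h) dependence explicit: search walks one node per level, so the loop below runs at most h + 1 times, whether that height is log n (tree 2 above) or n (tree 1).

def search(node, key):
    steps = 0
    while node is not None:
        steps += 1             # one comparison per level: O(h) in total
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return node, steps         # node is None if the key is absent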