worst case in MAX-HEAPIFY: "the worst case occurs when the bottom level of the tree is exactly half full"

In CLRS, Third Edition, on page 155, it is given that in MAX-HEAPIFY,
"the worst case occurs when the bottom level of the tree is exactly half full"
I guess the reason is that in this case, Max-Heapify has to "float down" through the left subtree.
But the thing I couldn't get is: why "half full"?
Max-Heapify can also float down if the left subtree has only one leaf. So why not consider this as the worst case?

Read the entire context:
The children's subtrees each have size at most 2n/3 - the worst case occurs when the last row of the tree is exactly half full
Since the running time T(n) is analyzed in terms of the number of elements n in the tree, and the recursion steps into one of the subtrees, we need to find an upper bound on the number of nodes in a subtree relative to n; that yields T(n) = T(max. number of nodes in a subtree) + O(1).
The worst case for the number of nodes in a subtree occurs when the final row is as full as possible on one side and as empty as possible on the other. This is what "half full" means, and it bounds the left subtree's size by 2n/3.
If you're proposing a case with only a few nodes, that's irrelevant, since all base cases can be considered O(1) and ignored.
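For concreteness, this bound is exactly what feeds the recurrence CLRS sets up for MAX-HEAPIFY:

T(n) ≤ T(2n/3) + Θ(1)

which, by case 2 of the master theorem, solves to T(n) = O(log n).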

There's already an accepted answer, but this one is for those people who are still a bit confused (as I was), or for whom something still doesn't click. So here's a slightly longer and more detailed explanation.
Though it might sound redundant, we have to be very clear about the exact definitions, because with attention to the details, chances are that proving things becomes much easier.
From CLRS (section 6.1), a Binary Heap data structure is an array object that can be viewed as a nearly complete binary tree
From Wikipedia, In a complete binary tree, every level (except possibly the last level) is completely filled, and all the nodes in the last level are as far left as possible.
Again, from Wikipedia, A balanced binary tree is a binary tree structure in which the left and right sub-trees of every node differ in height by no more than 1.
Now that we are armed, let's dive in.
So, for the root, the heights of the left and right sub-trees can differ by 1 at most.
Let's consider a tree T, and let the height of the left sub-tree be h+1 and the height of the right sub-tree be h.
What can be the worst-case in MAX_HEAPIFY? The worst-case happens when we end up doing maximum number of comparisons and swaps while trying to maintain the heap property.
When the MAX_HEAPIFY algorithm runs and recursively goes through the longest path, we can consider that a possible worst case, because it ends up doing the maximum number of comparisons and swaps along the longest path.
Well, it seems all of the longest paths happen to be in the left sub-tree (as its height is h+1). But someone might as well ask: why not the right sub-tree? Remember the above definition: all the nodes in the last level have to be as far left as possible.
Now, because we have to cover every possibility that can lead to a worst case, we want as many of the longer paths as possible, and for that we ought to make the left sub-tree FULL. (Why? So that we have more paths to choose from, and can opt for the one that gives the worst-case time among all.)
Since the left sub-tree is FULL and has height h+1, it has 2^(h+1) leaf nodes and, therefore, 2^(h+1) paths from the root. This is the maximum possible number of paths in a tree of height h+1.
Note: Please hold on to it if you are still reading, maybe just for the sake of crystal clarity.
Here's the tree structure in the worst-case situation (the original answer shows this as an image).
The left sub-tree, excluding its last level, is the yellow portion, and the complete right sub-tree is the pink portion; consider that each of them has x nodes.
Notice that both the yellow and the pink portions have a height of h.
Now, from the start, we have considered the left sub-tree as a whole (i.e., the yellow portion plus the last level) to be of height h+1.
Now, if I may ask, how many nodes do we have to add in the last level, i.e., below the yellow portion, to make the left sub-tree completely FULL?
Well, the bottom-most level of the yellow portion has ⌈x/2⌉ nodes (the total number of leaves in a tree or subtree with n nodes is ⌈n/2⌉), so if we add 2 children to each of these leaves, we add ⌈x/2⌉ * 2 ≈ x nodes in total.
With this addition, the left sub-tree of height h+1 (i.e., the yellow portion plus the one last level added) becomes FULL, hence meeting the worst-case criteria.
Since the left sub-tree's last level is now full and the right sub-tree contributes nothing to that level, the bottom level of the whole tree is exactly HALF FULL.
Now someone might as well ask: what if we add more nodes; specifically, what if we add nodes in the right sub-tree? Well, we don't. If we happened to add more nodes, they would go into the right sub-tree (as the left sub-tree is FULL), which, in turn, would tend to balance out the tree. And as the tree gets more balanced, we move towards the best-case scenario, not the worst case.
Final question: how many nodes do we have in total?
Total nodes in the tree: n = x (from the yellow portion) + x (from the pink portion) + x (the last level added below the yellow portion) = 3x.
Can you notice something? As a by-product, the left sub-tree contains at most 2x nodes in total, i.e., 2n/3 nodes (because x = n/3).
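To double-check that arithmetic, here is a small Python sketch (my own, not from CLRS) that builds the worst-case shape for each height h (a FULL left sub-tree of height h+1 and a FULL right sub-tree of height h) and verifies that the left sub-tree never exceeds 2n/3:

for h in range(20):
    left = 2 ** (h + 2) - 1      # full left sub-tree of height h+1 (yellow + last level)
    right = 2 ** (h + 1) - 1     # full right sub-tree of height h (pink)
    n = left + right + 1         # total nodes, including the root
    assert 3 * left <= 2 * n     # left sub-tree has at most 2n/3 nodes
    print(h, left, n, left / n)  # the ratio approaches 2/3 from below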

Related

Is it possible to determine whether an AVL tree is balanced if depth is given instead of height?

In the question, it is given that we can use only depth, not height.
(As we know, for height we can say that if the difference between the height of the left subtree and the height of the right subtree is at most one, then the tree is balanced.)
Using depth, can we find a way to prove whether the tree is balanced or not?
I tried finding relations between trees of different depths. What I got is that if the maximum depth is n, then there must be n nodes whose depth is n-1. But this is just one condition I got, and it is not a sufficient condition.
(You can ignore my approach and try something else, as there is no constraint on how to approach the problem.)
The principle is the same as with height: use the following logic:
For each node do:
Get the maximum among the depths of all the nodes in the left subtree. Default (when no left subtree is present) is the current node's depth.
Get the maximum among the depths of all the nodes in the right subtree. Default (when no right subtree is present) is the current node's depth.
The difference between these two should not be more than 1.
If you implement this with a post-order traversal through the tree, you can keep track of the maximum depths -- needed in the first two steps -- as you traverse the tree.
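A minimal Python sketch of that logic (the Node class and function name are my own; each node is assumed to already know its own depth):

class Node:
    def __init__(self, depth, left=None, right=None):
        self.depth = depth
        self.left = left
        self.right = right

def check(node):
    # Post-order: return (max depth in this subtree, whether it is balanced).
    # When a child is missing, the default maximum is the current node's depth.
    left_max, left_ok = check(node.left) if node.left else (node.depth, True)
    right_max, right_ok = check(node.right) if node.right else (node.depth, True)
    ok = left_ok and right_ok and abs(left_max - right_max) <= 1
    return max(left_max, right_max), ok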

Trying to understand max heapify

I tried watching http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-4-heaps-and-heap-sort/ to understand heaps and heapsort but did not find this clear.
I do not understand the function of max-heapify. It seems like a recursive function, but then somehow it's said to run in logarithmic time because of the height of the tree.
To me this makes no sense. In the worst case, won't it have to reverse every single node? I don't see how this can be done without it touching every single node, repeatedly.
Here's what MAX-HEAPIFY does:
Given a node at index i whose left and right subtrees are max-heaps, MAX-HEAPIFY moves the node at i down the max-heap until it no longer violates the max-heap property (that is, the node is not smaller than its children).
The longest path that a node can take before it is in the proper position is equal to the starting height of the node. Each time the node needs to go down one more level in the tree, the algorithm will choose exactly one branch to take and will never backtrack. If the node being heapified is the root of the max-heap, then the longest path it can take is the height of the tree, or O(log n).
MAX-HEAPIFY moves only one node. If you want to convert an array to a max-heap, you have to ensure that all of the subtrees are max-heaps before moving on to the root. You do this by calling MAX-HEAPIFY on n/2 nodes (leaves always satisfy the max-heap property).
From CLRS:
for i = floor(length(A)/2) downto 1
do MAX-HEAPIFY(A,i)
Since you call MAX-HEAPIFY O(n) times, building the entire heap is O(n log n).*
* As mentioned in the comments, a tighter upper-bound of O(n) can be shown. See Section 6.3 of the 2nd and 3rd editions of CLRS for the analysis. (My 1st edition is packed away, so I wasn't able to verify the section number.)
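In runnable Python, that loop looks like this (a minimal sketch of BUILD-MAX-HEAP with 0-based indexing; it assumes a max_heapify(a, i) helper that sifts element i down, like the Max-Heapify pseudocode quoted in the next answer):

def build_max_heap(a):
    # Heapify every non-leaf node, bottom-up; the leaves are already max-heaps.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i)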
In the worst case, won't it have to reverse every single node?
You don't have to go through every node. The standard max-heapify algorithm is: (taken from Wikipedia)
Max-Heapify (A, i):
left ← 2*i // ← means "assignment"
right ← 2*i + 1
largest ← i
if left ≤ heap_length[A] and A[left] > A[largest] then:
largest ← left
if right ≤ heap_length[A] and A[right] > A[largest] then:
largest ← right
if largest ≠ i then:
swap A[i] and A[largest]
Max-Heapify(A, largest)
You can see that on each recursive call you either stop or continue with the left or the right subtree. In the latter case you decrease the remaining tree height by 1. Since the heap tree is balanced by definition, you do at most log(N) steps.
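For reference, here is a direct, runnable Python translation of that pseudocode, switched to 0-based indexing (children at 2*i + 1 and 2*i + 2):

def max_heapify(a, i, heap_size=None):
    # Sift a[i] down until the subtree rooted at i satisfies the max-heap property.
    if heap_size is None:
        heap_size = len(a)
    left, right, largest = 2 * i + 1, 2 * i + 2, i
    if left < heap_size and a[left] > a[largest]:
        largest = left
    if right < heap_size and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, heap_size)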
Here's an argument for why it's O(N).
Assume it's a full heap, so every non-leaf node has two children. (It still works even if that's not the case, but it's more annoying.)
Put a coin on each node in the tree. Each time we do a swap, we spend one of those coins. (Note that when elements swap in the heap, the coins don't swap with them.) If, after heapifying every node this way, there are any coins left over, that means we've done fewer swaps than there are nodes in the tree, and thus building the heap performs O(N) swaps.
Claim: after MAX-HEAPIFY is done running, a heap will always have at least one path from the root to a leaf with coins on every node of the path.
Proof by induction: For a single-node heap, we don't need to do any swaps, so we don't need to spend any coins. Thus, the one node gets to keep its coin, and we have a full path from root to leaf (of length 1) with coin intact.
Now, assume we have a heap with left and right subheaps, and MAX-HEAPIFY has already run on both. By the inductive hypothesis, each has at least one root-to-leaf path with coins on it, so we have at least two root-to-leaf paths with coins, one for each child. The farthest the root would ever need to go in order to establish the max-heap property is to swap all the way to the bottom of the tree. Let's say it swaps down into the left subtree, all the way to the bottom. For each swap, we need to spend a coin, and we spend it from the node that the root swapped to.
In doing this, we spent all the coins on one of the root-to-leaf paths, but remember we originally had at least two! Therefore, we still have a root-to-leaf path complete with coins after MAX-HEAPIFY runs on the whole heap. Therefore, MAX-HEAPIFY spent fewer coins than there are nodes in the tree. Therefore, the number of swaps is O(N). QED.
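If you want to convince yourself numerically, here is a quick sketch (my own code, not part of the proof) that counts swaps while building max-heaps bottom-up and checks that the count stays below N:

import random

def swaps_to_build(a):
    n, swaps = len(a), 0
    def heapify(i):
        nonlocal swaps
        while True:
            l, r, largest = 2 * i + 1, 2 * i + 2, i
            if l < n and a[l] > a[largest]:
                largest = l
            if r < n and a[r] > a[largest]:
                largest = r
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]  # one swap spends one coin
            swaps += 1
            i = largest
    for i in range(n // 2 - 1, -1, -1):
        heapify(i)
    return swaps

for n in (10, 100, 1000):
    assert swaps_to_build(random.sample(range(n), n)) < n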

Running time for binary search tree

The textbook says the number of split operations is bounded by the height of the tree, which is O(log n).
I don't quite understand why it is bounded by the height of the tree. Can someone explain that?
When you start at the root, and go as far as you can down some path towards the bottom, the maximum number of nodes you can come across is equal to the height of the tree (this should be easy to see and it is, pretty much by definition, the height of the tree).
Now, when you're searching in a binary search tree, you start at the root and, at each step, you look at the current node and either stop, go left, or go right (going left or going right can be considered a split operation). This process involves the same nodes as the one described above (going from the root down some path), so it encounters a number of nodes, and thus performs a number of split operations, no more than the height of the tree.
Also note that the height of the tree is only O(log n) if the tree is balanced (see this page for more).
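A quick sketch of that search loop (my own Python; nodes are assumed to have key, left, and right attributes):

def search(node, key):
    # Each iteration is one "split": stop, go left, or go right.
    # Every step moves one level down, so the number of iterations
    # is at most the height of the tree.
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None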
Most probably, in the textbook you are referring to, the data structure in question is a balanced binary tree with n nodes. Since it is balanced, its height is O(log n). Detailed definitions and brief explanations concerning the height can be found here.

Given a number n, how many balanced binary trees (not binary search trees) are there?

The definition of balanced in this question is:
The number of nodes in its left subtree and the number of nodes in its right subtree are almost equal, which means their difference is not greater than one.
Given n as the total number of nodes, how many such trees are there?
Also, what if we replace the number of nodes with height? Given a height, how many height-balanced trees are there?
The difference is made only by the last level, so you can just find how many nodes are left for that level and consider all the possible placements. With n nodes, the height must be h = floor(log2(n)); the same tree cut at depth k = h - 1 is fully balanced, so it needs m = sum(i=0..k) 2^i = 2^h - 1 nodes, which leaves n - m nodes for the last level. Some definitions of a balanced binary tree force all the last-level nodes to be left-aligned; in that case there is obviously only one possibility. Without that constraint, there are C(2^h, n-m) combinations, because you have to pick which of the 2^h possible slots on the last level are assigned a node, placing n - m nodes in total.
For the height version, with a given height h, you take the sum of C(2^h, i) as i goes from 1 to 2^h. You consider all the possibilities of having 1 node at the last level, then 2, and so on, until you reach the fully balanced tree with all 2^h slots assigned.
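A small Python sketch of both counts (function names are my own, directly following the formulas above):

from math import comb

def count_by_nodes(n):
    h = n.bit_length() - 1      # floor(log2(n)), the height of the tree
    m = 2 ** h - 1              # nodes in the full levels above the last one
    return comb(2 ** h, n - m)  # choose which last-level slots get the rest

def count_by_height(h):
    # Sum over all possible non-empty fillings of the last level.
    return sum(comb(2 ** h, i) for i in range(1, 2 ** h + 1))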

Binary tree visit: get from one leaf to another leaf

Problem: I have a binary tree, all leaves are numbered (from left to right, starting from 0) and no connection exists between them.
I want an algorithm that, given two indices (of 2 distinct leaves), visits the tree starting from the greater leaf (the one with the higher index) and gets to the lower one.
The internal nodes of the tree do not contain any useful information.
I should choose the path based only on the leaves' indices. The path starts from a leaf and terminates on a leaf, and of course I can access a leaf if I know its index (through an array of pointers).
The tree is static, no insertion or deletion of nodes is allowed.
I have developed an algorithm to do it but it really sucks... any ideas?
One option would be to find the least common ancestor of the two nodes, along with the sequence of nodes you should take from each node to get to that ancestor. Here's a sketch of the algorithm:
Starting from each node, walk back up to that node's parent until you reach the root. Count the number of nodes on the path from each node to the root. Let the height of the first node be h1 and the height of the second node be h2.
Let h = min(h1, h2). This is the height of the higher of the two nodes.
Starting from each node, keep following the node's parent pointer until both nodes are at height h. Record the nodes you followed during this step. At this point, both nodes are at the same height.
Until you find a common node, keep marching upwards from each node to its parent. Eventually you will hit their common ancestor. At this point, follow the path from the first node up to this ancestor, then the path from the ancestor down to the second node.
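A Python sketch of those steps (names are my own; it assumes each node has a parent pointer, which is None at the root):

def depth(node):
    # Number of edges from the node up to the root.
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d

def path_between(a, b):
    up_a, up_b = [a], [b]
    da, db = depth(a), depth(b)
    while da > db:                   # lift the deeper node first
        a = a.parent; up_a.append(a); da -= 1
    while db > da:
        b = b.parent; up_b.append(b); db -= 1
    while a is not b:                # march up in lockstep to the ancestor
        a = a.parent; up_a.append(a)
        b = b.parent; up_b.append(b)
    return up_a + up_b[-2::-1]       # up to the LCA, then back down to b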
In the worst case, this takes O(h) time and O(h) space, where h is the height of the tree. For a balanced binary tree this is O(lg n) time and space, which is quite good.
If you're interested in a Much More Hardcore version of this algorithm, consider looking into Tarjan's least common ancestors algorithm, which, with linear preprocessing time, can be used to find the least common ancestor much more rapidly than this.
Hope this helps!
Distance between any two nodes can be calculated with the help of lowest common ancestor:
Dist(n1, n2) = Dist(root, n1) + Dist(root, n2) - 2*Dist(root, lca)
where lca is lowest common ancestor.
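For example (with numbers of my own choosing): if n1 is at depth 4, n2 is at depth 3, and their lowest common ancestor is at depth 1, then Dist(n1, n2) = 4 + 3 - 2*1 = 5.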
See this for more help with this algorithm, and see this video to learn how to calculate the LCA.
