Trying to understand max heapify - algorithm

I tried watching http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-4-heaps-and-heap-sort/ to understand heaps and heapsort but did not find this clear.
I do not understand the function of max-heapify. It seems like a recursive function, but then somehow it's said to run in logarithmic time because of the height of the tree.
To me this makes no sense. In the worst case, won't it have to reverse every single node? I don't see how this can be done without it touching every single node, repeatedly.

Here's what MAX-HEAPIFY does:
Given a node at index i whose left and right subtrees are max-heaps, MAX-HEAPIFY moves the node at i down the max-heap until it no longer violates the max-heap property (that is, the node is not smaller than its children).
The longest path that a node can take before it is in the proper position is equal to the starting height of the node. Each time the node needs to go down one more level in the tree, the algorithm will choose exactly one branch to take and will never backtrack. If the node being heapified is the root of the max-heap, then the longest path it can take is the height of the tree, or O(log n).
MAX-HEAPIFY moves only one node. If you want to convert an array to a max-heap, you have to ensure that all of the subtrees are max-heaps before moving on to the root. You do this by calling MAX-HEAPIFY on n/2 nodes (leaves always satisfy the max-heap property).
From CLRS:
for i = floor(length(A)/2) downto 1
    do MAX-HEAPIFY(A, i)
Since you call MAX-HEAPIFY O(n) times, building the entire heap is O(n log n).*
* As mentioned in the comments, a tighter upper-bound of O(n) can be shown. See Section 6.3 of the 2nd and 3rd editions of CLRS for the analysis. (My 1st edition is packed away, so I wasn't able to verify the section number.)
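For concreteness, here is a rough Python translation of the two routines (a minimal sketch: the 0-based indexing and the names max_heapify/build_max_heap are my own; CLRS uses 1-based arrays):

def max_heapify(a, i, heap_size):
    # Sift a[i] down until the subtree rooted at i is a max-heap.
    # Assumes the subtrees rooted at i's children are already max-heaps.
    # Cost is O(height of node i) = O(log n) in the worst case.
    while True:
        left, right = 2 * i + 1, 2 * i + 2      # 0-based children
        largest = i
        if left < heap_size and a[left] > a[largest]:
            largest = left
        if right < heap_size and a[right] > a[largest]:
            largest = right
        if largest == i:                          # heap property holds; stop
            return
        a[i], a[largest] = a[largest], a[i]       # swap and keep sifting down
        i = largest

def build_max_heap(a):
    # Call max_heapify on every non-leaf node, bottom-up.
    # Indices len(a)//2 .. len(a)-1 are leaves, so they are skipped.
    # The simple bound is O(n log n); a tighter analysis gives O(n).
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))

arr = [3, 9, 2, 1, 4, 5]
build_max_heap(arr)
print(arr)   # [9, 4, 5, 1, 3, 2] -- every parent >= its children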

In the worst case, won't it have to reverse every single node?
You don't have to go through every node. The standard max-heapify algorithm is: (taken from Wikipedia)
Max-Heapify (A, i):
    left ← 2*i                 // ← means "assignment"
    right ← 2*i + 1
    largest ← i
    if left ≤ heap_length[A] and A[left] > A[largest] then:
        largest ← left
    if right ≤ heap_length[A] and A[right] > A[largest] then:
        largest ← right
    if largest ≠ i then:
        swap A[i] and A[largest]
        Max-Heapify(A, largest)
You can see that on each recursive call you either stop or continue into the left or the right subtree. In the latter case the remaining height decreases by 1. Since a heap is a balanced (nearly complete) tree by definition, you do at most log(N) steps.

Here's an argument for why it's O(N).
Assume it's a full heap, so every non-leaf node has two children. (It still works even if that's not the case, but it's more annoying.)
Put a coin on each node in the tree. Each time we do a swap, we're going to spend one of those coins. (Note that when elements swap in the heap, the coins don't swap with them.) If we run MAX-HEAPIFY, and there's any coins left over, that means we've done fewer swaps than there are nodes in the tree, and thus MAX-HEAPIFY performs O(N) swaps.
Claim: after MAX-HEAPIFY is done running, a heap will always have at least one path from the root to a leaf with coins on every node of the path.
Proof by induction: For a single-node heap, we don't need to do any swaps, so we don't need to spend any coins. Thus, the one node gets to keep its coin, and we have a full path from root to leaf (of length 1) with coin intact.
Now, assume we have a heap with left and right subheaps, and MAX-HEAPIFY has already run on both. By inductive hypothesis, each has at least one path from root to leaf with coins on it, so we have at least two root-to-leaf paths with coins, one for each child. The farthest the root would ever need to go in order to establish the MAX-HEAP property is to swap all the way to the bottom of the tree. Let's say it swaps down into the left subtree, and it swaps all the way down to the bottom. For each swap, we need to spend a coin, and we spend the coin sitting on the node that the root's element swapped into.
In doing this, we spent all the coins on one of the root-to-leaf paths, but remember we originally had at least two! Therefore, we still have a root-to-leaf path complete with coins after MAX-HEAPIFY runs on the whole heap. Therefore, MAX-HEAPIFY spent fewer coins than there are nodes in the tree. Therefore, the number of swaps is O(N). QED.
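For completeness, here is the standard summation from CLRS Section 6.3 that reaches the same O(n) bound (sketched in LaTeX; the key fact is that at most ⌈n/2^(h+1)⌉ nodes sit at height h, and sifting one of them down costs O(h)):

\[
\sum_{h=0}^{\lfloor \lg n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil \, O(h)
= O\!\left( n \sum_{h=0}^{\infty} \frac{h}{2^{h}} \right)
= O(2n) = O(n).
\]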

Related

Complexity of a tree labeling algorithm

I have a generic weighted tree (undirected graph without cycles, connected) with n nodes and n-1 edges connecting a node to another one.
My algorithm does the following:
do
    compute the current leaves (nodes with degree 1)
    remove all the leaves and their edges from the tree, labelling each parent with the maximum cost among the edges to its removed leaves
    (for example, if an internal node is connected to two leaves by edges with costs 5 and 6, then after removing the leaves we label the internal node with 6)
until the tree has size <= 2
return the node with the maximum label
Can I say that the complexity is O(n) to compute the leaves and O(n) to remove the leaves and their edges, so that I have O(n) + O(n) = O(n)?
You can easily do this in O(n) with a set implemented as a simple list, queue, or stack (order of processing is unimportant).
Put all the leaves in the set.
In a loop, remove a leaf from the set, delete it and its edge from the graph. Process the label by updating the max of the parent. If the parent is now a leaf, add it to the set and keep going.
When the set is empty you're done, and the node labels are correct.
Initially constructing the set is O(n). Every vertex is placed on the set, removed and its label processed exactly once. That's all constant time. So for n nodes it is O(n) time. So we have O(n) + O(n) = O(n).
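Here is a rough Python sketch of that peeling approach (the (u, v, cost) edge-list representation and the function name are my own assumptions, and I label a parent only with the edge cost, as in the question's example):

from collections import deque

def peel_and_label(n, edges):
    # edges: list of (u, v, cost) for a tree on nodes 0..n-1.
    # Repeatedly strips off the current leaves, labelling each remaining
    # neighbour ("parent") with the max cost of the edges to its removed leaves.
    # Every node is queued and removed exactly once, so this is O(n).
    adj = {u: {} for u in range(n)}
    for u, v, c in edges:
        adj[u][v] = c
        adj[v][u] = c
    degree = {u: len(adj[u]) for u in range(n)}
    label = {u: 0 for u in range(n)}
    queue = deque(u for u in range(n) if degree[u] == 1)
    removed = set()
    while queue:
        leaf = queue.popleft()
        removed.add(leaf)
        for parent, cost in adj[leaf].items():
            if parent in removed:
                continue
            label[parent] = max(label[parent], cost)   # update the parent's max
            degree[parent] -= 1
            if degree[parent] == 1:                    # parent just became a leaf
                queue.append(parent)
    return label, max(label, key=label.get)            # labels and the max-label node

labels, best = peel_and_label(4, [(0, 1, 5), (0, 2, 6), (0, 3, 2)])
# the centre of this small star (node 0) ends up labelled 6, analogous to the question's example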
It's certainly possible to do this process in O(n), but whether your algorithm actually does depends on how each step is implemented.
If either "compute the actual leaves" or "remove all the leaves and their edges" loops over the entire tree, that step would take O(n).
And both the above steps will be repeated O(n) times in the worst case (if the tree is greatly unbalanced), so, in total, it could take O(n^2).
To do this in O(n), you could have each node point to its parent so you can remove the leaf in constant time and maintain a collection of leaves so you always have the leaves, rather than having to calculate them - this would lead to O(n) running time.
Since your tree is arbitrary, it can also be a linked list (a simple path). In that case each iteration only eliminates the endpoints, so you need on the order of n iterations, and each iteration spends O(n) finding the leaves.
So your algorithm is actually O(n^2).
Here is a better algorithm that does it in O(n) for any tree:
deleteLeaf(Node k) {
    max = value(k)                 // a leaf simply returns its own value/label
    for each child of k do
        value = deleteLeaf(child)
        if (value > max)
            max = value
        delete(child)
    return max
}
deleteLeaf(root)                   // or deleteLeaf(root.child)

k successive calls to tree successor in bst

Prove that k successive calls to tree successor take O(k+h) time. Since each node is visited at most twice, the maximum number of nodes visited should be 2k, so the time complexity should be O(k). I don't get where the factor of O(h) comes from. Is it because of nodes which are visited but are not the successor? I am not able to explain to myself how the factor O(h) is involved in the whole process.
PS: I know this question already exists, but I was not able to understand the solution.
The plus in the O(k+h) notation is an alternative way of writing O(max(k, h)).
Finding a successor once could take up to O(h) time. To see why this is true, consider looking for the successor of the rightmost node of the root's left subtree: that node has no right child, so you have to walk all the way back up the tree (its successor is the root), which can take on the order of h steps. That's why you need to include h in the calculation: if k is small compared to h, then h would dominate the timing of the algorithm.
The point of the exercise is to prove that the time of calling the successor k times in a row is not O(k*h), as one could imagine after observing that a single call could take up to O(h). You prove it by showing that the cost of traversing the height of the tree is distributed among the k calls, as you did by noting that each node is visited at most twice.
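For reference, here is the usual tree-successor routine the exercise is about, sketched in Python (the left/right/parent field names are my assumption):

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def tree_minimum(x):
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    # Next key in sorted order, or None if x is the maximum.
    # Either walk down the left spine of x's right subtree,
    # or walk up until we leave a left child behind; both are O(h).
    if x.right is not None:
        return tree_minimum(x.right)
    p = x.parent
    while p is not None and x is p.right:
        x, p = p, p.parent
    return p

Either branch walks in a single direction along one root-to-leaf path, which is where the O(h) worst case for one call comes from.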

How to prove that finding a successor n-1 times in the BST from the minimum node is O(n)?

How can we prove that finding a successor n-1 times in a BST, starting from the minimum node, is O(n)?
The question is about creating sorted order by:
1) Let the node = the minimum node of the BST.
2) From that node, repeatedly call tree-successor.
I was told that the result is O(n), but I do not understand why and do not know how to prove it.
Shouldn't it be O(n log n) instead? Step 1 is O(log n), and step 2 is also O(log n) but is called n-1 times, so it should be O(n log n).
Please clarify my doubt. Thank you! :)
You are correct that any individual operation might take O(log n) time, so if you perform those operations n times, you should get a runtime of O(n log n). This bound is correct, but it's not tight. The actual runtime is Θ(n).
One way to see this is to look at any individual edge in the tree. How many times will you visit each edge if you start at the leftmost node and repeatedly perform a successor query? If you look closely at how the operations work, you'll discover that every edge is visited exactly twice: once downward and once upward. Since all the work done is done traversing up and down edges, this means that the total amount of work done is proportional to twice the number of edges. In any tree, the number of edges is the number of nodes minus one, and so the total work done is Θ(n).
To formalize this as a proof, try showing that you never descend down the same edge twice and that when you ascend up an edge, you never descend down that edge again. Once you've done this, the conclusion that the runtime is Θ(n) follows from the above logic.
Hope this helps!
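If you want to convince yourself numerically, here is a small test harness (entirely my own sketch, not from the question) that counts every parent/child edge crossing while walking a random BST in sorted order via successor. Including the initial descent to the minimum and the final call that returns None, the count comes out to exactly 2(n-1), twice the number of edges:

import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def insert(root, key):
    if root is None:
        return Node(key)
    cur = root
    while True:
        side = 'left' if key < cur.key else 'right'
        child = getattr(cur, side)
        if child is None:
            child = Node(key)
            child.parent = cur
            setattr(cur, side, child)
            return root
        cur = child

steps = 0                              # counts every parent<->child edge crossing

def minimum(x):
    global steps
    while x.left is not None:
        x = x.left
        steps += 1
    return x

def successor(x):
    global steps
    if x.right is not None:
        steps += 1                     # step down into the right subtree
        return minimum(x.right)        # minimum() counts the left-spine steps
    p = x.parent
    if p is not None:
        steps += 1                     # step up to the parent
    while p is not None and x is p.right:
        x, p = p, p.parent
        if p is not None:
            steps += 1                 # keep stepping up
    return p

keys = random.sample(range(1000), 50)
root = None
for k in keys:
    root = insert(root, k)

node, visited = minimum(root), []
while node is not None:
    visited.append(node.key)
    node = successor(node)

assert visited == sorted(keys)
print(steps, 2 * (len(keys) - 1))      # both 98: each of the 49 edges is crossed exactly twice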
I wanted to post this as a comment on templatetypedef's answer, but it's too long.
His answer is right: the easiest way to see that this is linear is that every edge is visited exactly twice, and the number of edges in a tree is always one less than the number of nodes (because every node has one parent, except the root!).
The issue is that the way he phrases the formal proof uses words that seem to imply contradiction as the way to go. In general, mathematicians frown on using contradiction because it often produces proofs with superfluous content. For instance:
Proof that 2 + 2 != 5:
Assume for contradiction that 2 + 2 = 5 (<- Remove this line)
Well 2 + 2 = 4
And 4 != 5
Contradiction! (<- Remove this line)
Contradiction tends to be verbose, and sometimes it can even obfuscate the idea behind the proof! There are times when contradiction seems pretty much necessary, but it's relatively rare and that's a separate discussion.
In this case, I don't see a proof by contradiction being any easier than a direct proof. On the other hand, regardless of proof technique, this proof is pretty ugly to do formally. Here's an attempt:
1) The succ(n) algorithm traverses one of two paths
In the first case every edge is visited on the simple path from a node to the leftmost node of its right subtree
In the other case, the node n has no right child, in which case we go up its ancestors p_1, p_2, p_3, ..., p_k such that p_(k-1) is the first ancestor which is the left child of its parent. All of those edges are visited in that simple path
We want to show that an arbitrary edge is traversed in precisely two succ() calls, once for the first case of succ() and once for the second case of succ(). Well, this is true for every edge other than the rightmost branch, but you can handle those edge cases separately. Alternatively we could prove the simpler argument where we return to the root after visiting the last element
This is two-fold because for a given edge e we have to find the n1 and n2 such that succ(n1) traverses e and succ(n2) also traverses e, as well as prove that every other succ() generates a path which does not include e.
2) First we actually prove that for each type of path that succ() visits, no two paths overlap (i.e. if succ(n) and succ(n') both traverse paths of the same type, those paths share no edges)
In the first case, the simple path is precisely defined as follows. Start at node n and go one edge to the right to r. Then traverse the left branch of the subtree rooted at r. Now consider any other such path that starts at some other node n' (note, we don't assume that n != n'). It must go right one node to r'. Then it traverses the leftmost branch of the subtree rooted at r'. If the paths overlap then pick one of the edges that overlap. If it's (n,r) = (n',r') then we have n = n' and so it's the same path. If it's some e = e' in both leftmost branches then you can show, again, that n = n' (you can trace the leftmost branches and show that every edge is the same, then finally reach the conclusion that r = r' => n = n' because for a tree the parent is unique. You'll see this tracing argument below). Thus we know that for any n and n', if their paths overlap, they are actually the same node! The contrapositive says this: if they are different nodes, then their paths don't overlap. That's exactly what we want (and the contrapositive is always equally true to the original statement).
In the second case we define the simple path starting at node n and go up the ancestors p_1, p_2, ..., p_k = g until we reach the first node p_k such that p_(k-1) is to the left of p_k. Consider some other path of the same type that starts at node n' where n != n'. Similarly it visits p_1', p_2', ..., p_k' = g'. Because it's a tree, none of those ancestors are the same as the first set. Because none of the nodes on the two paths are the same, none of the edges can be the same and hence succ(n) and succ(n') do not traverse any of the same edges
3) Now we just need to show that at least one path of each type exists for a given edge. Well take any such edge e = (c,p) (note here I am ignoring the special edges on the rightmost branch which are technically only visited once and I am also ignoring the special edges on the leftmost branch which are technically visited once by find_min() and then once by succ() calls)
If it's from a left child c to its parent p then succ(c) will cover the second type of path. To find the other path, keep going up p's ancestors p_1, p_2, ..., p_k such that p_(k-1) is to the right of p_k. succ(p_k) will traverse a path containing e by definition (since e is on the leftmost branch of the subtree of p_(k-1) which is p_k's right child).
A similar argument holds for symmetric case when c is the right child of p
To summarize the proof we've shown that succ() generates two types of path. For each type of path, all of the paths of those types do not overlap. Furthermore, for any edge we have at least one of each of those types of paths. Since we call succ() on every node we can finally conclude that each edge is traversed twice (and hence the algorithm is Theta(n)).
Despite how long this proof was, it isn't actually complete (even ignoring the points when I explicitly said I was skipping details!). There are cases where I said something exists without proving it exists. You can figure out those details if you want and it is actually really satisfying to get it completely right (in my opinion at least. Maybe when you're a genius you'll find it tedious, heh)
Hope this helped. Let me know if you want me to clarify some steps

Split a tree into equal parts by deleting an edge

I am looking for an algorithm to split a tree with N nodes (where the maximum degree of each node is 3) by removing one edge from it, so that the two resulting trees each have as close to N/2 nodes as possible. How do I find the edge that is "the most centered"?
The tree comes as an input from a previous stage of the algorithm and is input as a graph - so it's not balanced nor is it clear which node is the root.
My idea is to find the longest path in the tree and then select the edge in the middle of the longest path. Does it work?
Ideally, I am looking for a solution that can guarantee that neither of the trees has more than 2N/3 nodes.
Thanks for your answers.
I don't believe that your initial algorithm works for the reason I mentioned in the comments. However, I think that you can solve this in O(n) time and space using a modified DFS.
Begin by walking the graph to count how many total nodes there are; call this n. Now, choose an arbitrary node and root the tree at it. We will now recursively explore the tree starting from the root and will compute for each subtree how many nodes are in each subtree. This can be done using a simple recursion:
If the current node is null, return 0.
Otherwise:
For each child, compute the number of nodes in the subtree rooted at that child.
Return 1 + the total number of nodes in all child subtrees
At this point, we know for each edge what split we will get by removing that edge, since if the subtree below that edge has k nodes in it, the split will be (k, n - k). You can thus find the best cut to make by iterating across all nodes and looking for the one that balances (k, n - k) most evenly.
Counting the nodes takes O(n) time, and running the recursion visits each node and edge at most O(1) times, so that takes O(n) time as well. Finding the best cut takes an additional O(n) time, for a net runtime of O(n). Since we need to store the subtree node counts, we need O(n) memory as well.
Hope this helps!
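Here is a rough Python sketch of that idea (the adjacency-list input format and names are my own): it roots the tree at node 0, computes subtree sizes with an iterative DFS, and then picks the edge whose removal gives the most balanced split.

def best_split_edge(n, edges):
    # edges: list of (u, v) pairs describing a tree on nodes 0..n-1 (n >= 2).
    # Returns (edge, size_of_smaller_side) for the most balanced cut.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    size = [1] * n
    parent = [-1] * n
    visited = [False] * n
    visited[0] = True
    # iterative DFS from node 0, so large trees don't hit the recursion limit
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if not visited[v]:
                visited[v] = True
                parent[v] = u
                stack.append(v)
    # accumulate subtree sizes bottom-up (children appear before parents in reversed order)
    for u in reversed(order):
        if parent[u] != -1:
            size[parent[u]] += size[u]

    # removing edge (parent[u], u) splits the tree into size[u] and n - size[u] nodes
    best = min((u for u in range(n) if parent[u] != -1),
               key=lambda u: abs(n - 2 * size[u]))
    return (parent[best], best), min(size[best], n - size[best])

# A path 0-1-2-3-4-5: cutting the middle edge gives a 3/3 split.
print(best_split_edge(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]))   # ((2, 3), 3)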
If you look at my answer to Divide-And-Conquer Algorithm for Trees, you can see it finds a node that partitions the tree into two nearly equal-size trees (a bottom-up algorithm); now you just need to choose one of that node's edges to do what you want.
Your current approach does not work. Assume you have a complete binary tree, and now attach a path of length 3*log n to one of its leaves (call it the bad leaf). The longest path in the tree now runs from one of the other leaves to the end of the path attached to the bad leaf, so the middle edge of that longest path lies on the attached path (in fact past the bad leaf). If you partition based on this edge, you get one part of size O(log n) and another part of size O(n).

worst case in MAX-HEAPIFY: "the worst case occurs when the bottom level of the tree is exactly half full"

In CLRS, third Edition, on page 155, it is given that in MAX-HEAPIFY,
"the worst case occurs when the bottom level of the tree is exactly half full"
I guess the reason is that in this case, Max-Heapify has to "float down" through the left subtree.
But the thing I couldn't get is "why half full" ?
Max-Heapify can also float down if left subtree has only one leaf. So why not consider this as the worst case ?
Read the entire context:
The children's subtrees each have size at most 2n/3 - the worst case occurs when the last row of the tree is exactly half full
Since the running time T(n) is analysed in terms of the number of elements in the tree (n), and the recursion steps into one of the subtrees, we need an upper bound on the number of nodes in a subtree, relative to n; that yields the recurrence T(n) = T(max. number of nodes in a subtree) + O(1).
The worst case for the number of nodes in a subtree is when the final row is as full as possible on one side and as empty as possible on the other; this is what "half full" means. In that case the left subtree's size is bounded by 2n/3.
If you're proposing a case with only a few nodes, then that's irrelevant, since all base cases can be considered O(1) and ignored.
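To make the 2n/3 figure concrete (this derivation is mine, but it is consistent with the statement above): suppose the left subtree is a perfect tree of height h whose bottom level is the heap's half-full last row, while the right subtree is a perfect tree of height h-1. Then

\[
n_\text{left} = 2^{h+1} - 1, \qquad
n_\text{right} = 2^{h} - 1, \qquad
n = 1 + n_\text{left} + n_\text{right} = 3 \cdot 2^{h} - 1,
\]
\[
\frac{n_\text{left}}{n} = \frac{2 \cdot 2^{h} - 1}{3 \cdot 2^{h} - 1} \le \frac{2}{3},
\qquad\text{so}\qquad
T(n) \le T\!\left(\tfrac{2n}{3}\right) + \Theta(1) = O(\log n).
\]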
There's already an accepted answer, but this one is for people who are still a bit confused (as I was) or for whom something still doesn't click. So here's a slightly longer and more detailed explanation.
Though it might sound redundant, we have to be very clear about the exact definitions, because attention to detail often makes proving things much easier.
From CLRS (section 6.1), a Binary Heap data structure is an array object that can be viewed as a nearly complete binary tree
From Wikipedia, In a complete binary tree, every level (except possibly the last level) is completely filled, and all the nodes in the last level are as far left as possible.
Again, from Wikipedia, A balanced binary tree is a binary tree structure in which the left and right sub-trees of every node differ in height by no more than 1.
Now that we are armed, let's dive in.
So, at the root, the heights of the left and right sub-trees can differ by at most 1.
Let's consider a tree T where the left sub-tree has height h+1 and the right sub-tree has height h.
What can the worst case in MAX_HEAPIFY be? The worst case happens when we end up doing the maximum number of comparisons and swaps while trying to maintain the heap property.
If MAX_HEAPIFY recursively follows the longest path, we have a candidate worst case, because it ends up doing the maximum number of comparisons and swaps along that path.
Well, all of the longest paths happen to be in the left sub-tree (as its height is h+1). But someone might well ask: why not the right sub-tree? Remember the definition above: all the nodes in the last level have to be as far left as possible.
Because we have to cover every possibility that can lead to a worst case, we want as many of these longest paths as possible, and for that we make the left sub-tree FULL (why? so that we get the largest number of candidate paths from which the worst case can arise).
Since the left sub-tree is full and has height h+1, it has 2^(h+1) leaf nodes and therefore 2^(h+1) root-to-leaf paths, which is the maximum possible for a sub-tree of height h+1.
Note: Please hold on to it if you are still reading, maybe just for the sake of crystal clarity.
Picture the tree structure in the worst-case situation as two portions, a left (yellow) one and a right (pink) one, each containing x nodes. The pink portion is the complete right sub-tree, and the yellow portion is the left sub-tree excluding its last level.
Notice that both the left (yellow) and the right (pink) sub-trees have a height of h.
Now, from the start, we have considered the left subtree to be of height h+1 as a whole (i.e. including the yellow portion and the last level).
Now, if I may ask, how many nodes do we have to add in the last level i.e. below the yellow portion to make the left sub-tree completely FULL?
Well, the bottom-most layer of the yellow portion has ⌈x/2⌉ nodes (i.e. Total number of leaves in a tree/subtree having n nodes = ⌈n/2⌉; for proof visit this link), and now if we add 2 children to each of these nodes or leaves => total x (≈x) nodes have been added (How? ⌈x/2⌉ leaves * 2 ≈ x nodes).
With this addition, we make the left sub-tree of height h+1 (i.e. the yellow portion with height h and the one last level added) FULL, hence meeting the worst-case criteria.
Since the left sub-tree is FULL, the bottom level of the whole tree is exactly HALF FULL.
Now someone might as well ask: What if we add more nodes, or, specifically, what if we add nodes in the right sub-tree? Well, we don't. And that's because now if we happen to add more nodes, the nodes will be added in the right sub-tree (as the left sub-tree is FULL), which, in turn, will tend to balance out the tree more. Now as the tree is starting to get more balanced, we are tending to move towards the best-case scenario and not the worst-case.
Final question: how many nodes do we have in total?
Total nodes in the tree: n = x (the yellow portion) + x (the pink portion) + x (the last level added below the yellow portion) = 3x.
Can you notice something? As a by-product, the left sub-tree contains at most 2x nodes in total, i.e. 2n/3 nodes (because x = n/3).
