I'm looking at the implementation of UnionFind with union by rank and path compression from http://en.wikipedia.org/wiki/Disjoint-set_data_structure#Disjoint-set_forests (it's pretty much the same pseudo-code as in CLRS), and I don't understand why path compression doesn't change rank. If we call find on the endpoint of the longest path from the root, the rank should go down; if it doesn't, the next union operation will choose an incorrect root.
"Rank" is one of those horribly overloaded terms in theoretical computer science. As Wikipedia notes, in the context of that disjoint set data structure with path compression, rank is not an intrinsic property of the current topology of the forest -- there's just no good way to keep the height of each node up to date. As defined by the sequence of unions, however, rank is useful in proving the running time bound involving the inverse Ackermann function.
Rank is not the actual depth of the tree; rather, it is an upper bound on it. As such, a find operation is allowed to leave the rank out of sync with the depth.
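To make the "upper bound" point concrete, here is a minimal sketch in Java of union by rank with path compression. The class name `RankedForest` and the `height` helper are hypothetical names introduced for illustration; they are not taken from Wikipedia or CLRS, and `height` is a brute-force check that a real implementation would not carry.

```java
// Illustrative sketch only: union by rank plus path compression.
class RankedForest {
    final int[] parent;
    final int[] rank;

    RankedForest(int n) {
        parent = new int[n];
        rank = new int[n];            // every node starts as a root of rank 0
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    // Path compression: re-points nodes on the path at the root.
    // Note that rank[] is never written here.
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    // Union by rank: the root with the smaller rank goes under the other.
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        if (rank[ra] == rank[rb]) rank[ra]++;  // only ties increase a rank
    }

    // Hypothetical helper for checking: true height of the tree under root,
    // computed by walking parent pointers without compressing anything.
    int height(int root) {
        int best = 0;
        for (int v = 0; v < parent.length; v++) {
            int d = 0, u = v;
            while (parent[u] != u) { u = parent[u]; d++; }
            if (u == root) best = Math.max(best, d);
        }
        return best;
    }
}
```

With eight elements merged pairwise (0-1, 2-3, then 0-2; 4-5, 6-7, then 4-6; finally 0-4), root 0 ends with rank 3 and true height 3. Calling `find(7)` afterwards reduces the true height to 2 while `rank[0]` stays 3, so the rank is still a valid upper bound and later unions still make a safe choice.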
I was studying the disjoint-set data structure. I understand how rank helps us build shallow trees and how path compression decreases tree height.
I can find the below code in most articles or study materials for path compression.
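// Returns the root of i's tree and re-points i directly at that root.
int find(subset[] subsets, int i) {
    if (subsets[i].parent != i) {
        // Path compression: recurse to the root, then attach i to it.
        subsets[i].parent = find(subsets, subsets[i].parent);
    }
    return subsets[i].parent;
}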
What I am wondering about is the ranks. When we do path compression, the rank of the root should also change, but we do not update it. Could you please explain whether I am missing something?
I have checked with an online tool, and it looks like the structure would not work as expected if we don't update the rank. I think it's more about path compression being an approximation. I suspect that in the worst case it may be possible to create a dense tree.
No, you don't update the ranks in path compression. In fact, the word "rank" is used instead of "height" precisely because the rank doesn't accurately reflect the height: it isn't updated during path compression.
Rank is a worst-case height that is accurate enough to provide the promised complexities. Whenever I write a union-find structure, though, I use the size of the subtree instead of rank. It works just as well and is also useful for other things.
The below images show a Union find Problem solved by rank with path compression. If you don't understand my handwriting then read the description below to understand what I have done.
Description:
First I have done Union of 1 and 2. Second, Union(3,4). Third Union(5,6) and so on by comparing their ranks while also doing path compression when finding the representative element of the tree whose union is to be done.
My Doubt:
If you look at the final tree in the image, you'll see that it is completely flat (by flat I mean that every node hangs directly below the root). Will path compression always result in a flat tree, no matter how many elements are present?
Also, how can we find the time complexity of union-find with path compression?
It is possible to build inverse-trees of unlimited depth. E.g., if you happen to always choose the roots as your Union() arguments, then no paths will be compressed, and you can build your tree as tall as you like.
If (as your written notes suggest) you use rank to choose the root resulting from your union, then your trees will be balanced, so you will need Ω(2^n) operations to generate a tree of depth n. (Specifically: to build a tree of depth n, first build two trees of depth n-1 and take the Union() of their roots.)
The amortized time complexity of union-find with union by rank and path compression is known to be O(α(n)) per operation, where α is the inverse Ackermann function.
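The doubling construction described above (two trees of depth n-1 joined at their roots) can be sketched in Java as follows. The names `TallTreeBuilder`, `linkRoots`, `build`, and `height` are hypothetical and exist only for this illustration. Because only roots are ever linked, no path is ever compressed.

```java
// Illustrative sketch only: builds a tree of a requested depth using
// union by rank applied to roots, and counts how many nodes that takes.
class TallTreeBuilder {
    final int[] parent;
    final int[] rank;
    int used = 0;                       // nodes allocated so far

    TallTreeBuilder(int maxDepth) {
        int n = 1 << maxDepth;          // 2^maxDepth nodes are required
        parent = new int[n];
        rank = new int[n];
    }

    // Links two roots by rank and returns the surviving root.
    int linkRoots(int a, int b) {
        if (rank[a] < rank[b]) { int t = a; a = b; b = t; }
        parent[b] = a;
        if (rank[a] == rank[b]) rank[a]++;
        return a;
    }

    // Builds a tree of the given depth from fresh nodes; returns its root.
    int build(int depth) {
        if (depth == 0) {
            int v = used++;
            parent[v] = v;              // a single node is a tree of depth 0
            return v;
        }
        int left = build(depth - 1);
        int right = build(depth - 1);
        return linkRoots(left, right);
    }

    // Longest parent-pointer path from any allocated node to its root.
    int height() {
        int best = 0;
        for (int v = 0; v < used; v++) {
            int d = 0, u = v;
            while (parent[u] != u) { u = parent[u]; d++; }
            best = Math.max(best, d);
        }
        return best;
    }
}
```

Calling `build(4)` on a fresh `TallTreeBuilder(4)` allocates 16 nodes and yields a tree of height 4 whose root has rank 4, which matches the 2^n growth stated above.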
I am talking about the union-find-disjoint data structure. There are multiple resources on the internet about how to implement this. So far, I have learnt of two optimization techniques for unions. The first one is 'balancing' the tree by a variable Rank, which says how deep the deepest node is, and therefore is the upper bound on find(). The second optimization is: setting an object's parent to be the head node, while calling find() (the setting also includes the object's parents, so it becomes a cascade of optimizations).
However, when implementations use the two of them at once, they usually merge the two together without much thought. Specifically, GeeksforGeeks (just as an example, nothing personal) does this. Wouldn't this lead to the ranks getting "corrupted" and O(log n) complexity?
For example, if I have a long line of nodes (5 to 4 to 3 to 2 to 1 to 0, which is the head) and I call find() on 2, the rank stays 5 even though it should be 3.
In such implementations, ranks are still upper bounds on the heights of the trees.
They may indeed become inexact upper bounds.
The log* proof does not seem to rely on exactness of that upper bound.
In Tarjan's 1975 article "Efficiency of a Good But Not Linear Set Union Algorithm" linked at the bottom of the above page, he seems to use union-by-size instead of union-by-rank.
The size (number of vertices), unlike the exact rank, is easy to maintain in O(1) operations per union.
Rank is not a strict measure of depth. From Wikipedia:
"the term rank is used instead of depth since it stops being equal to the depth if path compression (...) is also used"
Note also that the example you give cannot occur. There is no order of unions that will result in a string of single nodes when using union by rank. In fact, a tree with rank r will have at least 2^r nodes (easily proved by induction). It is also unclear to me how you arrive at the conclusion that a rank that is "too large" will lead to logarithmic complexity.
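For completeness, here is a sketch of the induction behind the 2^r bound, written in LaTeX. The notation s(x) for the number of nodes in the tree rooted at x is introduced here for convenience and does not come from the quoted sources.

```latex
\textbf{Claim.} With union by rank, every root $x$ satisfies
$s(x) \ge 2^{\operatorname{rank}(x)}$.

\textbf{Base case.} A fresh node has rank $0$ and $s(x) = 1 = 2^{0}$.

\textbf{Inductive step.} Suppose roots $x$ and $y$ are linked, with $y$
placed under $x$.
\begin{itemize}
  \item If $\operatorname{rank}(x) > \operatorname{rank}(y)$, the rank of $x$
        is unchanged and $s(x)$ only grows, so the bound still holds.
  \item If $\operatorname{rank}(x) = \operatorname{rank}(y) = r$, the new rank
        of $x$ is $r + 1$ and
        \[
          s_{\text{new}}(x) = s(x) + s(y) \ge 2^{r} + 2^{r} = 2^{r+1}.
        \]
\end{itemize}
Path compression changes neither ranks nor the number of nodes in a tree,
so it cannot break the bound. Hence $\operatorname{rank}(x) \le \log_2 n$
for a forest of $n$ nodes.
```

In particular, a root of rank 5 needs at least 2^5 = 32 nodes in its tree, so a six-node chain with rank 5 cannot arise.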
I am implementing the quick-union algorithm for a union/find structure. In the implementation given at the "Algorithms in Java" book site, the Princeton implementation fails to maintain the size invariant of the tree while implementing path compression (in the find() method). Shouldn't this adversely affect the algorithm, or am I missing something? Also, if I am right, how would we go about modifying the size array?
Unless I'm mistaken, I think that this code is indeed maintaining the invariant that the root of each tree stores the number of nodes in its subtree.
When the data structure is created, note that the constructor sets sz[i] = 1 for each node in the forest. This means that the values start off correct.
During a union operation, the data structure correctly adjusts the size of the root of the merged trees. Therefore, after any union operation, all the tree roots have the correct sizes.
While you are correct that the sizes aren't updated during path compression in the find step, there is no reason for the data structure to change sizes here. Path compression just reduces the length of the paths from nodes in some tree up to the root of the tree. It doesn't change the number of nodes stored in that tree. Accordingly, the size information at the root of the tree undergoing path compression does not need to change. Although some internal subtrees might lose some children as they are reparented higher up in the tree, this is irrelevant because the union/find structure only needs to maintain size information at the roots of its trees, not at internal nodes.
Overall, this means that the data structure does correctly store size information. There is no adverse impact on runtime, nor is there a need to correct anything.
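The argument above can be checked with a small sketch in Java. This is not the Princeton code; the class name `SizedForest` and the `countNodes` helper are hypothetical, and only the `sz` array name follows the answer's description.

```java
// Illustrative sketch only: weighted quick-union with path compression.
class SizedForest {
    final int[] parent;
    final int[] sz;                     // meaningful only at roots

    SizedForest(int n) {
        parent = new int[n];
        sz = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; sz[i] = 1; }
    }

    // Two-pass find: locate the root, then re-point the path at it.
    // sz[] is not touched: compression moves nodes within one tree only.
    int find(int p) {
        int root = p;
        while (root != parent[root]) root = parent[root];
        while (p != root) {
            int next = parent[p];
            parent[p] = root;
            p = next;
        }
        return root;
    }

    // The smaller tree goes under the larger; the root size is updated.
    void union(int p, int q) {
        int rp = find(p), rq = find(q);
        if (rp == rq) return;
        if (sz[rp] < sz[rq]) { int t = rp; rp = rq; rq = t; }
        parent[rq] = rp;
        sz[rp] += sz[rq];
    }

    // Hypothetical checking helper: recounts the nodes under root by
    // walking parent pointers without modifying the forest.
    int countNodes(int root) {
        int count = 0;
        for (int v = 0; v < parent.length; v++) {
            int u = v;
            while (u != parent[u]) u = parent[u];
            if (u == root) count++;
        }
        return count;
    }
}
```

After `union(0, 1)`, `union(2, 3)`, and `union(1, 3)` on six elements, root 0 has `sz[0] == 4`. A later `find(3)` re-points node 3 directly at 0; `sz[0]` still equals the recounted size of 4, while `sz[2]` keeps its stale value of 2, which is harmless because node 2 is no longer a root.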
Hope this helps!
I was reading about the famous union-find problem, and the book was saying: "either the find or the union will take O(n) time, and the other one will take O(1)...."
But what about using bit strings to represent the set?
Then both union (using bitwise OR) and find (iterating through the list of sets and checking whether the corresponding bit is 1) would take O(1).
What is wrong with that logic?
Both operations can be done in amortized time of O(Alpha(n)), where Alpha is an inverse of the Ackermann function (it grows very slowly). You have to represent the problem as a forest. Choose a representative for each subgraph (a tree root); the union operation merges two trees by hanging the shorter tree below the root of the taller one. The find operation simply traverses to the root and shortens the traversed path (it hangs the searched element, and possibly all traversed elements, directly below the root).
With a bitfield:
Union is going to be O(n). You assume that you can do a simple bitwise OR on two native integers, but if n is large you obviously cannot use built-in types.
Find is going to be O(1). You don't have to iterate; you know the exact location of the bit.
Also, a bitfield is not really suited for arbitrary sets. For example, if you have a set that can contain any 32-bit integer, you need a bitfield of 4G bits, i.e. 4G/8 = 0.5G bytes.
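A rough sketch in Java of the bitfield approach, using the standard `java.util.BitSet`. The wrapper class `BitSetSets` and its methods `of`, `union`, `contains`, and `bytesFor` are hypothetical names for illustration. The cost remarks in the comments reflect how a word-array bitset is usually implemented and are an assumption, not a measured result.

```java
import java.util.BitSet;

// Illustrative sketch only: sets over {0..n-1} stored as bitfields.
class BitSetSets {
    // Builds a set containing the given elements.
    static BitSet of(int... elements) {
        BitSet s = new BitSet();
        for (int e : elements) s.set(e);
        return s;
    }

    // Union via bitwise OR. The OR runs over the underlying words, so the
    // work grows with n (roughly n / 64 word operations), not O(1).
    static BitSet union(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone();
        out.or(b);
        return out;
    }

    // Membership test for one given set: a single indexed bit lookup.
    static boolean contains(BitSet s, int x) {
        return s.get(x);
    }

    // Bytes needed for a bitfield covering universeSize distinct values.
    static long bytesFor(long universeSize) {
        return (universeSize + 7) / 8;
    }
}
```

For the 32-bit example, `bytesFor(1L << 32)` gives 536870912 bytes, which is the 0.5G figure quoted above. Note that `contains` answers "is x in this set?", so finding which of several sets holds x would still require checking each set in turn.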