The images below show a union-find problem solved using union by rank with path compression. If my handwriting is hard to read, the description below explains what I have done.
Description:
First I did Union(1,2), then Union(3,4), then Union(5,6), and so on, comparing ranks at each step and applying path compression whenever I looked up the representative element of a tree that was about to be merged.
My Doubt:
My doubt is this: if you look at the final tree in the image, you'll see that it is completely flat (by "flat" I am referring to the tree's depth). Will path compression always result in a flat tree, no matter how many elements are present?
Also, how can we find the time complexity of union-find with path compression?
It is possible to build inverse-trees of unlimited depth. E.g., if you happen to always choose the roots as your Union() arguments, then no paths will be compressed, and you can build your tree as tall as you like.
If (as your written notes suggest) you use rank to choose the root resulting from your union, then your trees will be balanced, so you will need Ω(2^n) operations to generate a tree of depth n. (Specifically: to build a tree of depth n, first build two trees of depth n-1 and take the Union() of their roots.)
The amortized time complexity of union-find with union by rank and path compression is known to be O(α(n)) per operation, where α is the inverse Ackermann function.
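To make the two points above concrete, here is a minimal sketch of union by rank with path compression. The names `DisjointSet`, `depth` and `build_deep` are made up for this illustration. `build_deep(k)` only ever passes roots to `union`, so no path is compressed and the result is a tree of depth k built from 2^k elements; a later `find` on the deepest element then flattens the path it walked.

```python
class DisjointSet:
    """Union-find with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n           # upper bound on the height below each root

    def find(self, x):
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the walked path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra            # ra is now the root with the larger rank
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1         # equal ranks: the merged tree is one taller
        return ra

    def depth(self, x):
        """Number of parent links between x and its root (no compression)."""
        d = 0
        while self.parent[x] != x:
            x = self.parent[x]
            d += 1
        return d


def build_deep(k):
    """Build one tree of depth k from 2**k elements by only ever uniting roots."""
    ds = DisjointSet(2 ** k)
    roots = list(range(2 ** k))
    while len(roots) > 1:
        roots = [ds.union(roots[i], roots[i + 1]) for i in range(0, len(roots), 2)]
    return ds
```

So the tree is not always flat: path compression only shortens the paths that `find` actually walks, and a tree of depth k can exist as long as at least 2^k elements went into it.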
Is there a balanced BST structure that also keeps track of subtree size in each node?
In Java, TreeMap is a red-black tree, but doesn't provide subtree size in each node.
Previously, I wrote a BST that keeps track of the subtree size of each node, but it's not balanced.
The questions are:
Is it possible to implement such a tree while keeping O(lg(n)) efficiency for the basic operations?
If yes, are there any third-party libraries that provide such an implementation?
A Java implementation would be great, but other languages (e.g., C, Go) would also be helpful.
BTW:
The subtree size should be tracked in each node,
so that the size can be read without traversing the subtree.
Possible application:
Keeping track of the rank of items whose value (which the rank depends on) might change on the fly.
The Weight Balanced Tree (also called the Adams Tree, or Bounded Balance tree) keeps the subtree size in each node.
This also makes it possible to find the Nth element, from the start or end, in O(log(n)) time.
My implementation in Nim is on GitHub. It has these properties:
Generic (parameterized) key,value map
Insert (add), lookup (get), and delete (del) in O(log(N)) time
Key-ordered iterators (inorder and revorder)
Lookup by relative position from beginning or end (getNth) in O(log(N)) time
Get the position (rank) by key in O(log(N)) time
Efficient set operations using tree keys
Map extensions to set operations with optional value merge control for duplicates
There are also implementations in Scheme and Haskell available.
That's called an "order statistic tree": https://en.wikipedia.org/wiki/Order_statistic_tree
It's pretty easy to add the size to any kind of balanced binary tree (red-black, AVL, B-tree, etc.), or you can use a balancing algorithm that works with the size directly, like weight-balanced trees (see DougCurrie's answer) or (better) size-balanced trees: https://cs.wmich.edu/gupta/teaching/cs4310/lectureNotes_cs4310/Size%20Balanced%20Tree%20-%20PEGWiki%20sourceMayNotBeFullyAuthentic%20but%20description%20ok.pdf
Unfortunately, I don't think there are any standard-library implementations, but you can find open source if you look for it. You may want to roll your own.
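If you do roll your own, the idea of keeping a size field in every node can be sketched as follows. To keep the example short this uses a treap (random priorities) for balancing, which is a swapped-in technique and not one of the trees named above; it gives expected rather than worst-case O(log n). All names (`Node`, `split`, `merge`, `insert`, `rank`, `select`) are made up for this illustration, and the sketch is in Python rather than Java.

```python
import random


class Node:
    __slots__ = ("key", "prio", "left", "right", "size")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority keeps the tree balanced
        self.left = None
        self.right = None
        self.size = 1                # number of nodes in the subtree rooted here


def size(t):
    return t.size if t else 0


def update(t):
    t.size = 1 + size(t.left) + size(t.right)
    return t


def split(t, key):
    """Split t into (keys < key, keys >= key)."""
    if t is None:
        return None, None
    if t.key < key:
        a, b = split(t.right, key)
        t.right = a
        return update(t), b
    a, b = split(t.left, key)
    t.left = b
    return a, update(t)


def merge(a, b):
    """Merge two treaps where every key in a is <= every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return update(a)
    b.left = merge(a, b.left)
    return update(b)


def insert(t, key):
    a, b = split(t, key)
    return merge(merge(a, Node(key)), b)


def rank(t, key):
    """Number of stored keys strictly less than key."""
    r = 0
    while t:
        if key <= t.key:
            t = t.left
        else:
            r += size(t.left) + 1
            t = t.right
    return r


def select(t, k):
    """Return the k-th smallest key (0-based)."""
    while t:
        left = size(t.left)
        if k < left:
            t = t.left
        elif k == left:
            return t.key
        else:
            k -= left + 1
            t = t.right
    raise IndexError(k)
```

Both `rank` and `select` read only the size fields along one root-to-leaf path, which is what makes them logarithmic; the same two functions work unchanged on top of any other balancing scheme that keeps `size` up to date during rotations.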
There are many different descriptions and examples for the disjoint-set structure available on-line.
In some cases, for each set, it stores "rank". When a set is merged into another set of the same rank, the rank of the latter (the surviving root) is increased by 1.
In other cases, for each set, it stores its size. When a set is merged into another set, their sizes are added.
Here it stores ranks.
In the wikipedia article, it stores ranks.
In the Cornell University lecture notes, it stores ranks.
In the example from "Algorithms", by Sedgewick and Wayne, it stores sizes.
Here, it also stores sizes (main site).
Cormen et al. write:
The obvious approach would be to make the root of the tree with fewer
nodes point to the root of the tree with more nodes. Rather than
explicitly keeping track of the size of the subtree rooted at each
node, we shall use an approach that eases the analysis. For each
node, we maintain a rank, which is an upper bound on the height of
the node. In union by rank, we make the root with smaller rank point
to the root with larger rank during a UNION operation.
Which is better / more proper?
All the analyses indicate that both methods provide the optimal O(α(n)) amortized complexity when combined with path compression (the tree collapsing technique).
The only implementation-specific difference then comes from the storage that the size or rank variables take. A size can be as large as the number of elements, so it needs a full size_t, whereas a rank never exceeds log2(n) and therefore fits in 6 bits for any n below 2^64.
Occasionally those few bits can be packed into unused bits of the data or nodes being processed, leading to better performance (speed and size).
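As a rough illustration of how little the two variants differ, here is a sketch of both linking rules side by side. The function names are made up for this example, and `find` uses path halving, which is one assumed choice of compression among several. The demo at the bottom applies the same random unions to both structures.

```python
import random


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving: skip over the grandparent
        x = parent[x]
    return x


def union_by_rank(parent, rank, a, b):
    a, b = find(parent, a), find(parent, b)
    if a == b:
        return
    if rank[a] < rank[b]:
        a, b = b, a
    parent[b] = a                      # lower rank goes under higher rank
    if rank[a] == rank[b]:
        rank[a] += 1                   # only grows when the ranks were equal


def union_by_size(parent, size, a, b):
    a, b = find(parent, a), find(parent, b)
    if a == b:
        return
    if size[a] < size[b]:
        a, b = b, a
    parent[b] = a                      # smaller set goes under larger set
    size[a] += size[b]                 # sizes are simply added


# Demo: the same random unions applied to both variants.
n = 1024
rng = random.Random(0)
p_rank, ranks = list(range(n)), [0] * n
p_size, sizes = list(range(n)), [1] * n
for _ in range(5000):
    a, b = rng.randrange(n), rng.randrange(n)
    union_by_rank(p_rank, ranks, a, b)
    union_by_size(p_size, sizes, a, b)
```

After the demo, `max(ranks)` cannot exceed log2(n) = 10, while the size stored at a root can be as large as n itself. Union by size has the side benefit that the set sizes are available for free if the application needs them.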
The Wikipedia page for Rendezvous hashing (Highest Random Weight "HRW") makes the following claim:
While it might first appear that the HRW algorithm runs in O(n) time, this is not the case. The sites can be organized hierarchically, and HRW applied at each level as one descends the hierarchy, leading to O(log n) running time, as in.[7]
I got a copy of the referenced paper, "Hash-Based Virtual Hierarchies for Scalable Location Service in Mobile Ad-hoc Networks." However the hierarchy referenced in their paper seems to be very specific to their application domain. As far as I can discern, there is no clear indication of how to generalize the method. The Wikipedia remark makes it seem like log is the general case.
I looked at a few general HRW implementations, and none of them seemed to support anything better than linear time. I gave it some thought, but I don't see any way to organize sites hierarchically without parent nodes causing inefficient remapping when they drop out, which would largely defeat the main advantage of HRW.
Does anybody know how to do this? Alternatively, is Wikipedia incorrect about there being a general way to implement this in log time?
Edit: Investigating mcdowella's approach:
OK, I think I see how this could work. But you need a little more than you've specified.
If you just do what you've described, you end up in a situation where each leaf probably has either zero or one nodes in it, and there's significant variance in how many nodes are in the leaf-most subtrees. If you replace HRW at each level with a regular search tree, you get exactly the same effect. Essentially, you've got an implementation of consistent hashing, along with its flaw of unequal loading between buckets. Computing the combined weights, the defining feature of HRW, adds nothing; you're better off just doing a search at each level, since it saves doing the hashes and can be implemented without looping over each radix value.
It's fixable though: you just need to be using HRW to choose from many alternatives at the final level. That is, you need all of the leaf nodes to be in large buckets, comparable to the number of replicas you'd have in consistent hashing. These large buckets should be approximately equally loaded compared to each other, and then you're using HRW to choose the specific site. Since the bucket sizes are fixed, this is an O(log n) algorithm, and we get all of the key HRW properties.
Honestly though, I think this is pretty questionable. It isn't so much an implementation of HRW, as it is just combining HRW with consistent hashing. I guess there's nothing wrong with that, and it might even be better than the usual technique of using replicas, in some cases. But I think it's misleading to state that HRW is log(n), if this is actually what the author meant.
Additionally, the original description is also questionable. You don't need to apply HRW at each level, and you shouldn't, as there is no advantage in doing so; you should do something fast (such as indexing), and just use HRW for the final choice.
Is this really the best we can do, or is there some other way to make HRW O(log(n))?
If you give each site a sufficiently long random id expressed in radix k (perhaps by hashing a non-random id) then you can associate the sites with leaves of a tree which has at most k descendants at each node. There is no need to associate any site with an internal node of the tree.
To work out where to store an item, use HRW to work out from the root of the tree down which way to branch at tree nodes, stopping when you reach a leaf, which is associated with a site. You can do this without having to communicate with any site until you work out which site you want to store the item at - all you need to know is the hashed ids of the sites to construct a tree.
Because sites are associated only with leaves there is no way an internal node of the tree can drop out, except if all of the sites associated with leaves under it drop out, at which point it will become irrelevant.
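Here is a minimal sketch of that scheme, under several assumptions of mine: SHA-256 as the hash, radix `K = 4`, ids of `DIGITS = 16` digits (assumed long enough that no two sites collide), and a branch weight computed from the item and the digit path of the branch. The names `h`, `site_id`, `build` and `locate` are made up for this illustration.

```python
import hashlib

K = 4         # radix of the site ids (assumed value)
DIGITS = 16   # id length; assumed long enough that ids are distinct


def h(*parts):
    """64-bit hash of the given strings (SHA-256 is an arbitrary choice)."""
    data = "/".join(parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def site_id(name):
    """Hash a site name into DIGITS digits in radix K."""
    v, digits = h("site", name), []
    for _ in range(DIGITS):
        v, d = divmod(v, K)
        digits.append(d)
    return digits


def build(sites, depth=0):
    """Return a site name (leaf) or a dict mapping digit -> subtree."""
    if len(sites) == 1:
        return sites[0]
    groups = {}
    for name in sites:
        groups.setdefault(site_id(name)[depth], []).append(name)
    return {d: build(g, depth + 1) for d, g in groups.items()}


def locate(tree, item):
    """Descend from the root, using HRW to pick a branch at each interior node."""
    prefix = ""
    while isinstance(tree, dict):
        # Highest random weight among the branches that exist at this node.
        d = max(tree, key=lambda c: h("branch", prefix + str(c), item))
        prefix += str(d)
        tree = tree[d]
    return tree


# Demo: drop one site and compare the placement of every item.
sites = [f"site{i}" for i in range(20)]
items = [f"item{i}" for i in range(2000)]
tree_before = build(sites)
tree_after = build([s for s in sites if s != "site7"])
before = {x: locate(tree_before, x) for x in items}
after = {x: locate(tree_after, x) for x in items}
```

In the demo, every item that was not on `site7` keeps its site after `site7` drops out, because each branch the item chose still exists and still has the highest weight among the remaining branches. What this sketch does not give you is an even spread of the orphaned items: they all land in the subtree nearest to the dropped leaf.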
I don't buy the updated answer. There are two nice properties of HRWs that appear to get lost when you compare the weights of branches instead of all sites.
One is that you can pick the top-n sites instead of just the primary, and these should be randomly distributed. If you're descending into a single tree, the top-n sites will be near each other in the tree. This could be fixed by descending multiple times with different salts but that seems like a lot of extra work.
Two is that it is obvious what happens when a site is added or removed: only 1/|sites| of the data moves in the case of an add. If you modify the existing tree, it only affects the peer site. In the case of an add, the only data that moves is from the new peer of the added site. In the case of a delete, all the data that was at that site now moves to the former peer. If you instead recompute the tree, all of the data could move depending on the way the tree is constructed.
I think you can use the same "virtual node" approach normally used for consistent hashing. Suppose you have N physical nodes with IDs:
{n1,...,nN}.
Choose V, the number of virtual nodes per physical node, and generate a new list of IDs:
{n1v1,n1v2,...,n1vV,
 n2v1,n2v2,...,n2vV,
 ...,
 nNv1,nNv2,...,nNvV}.
Arrange these into the leaves of a fixed but randomized binary tree with labels on the internal nodes. These internal labels could be, for example, a concatenation of the labels of its child nodes.
To choose a physical node to store an object O at, start at the root and choose the branch with the higher hash H(label,O). Repeat the process until you reach a leaf. Store the object at the physical node corresponding to the virtual node at that leaf. This takes O(log(NV)) = O(log(N)+log(V)) = O(log(N)) steps (since V is constant).
If a physical node fails, the objects at that node are rehashed, skipping over subtrees with no active leaves.
One way to implement HRW rendezvous hashing in log time
One way to implement rendezvous hashing in O(log N), where N is the number of cache nodes:
Each file named F is cached in the cache node named C with the largest weight w(F,C), as is normal in rendezvous hashing.
First, we use a nonstandard hash function w() something like this:
w(F,C) = h(F) xor h(C).
where h() is some good hash function.
tree construction
Given some file named F, rather than calculate w(F,C) for every cache node -- which requires O(N) time for each file --
we pre-calculate a binary tree based only on the hashed names h(C) of the cache nodes;
a tree that lets us find the cache node with the maximum w(F,C) value in O(log N) time for each file.
Each leaf of the tree contains the name C of one cache node.
The root (at depth 0) of the tree points to 2 subtrees.
All the leaves where the most significant bit of h(C) is 0 are in the root's left subtree; all the leaves where the most significant bit of h(C) is 1 are in the root's right subtree.
The two children of the root node (at depth 1) deal with the next-most-significant bit of h(C).
And so on, with the interior nodes at depth D dealing with the D'th-most-significant bit of h(C).
With a good hash function, each step down from the root approximately halves the candidate cache nodes in the chosen subtree,
so we end up with a tree of depth roughly log_2 N.
(If we end up with a tree that is "too unbalanced",
somehow get everyone to agree on some different hash function from some universal hashing family and rebuild the tree, before we add any files to the cache, until we get a tree that is "not too unbalanced").
Once the tree has been built, we never need to change it no matter how many file names F we later encounter.
We only change it when we add or remove cache nodes from the system.
filename lookup
For a filename F that happens to hash to h(F) = 0 (all zero bits),
we find the cache node with the highest weight (for that filename) by starting at the root and always taking the right subtree when possible.
If that leads us to an interior node that doesn't have a right subtree, then we take its left subtree.
Continue until we reach a node without a left or right subtree -- i.e., a leaf node that contains the name of the selected cache node C.
When looking up some other file named F, first we hash its name to get h(F), then
we start at the root and go right or left (if possible) depending on whether the next bit in h(F) is 0 or 1, respectively.
Since the tree (by construction) is not "too unbalanced",
traversing the whole tree from the root to the leaf that contains the name of the chosen cache node C requires O(log N) time in the worst case.
We expect that for a typical set of file names,
the hash function h(F) "randomly" chooses left or right at each depth of the tree.
Since the tree (by construction) is not "too unbalanced",
we expect each physical cache node to cache roughly the same number of files (within a multiple of 4 or so).
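The construction and lookup described above can be sketched as follows. This is only an illustration under my own assumptions: SHA-256 truncated to 64 bits stands in for h(), all h(C) values are assumed distinct, and the tree is stored as nested two-element lists with the cache node name at each leaf. The names `build`, `lookup` and `lookup_linear` are made up; `lookup_linear` is the plain O(N) scan over w(F,C) = h(F) xor h(C), included to check the tree against.

```python
import hashlib

BITS = 64


def h(name):
    """64-bit hash of a name (SHA-256 is an arbitrary stand-in for h())."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")


def build(nodes, bit=BITS - 1):
    """nodes: list of (h(C), C). Leaf = name; interior = [bit-0 child, bit-1 child]."""
    if len(nodes) == 1:
        return nodes[0][1]
    left = [n for n in nodes if not (n[0] >> bit) & 1]
    right = [n for n in nodes if (n[0] >> bit) & 1]
    return [build(left, bit - 1) if left else None,
            build(right, bit - 1) if right else None]


def lookup(tree, filename):
    """Walk from the root to the leaf that maximises h(F) xor h(C)."""
    hf = h(filename)
    bit = BITS - 1
    while isinstance(tree, list):
        want = 1 - ((hf >> bit) & 1)   # the h(C) bit that makes this xor bit 1
        tree = tree[want] if tree[want] is not None else tree[1 - want]
        bit -= 1
    return tree


def lookup_linear(caches, filename):
    """Reference O(N) version: compute w(F, C) for every cache node."""
    hf = h(filename)
    return max(caches, key=lambda c: hf ^ h(c))


# Demo: build the tree once, then place a batch of file names.
caches = [f"cache{i}" for i in range(50)]
files = [f"file{i}" for i in range(1000)]
tree = build([(h(c), c) for c in caches])
placement = {f: lookup(tree, f) for f in files}
```

The greedy walk works because the xor weight is compared most significant bit first: winning a higher bit always beats anything that happens in the lower bits, so the choice at each depth never has to be revisited.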
drop out effects
When some physical cache node fails,
everyone deletes the corresponding leaf node from their copy of this tree.
(Everyone also deletes every interior node that then has no leaf descendants).
This doesn't require moving around any files cached on any other cache node -- they still map to the same cache node they always did.
(The right-most leaf node in a tree is still the right-most leaf node in that tree, no matter how many other nodes in that tree are deleted).
For example,
   ....
       \
        |
       / \
      |   |
     /   / \
    |   X   |
   / \     / \
  V   W   Y   Z
With this O(log N) algorithm, when cache node X dies, leaf X is deleted from the tree, and all its files become (hopefully relatively evenly) distributed between Y and Z -- none of the files from X end up at V or W or any other cache node.
All the files that previously went to cache nodes V, W, Y, Z continue to go to those same cache nodes.
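The drop-out behaviour can be checked without building the tree at all, since the tree walk picks the same cache node as a linear scan over w(F,C) = h(F) xor h(C). The sketch below makes the same assumptions as before (SHA-256 truncated to 64 bits as h(), distinct hashes) and uses made-up names. The "sibling subtree" of the dead leaf is computed as the set of survivors whose hash shares the longest prefix with the dead node's hash.

```python
import hashlib


def h(name):
    """64-bit hash of a name (SHA-256 is an arbitrary stand-in for h())."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")


def owner(caches, filename):
    """Cache node with the largest w(F, C) = h(F) xor h(C)."""
    hf = h(filename)
    return max(caches, key=lambda c: hf ^ h(c))


caches = [f"cache{i}" for i in range(16)]
files = [f"file{i}" for i in range(5000)]
dead = "cache5"
survivors = [c for c in caches if c != dead]

before = {f: owner(caches, f) for f in files}
after = {f: owner(survivors, f) for f in files}
moved = {f for f in files if before[f] != after[f]}

# Survivors in the sibling subtree of the dead leaf: their hashes differ from
# h(dead) only below the lowest possible bit position.
closest = min((h(c) ^ h(dead)).bit_length() for c in survivors)
siblings = {c for c in survivors if (h(c) ^ h(dead)).bit_length() == closest}
```

Every file in `moved` was on the dead node, and every one of them ends up on a node in `siblings`, which mirrors the X to Y/Z example above. It also makes the trade-off visible: the orphaned files are shared among the neighbours in the tree, not among all surviving nodes.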
rebalancing after dropout
Many cache nodes failing, new cache nodes being added, or both may make the tree "too unbalanced".
Picking a new hash function is a big hassle after we've added a bunch of files to the cache, so rather than pick a new hash function like we did when initially constructing the tree, maybe it would be better to rebalance the tree by removing a few nodes, renaming them with some new semi-random names, and then adding them back to the system.
Repeat until the system is no longer "too unbalanced".
(Start with the most unbalanced nodes -- the nodes caching the least amount of data).
comments
p.s.:
I think this may be pretty close to what mcdowella was thinking,
but with more details filled in to clarify that (a) yes, it is log(N) because it's a binary tree that is "not too unbalanced", (b) it doesn't have "replicas", and (c) when one cache node fails, it doesn't require any remapping of files that were not on that cache node.
p.p.s.:
I'm pretty sure that Wikipedia page is wrong to imply that typical implementations of rendezvous hashing occur in O(log N) time, where N is the number of cache nodes.
It seems to me (and I suspect the original designers of the hash as well) that the time it takes to (internally, without communicating) recalculate a hash against every node in the network is going to be insignificant and not worth worrying about compared to the time it takes to fetch data from some remote cache node.
My understanding is that rendezvous hashing is almost always implemented with a simple linear algorithm that uses O(N) time, where N is the number of cache nodes, every time we get a new filename F and want to choose the cache node for that file.
Such a linear algorithm has the advantage that it can use a "better" hash function than the above xor-based w(), so when some physical cache node dies, all the files that were cached on the now-dead node are expected to become evenly distributed among all the remaining nodes.
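For comparison, the simple linear algorithm is only a few lines. This is a sketch with assumed details: SHA-256 over the string "cache|file" as the weight function, and made-up names `weight` and `choose`.

```python
import hashlib


def weight(filename, cache):
    """w(F, C): hash of the (cache, file) pair; SHA-256 is an assumed choice."""
    digest = hashlib.sha256(f"{cache}|{filename}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def choose(caches, filename):
    """O(N) rendezvous hashing: the cache node with the highest weight wins."""
    return max(caches, key=lambda c: weight(filename, c))


# Demo: drop one cache node and see which files change owner.
caches = [f"cache{i}" for i in range(16)]
files = [f"file{i}" for i in range(5000)]
dead = "cache5"
survivors = [c for c in caches if c != dead]
before = {f: choose(caches, f) for f in files}
after = {f: choose(survivors, f) for f in files}
moved = {f for f in files if before[f] != after[f]}
```

Only files that were on the dead node move. Because each (file, cache) pair gets an independent weight, their new owner is simply whichever survivor had the second-highest weight for that file, so the moved files are expected to spread across all survivors.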
I'm looking at the implementation of UnionFind with union by rank and path compression from here http://en.wikipedia.org/wiki/Disjoint-set_data_structure#Disjoint-set_forests (it's pretty much the same pseudo-code as in CLRS) and I don't understand why path compression doesn't change the rank. If we call find for an endpoint of the longest path from the root, the rank should go down; if it doesn't, the next union operation will choose an incorrect root.
"Rank" is one of those horribly overloaded terms in theoretical computer science. As Wikipedia notes, in the context of that disjoint set data structure with path compression, rank is not an intrinsic property of the current topology of the forest -- there's just no good way to keep the height of each node up to date. As defined by the sequence of unions, however, rank is useful in proving the running time bound involving the inverse Ackermann function.
Rank is not the actual depth of the tree; rather, it is an upper bound. As such, on a find operation, the rank is allowed to get out of sync with the depth.
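A small sketch shows the rank and the real height drifting apart. The helper names are made up for this illustration; the code follows the usual union by rank and path compression pseudo-code, with a recursive `find`.

```python
def find(parent, x):
    if parent[x] != x:
        parent[x] = find(parent, parent[x])  # path compression; rank is not touched
    return parent[x]


def union(parent, rank, a, b):
    a, b = find(parent, a), find(parent, b)
    if a == b:
        return a
    if rank[a] < rank[b]:
        a, b = b, a
    parent[b] = a
    if rank[a] == rank[b]:
        rank[a] += 1
    return a


def height(parent):
    """Actual height of the forest: the longest chain of parent links."""
    def depth(x):
        d = 0
        while parent[x] != x:
            x, d = parent[x], d + 1
        return d
    return max(depth(x) for x in range(len(parent)))


# Build one tree from 8 elements by uniting roots pairwise.
parent, rank = list(range(8)), [0] * 8
roots = list(range(8))
while len(roots) > 1:
    roots = [union(parent, rank, roots[i], roots[i + 1])
             for i in range(0, len(roots), 2)]
root = roots[0]

height_before = height(parent)  # 3, equal to rank[root]
find(parent, 7)                 # compresses the path 7 -> 6 -> 4 -> 0
height_after = height(parent)   # 2, while rank[root] is still 3
```

The stale rank does no harm: union only needs the rank to be an upper bound on the height, and a too-large bound can at worst make a slightly shallower tree the root, which still keeps every path no longer than the rank says.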