There are many different descriptions and examples of the disjoint-set structure available online.
In some cases, each set stores a "rank". When two sets of the same rank are merged, the rank of the resulting root is increased by 1.
In other cases, each set stores its size. When one set is merged into another, their sizes are added.
Here it stores ranks.
In the Wikipedia article, it stores ranks.
In the Cornell University lecture notes, it stores ranks.
In the example from "Algorithms", by Sedgewick and Wayne, it stores sizes.
Here, it also stores sizes (main site).
Cormen et al. write:
The obvious approach would be to make the root of the tree with fewer nodes point to the root of the tree with more nodes. Rather than explicitly keeping track of the size of the subtree rooted at each node, we shall use an approach that eases the analysis. For each node, we maintain a rank, which is an upper bound on the height of the node. In union by rank, we make the root with smaller rank point to the root with larger rank during a UNION operation.
Which is better / more proper?
All the analyses indicate that both methods provide the optimal O(alpha(n)) amortized complexity when combined with the tree-collapsing (path compression) technique.
The only implementation-specific difference, then, is the amount of storage the size or rank variable needs. A size may need a full size_t, whereas a rank never exceeds log2(n), so it always fits in a handful of bits (a single byte is plenty).
Occasionally those few bits can be packed into otherwise unused bits of the data/nodes being processed, leading to better performance (both speed and size).
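To make the difference concrete, here is a minimal Java sketch of just the union step in both flavours (the array names parent, size and rank are mine, not taken from any of the sources above, and both methods assume they are handed roots, not arbitrary nodes):

    // Union by size: the root of the smaller tree is attached to the
    // root of the larger tree, and the sizes are added.
    void unionBySize(int[] parent, int[] size, int rootA, int rootB) {
        if (rootA == rootB) return;
        if (size[rootA] < size[rootB]) { int t = rootA; rootA = rootB; rootB = t; }
        parent[rootB] = rootA;          // smaller tree hangs under larger tree
        size[rootA] += size[rootB];     // sizes are added
    }

    // Union by rank: the root of smaller rank is attached to the root of
    // larger rank; a rank only grows when the two ranks are equal.
    void unionByRank(int[] parent, int[] rank, int rootA, int rootB) {
        if (rootA == rootB) return;
        if (rank[rootA] < rank[rootB]) { int t = rootA; rootA = rootB; rootB = t; }
        parent[rootB] = rootA;
        if (rank[rootA] == rank[rootB]) rank[rootA]++;  // only case where rank grows
    }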
Let's say the number of children per node is only known at run-time, e.g. 20. The trees are also not necessarily full. Unfortunately, reducing the number of leaves doesn't seem to be an option either, as the structure of the tree preserves some physical meaning.
Memory efficiency seems to be a big issue for nodes with more than one child. If the memory for the child-pointer array has to be reserved/allocated up front, a lot of it may go unused; if a dynamic array/vector is used to hold the pointers, it is slow whenever reallocation happens.
So my question is: is there a data structure that preserves the relative parent-child relations while not using a tree with a high number of leaves?
I am talking about the union-find disjoint-set data structure. There are multiple resources on the internet about how to implement it. So far, I have learnt of two optimization techniques for union-find. The first is 'balancing' the tree via a rank variable, which records how deep the deepest node is and is therefore an upper bound on the cost of find(). The second is setting an object's parent to be the head node while calling find() (this is also done for the object's ancestors, so it becomes a cascade of optimizations).
However, when implementations use the two of them at once, they usually merge the two together without much thought. Specifically, GeeksforGeeks (just as an example, nothing personal) does this. Wouldn't this lead to the ranks getting "corrupted" and to O(log n) complexity?
For example, if I have a long chain of nodes (5 to 4 to 3 to 2 to 1 to 0, where 0 is the head) and I call find() on 2, the rank stays 5 even though it should be 3.
In such implementations, ranks are still upper bounds on the heights of the trees.
They may indeed become inexact upper bounds.
The log* proof does not seem to rely on exactness of that upper bound.
In Tarjan's 1975 article "Efficiency of a Good But Not Linear Set Union Algorithm" linked at the bottom of the above page, he seems to use union-by-size instead of union-by-rank.
The size (number of vertices), unlike the exact rank, is easy to maintain in O(1) operations per union.
Rank is not a strict measure of depth. From Wikipedia:
the term rank is used instead of depth since it stops being equal to the depth if path compression (...) is also used
Note also that the example you give cannot occur. There is no order of unions that will result in a chain of single nodes when using union by rank. In fact, a tree with rank r has at least 2^r nodes (easily proved by induction). It is also unclear to me how you arrive at the conclusion that a rank that is "too large" will lead to logarithmic complexity.
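To make this concrete, here is a minimal sketch of a typical combination of the two techniques (names are mine, not from any particular implementation). The point to notice is that find() never touches rank[], so after compression rank[] may overestimate the real height, but it remains a valid upper bound, which is all the analysis needs:

    // find() with path compression: every node on the search path is
    // re-parented directly to the root. rank[] is deliberately left alone,
    // so it becomes a (possibly loose) upper bound on the height.
    int find(int[] parent, int x) {
        if (parent[x] != x) {
            parent[x] = find(parent, parent[x]);  // compress the path
        }
        return parent[x];
    }

    // Union by rank, unchanged by compression.
    void union(int[] parent, int[] rank, int a, int b) {
        int ra = find(parent, a), rb = find(parent, b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                          // smaller rank goes under larger rank
        if (rank[ra] == rank[rb]) rank[ra]++;     // only case where a rank grows
    }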
Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
How does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4: "However, as mentioned earlier B-Trees nodes do not facilitate radix searching. Instead we chose the initial invariant of allowing the node sizes to range between m and m - 1. This defines a family of balanced trees starting with well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This invariant ensures balancing and achieves radix branch search in the majority of cases. Occasionally a few step linear search is needed after the radix search to find the correct branch. The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
@mcdowella is correct that that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and 2m - 1, because that allows a node to be split in two when it gets too big, or two nodes to be joined into one when they get too small, without ever needing to touch a third node. So the "2" missing from "2m" in the paper is a typo. Jean Niklas L'orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length, which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes to the left (more on this later) so you don't have to guess which branch of the tree to descend into. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing two strict levels if you repeatedly insert at 0.
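For what it's worth, the bit-shift/bit-mask trick for strict nodes looks roughly like the sketch below (a simplification with 32-way branching, not the actual Clojure/Scala/Paguro code; names and the shift parameter are mine):

    // Descent through a strict 32-way trie: because every strict node has
    // exactly 32 slots, the child index at each level is just a 5-bit slice
    // of the element index -- no division and no per-node size information.
    // 'shift' is BITS * (tree height - 1), i.e. the bit offset of the root level.
    static final int BITS = 5;               // 2^5 = 32 children per node
    static final int MASK = (1 << BITS) - 1;

    static Object get(Object[] root, int index, int shift) {
        Object[] node = root;
        for (int level = shift; level > 0; level -= BITS) {
            node = (Object[]) node[(index >>> level) & MASK];
        }
        return node[index & MASK];           // leaf level
    }

    // A relaxed node cannot take this shortcut: it has to look at the sizes
    // of its children and possibly scan a few extra slots, hence the slowdown.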
Their tree also has a uniform height - all leaf nodes are the same distance from the root. I think it would still work if relaxed trees were allowed to be within, say, one level of one another, though I'm not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector, because the RRB Tree has one inside it. The two differences between that and the RRB Tree are: 1. the RRB Tree allows "relaxed" nodes, and 2. the RRB Tree may have a "focus" instead of a "tail". Both focus and tail are small buffers (maybe the same size as a strict leaf node); the difference is that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These two differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.
I am implementing the quick-union algorithm for a union-find structure. In the implementation given at the "Algorithms in Java" book site, the Princeton code fails to maintain the size invariant of the trees while implementing path compression (in the find() method). Shouldn't this adversely affect the algorithm, or am I missing something? Also, if I am right, how would we go about modifying the size array?
Unless I'm mistaken, I think that this code is indeed maintaining the invariant that the root of each tree stores the number of nodes in its subtree.
When the data structure is created, note that the constructor sets sz[i] = 1 for each node in the forest. This means that the values start off correct.
During a union operation, the data structure correctly adjusts the size of the root of the merged trees. Therefore, after any union operation, all the tree roots have the correct sizes.
While you are correct that the sizes aren't updated during path compression in the find step, there is no reason for the data structure to change sizes here. Path compression just reduces the length of the paths from nodes in a tree up to the root of that tree; it doesn't change the number of nodes stored in the tree. Accordingly, the size information at the root of the tree undergoing path compression does not need to change. Although some internal subtrees might lose children as nodes are reparented higher up in the tree, this is irrelevant, because the union-find structure only needs to maintain size information at the roots of its trees, not at internal nodes.
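In other words, the relevant parts of such an implementation look something like the sketch below (a paraphrase of the weighted-quick-union-with-path-compression idea, not the book's exact code; it assumes instance arrays parent[] and sz[] initialized as in the constructor described above). Note that sz[] is only ever read at roots, so stale entries at internal nodes are harmless:

    int find(int p) {
        int root = p;
        while (root != parent[root]) root = parent[root];
        while (p != root) {                // path compression: re-parent the
            int next = parent[p];          // whole search path onto the root;
            parent[p] = root;              // sz[] is intentionally not touched
            p = next;
        }
        return root;
    }

    void union(int p, int q) {
        int rootP = find(p), rootQ = find(q);
        if (rootP == rootQ) return;
        if (sz[rootP] < sz[rootQ]) { int t = rootP; rootP = rootQ; rootQ = t; }
        parent[rootQ] = rootP;             // smaller tree goes under larger tree
        sz[rootP] += sz[rootQ];            // sizes stay exact at the roots
    }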
Overall, this means that the data structure does correctly store size information. There is no adverse impact on runtime, nor is there a need to correct anything.
Hope this helps!
If I have a large set of continuous ranges (e.g. [0..5], [10..20], [7..13], [-1..37]) and can arrange those sets into any data structure I like, what's the most efficient way to test which sets a particular test_number belongs to?
I've thought about storing the sets in a balanced binary tree keyed on the low number of each set (each node would hold all the sets that share that lowest number). This would let me prune efficiently: if the test_number is less than the lowest number at a node, that node and all the nodes to its right (whose ranges start above the test_number) can be discarded. I think that would prune about 25% of the sets on average, but then I would need to look linearly at all the remaining nodes in the binary tree to determine whether the test_number belongs to their sets. (I could further optimize by sorting the list of sets at any one node by the highest number in the set, which would allow a binary search within a specific list to determine which sets, if any, contain the test_number. Unfortunately, most of the sets I'll be dealing with don't have overlapping set boundaries.)
I think that this problem has been solved in graphics processing since they've figured out ways to efficiently test which polygons in their entire model contribute to a specific pixel, but I don't know the terminology of that type of algorithm.
Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree; it is particularly well suited for the counting query you want. See also its description in Computational Geometry.
I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree, it should be a ternary tree. Each node has left, middle, and right children, where the left child holds sets strictly less than the node's set, the right child holds sets strictly greater, and the middle child holds sets that overlap it.
Set1
/ | \
/ | \
/ | \
Set2 Set3 Set4
It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.
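If it helps, a bare-bones version of that ternary structure might look like the sketch below (class and method names are mine, ranges are closed integer intervals, and there is no rebalancing):

    import java.util.ArrayList;
    import java.util.List;

    // One node per stored range. The left subtree holds ranges strictly below
    // this one, the right subtree holds ranges strictly above it, and the
    // middle subtree holds ranges that overlap it.
    class RangeNode {
        final int lo, hi;
        RangeNode left, middle, right;

        RangeNode(int lo, int hi) { this.lo = lo; this.hi = hi; }

        static RangeNode insert(RangeNode node, int lo, int hi) {
            if (node == null) return new RangeNode(lo, hi);
            if (hi < node.lo)      node.left   = insert(node.left, lo, hi);
            else if (lo > node.hi) node.right  = insert(node.right, lo, hi);
            else                   node.middle = insert(node.middle, lo, hi); // overlap
            return node;
        }

        // Collect every stored range that contains x.
        void query(int x, List<int[]> out) {
            if (x >= lo && x <= hi) out.add(new int[]{lo, hi});
            if (middle != null) middle.query(x, out);         // overlapping ranges may still contain x
            if (x < lo && left != null) left.query(x, out);   // left ranges end before lo
            if (x > hi && right != null) right.query(x, out); // right ranges start after hi
        }
    }

The pruning matches the argument above: a range in the left subtree ends before this node's range starts, so it can only contain x when x is below this range, and symmetrically for the right subtree; overlapping ranges can stick out on either side, so the middle subtree always has to be visited.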
I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.
I think I would organise them in the same way MediaWiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and it's pretty easy to implement (even I've managed it, and in SQL at that!).
Basically, the algorithm for sorting is:

    For Each SetOfNumbers
        For Each NumberInSet
            Put SetOfNumbers into Bin(NumberInSet)

Then to query, you can just count the number of items in Bin(MyNumber).
This approach will work well when your SetOfNumbers rarely changes, although if they change regularly it's generally not too hard to keep the bins updated either. Its chief disadvantage is that it trades space, and initial sorting time, for very fast queries.
Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.
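Translated out of SQL into code, the same idea is just a map from each number to the list of ranges covering it (a rough sketch; it assumes the ranges are integer ranges small enough to enumerate, exactly as in the pseudocode above):

    import java.util.*;

    // bins.get(n) ends up holding every range that contains n.
    Map<Integer, List<int[]>> bins = new HashMap<>();

    // Build step: expand each range into its individual numbers.
    void addRange(int lo, int hi) {
        for (int n = lo; n <= hi; n++) {
            bins.computeIfAbsent(n, k -> new ArrayList<>()).add(new int[]{lo, hi});
        }
    }

    // Query step: a single hash lookup; the size of the returned list is the count.
    List<int[]> setsContaining(int testNumber) {
        return bins.getOrDefault(testNumber, Collections.emptyList());
    }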