Most performant way to find all the leaf nodes in a tree data structure - algorithm

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!

Memory layout is essential to optimal retrieval, so the child lists should be contiguous arrays rather than linked lists, and the nodes should be placed one after another in retrieval order.
The more static your tree is, the better layout can be done.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use offset to find the child relative to itself
Con
inserting / deleting can be cumbersome
insert / delete O(N)
insert might lead to resize of the array leading to a costly copy
Two array layout
One array for internal nodes
One array for leafs
Internal nodes points to the leafs
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly only interested in the leafs).
no unneeded page lookups
indirect lookups can be made
Con
if all leafs are ordered insert / delete can be cumbersome
if leafs are unordered insertion is easy: just add at the end.
deleting unordered leafs is also a problem if no tombstones are allowed as the last leaf would have to be moved back and the internal nodes would need fix up. (via a further indirection this can also be fixed see slot-map)
resizing either array might lead to a large copy, though a smaller one than in the all-in-one layout, as the two arrays can be resized independently.
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while this removes much of the extra work of linked-list children, the individual lists are dispersed among all other data, so lookups take extra time.
insert might cause resize and copy of an array.
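As a rough illustration of the two array layout described above, here is a minimal Python sketch. The class name TwoArrayTree, the nested-list input format, and the (is_leaf, index) child references are all assumptions made for this example. Python lists only model the indexing logic; they do not give the cache behaviour of a real contiguous array in C or C++.

```python
from typing import List, Tuple, Union

# Hypothetical input format: a str is a leaf, a list is an internal node.
Nested = Union[str, list]


class TwoArrayTree:
    def __init__(self) -> None:
        # All leaves live in one array, in left-to-right order.
        self.leaves: List[str] = []
        # Each internal node is a list of (is_leaf, index) child references.
        self.internal: List[List[Tuple[bool, int]]] = []

    def add(self, nested: Nested) -> Tuple[bool, int]:
        """Store a nested tree and return a reference to its root."""
        if isinstance(nested, str):
            self.leaves.append(nested)
            return (True, len(self.leaves) - 1)
        slot = len(self.internal)
        self.internal.append([])
        for child in nested:
            ref = self.add(child)
            self.internal[slot].append(ref)
        return (False, slot)


flat = TwoArrayTree()
root_ref = flat.add([["a1", "a2"], "b", ["c1"]])
all_leaves = flat.leaves  # no traversal needed: the leaf array is the answer
```

With this layout, "get all the leaf nodes" is a single pass over one array, at the cost of the more awkward inserts and deletes listed above.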

Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things, plus the branch nodes along the way. The constant overhead is the branch nodes.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.
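For reference, the plain O(n) traversal can be sketched as follows. The Node class and the sample tree are made up for this example; an explicit stack is used instead of recursion, which is one common choice rather than the only one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    value: str
    children: List["Node"] = field(default_factory=list)


def collect_leaves(root: Node) -> List[Node]:
    """Return every leaf under root, left to right, visiting each node once."""
    leaves = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves.append(node)
        else:
            # Push children reversed so the leftmost child is popped first.
            stack.extend(reversed(node.children))
    return leaves


tree = Node("root", [
    Node("a", [Node("a1"), Node("a2")]),
    Node("b"),
    Node("c", [Node("c1")]),
])
leaf_values = [leaf.value for leaf in collect_leaves(tree)]
```

Every branch node is pushed and popped exactly once, which is the overhead that a larger branching factor reduces.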

Related

HRW rendezvous hashing in log time?

The Wikipedia page for Rendezvous hashing (Highest Random Weight "HRW") makes the following claim:
While it might first appear that the HRW algorithm runs in O(n) time, this is not the case. The sites can be organized hierarchically, and HRW applied at each level as one descends the hierarchy, leading to O(log n) running time, as in.[7]
I got a copy of the referenced paper, "Hash-Based Virtual Hierarchies for Scalable Location Service in Mobile Ad-hoc Networks." However the hierarchy referenced in their paper seems to be very specific to their application domain. As far as I can discern, there is no clear indication of how to generalize the method. The Wikipedia remark makes it seem like log is the general case.
I looked at a few general HRW implementations, and none of them seemed to support anything better than linear time. I gave it some thought, but I don't see any way to organize sites hierarchically without causing parent nodes to cause inefficient remapping when they drop out, significantly defeating the main advantage of HRW.
Does anybody know how to do this? Alternatively, is Wikipedia incorrect about there being a general way to implement this in log time?
Edit: Investigating mcdowella's approach:
OK, I think I see how this could work. But you need a little more than you've specified.
If you just do what you've described, you get in a situation where each leaf probably just has either zero or one nodes in it, and there's significant variance in how many nodes are in the leaf-most subtrees. If you swap using HRW at each level with just making the whole thing a regular search tree, you get exactly the same effect. Essentially, you've got an implementation of consistent hashing, along with its flaw of having unequal loading between buckets. Computing the combined weights, the defining implementation of HRW, adds nothing; you're better off just doing a search at each level, since it saves doing the hashes, and can be implemented without looping over each radix value
It's fixable though: you just need to be using HRW to choose from many alternatives at the final level. That is, you need all of the leaf nodes to be in large buckets, comparable to the number of replicas you'd have in consistent hashing. These large buckets should be approximately equally-loaded compared to each other, and then you're using HRW to choose the specific site. Since the bucket sizes are fixed, this is an O(log n) algorithm, and we get all of the key HRW properties.
Honestly though, I think this is pretty questionable. It isn't so much an implementation of HRW, as it is just combining HRW with consistent hashing. I guess there's nothing wrong with that, and it might even be better than the usual technique of using replicas, in some cases. But I think it's misleading to state that HRW is log(n), if this is actually what the author meant.
Additionally, the original description is also questionable. You don't need to apply HRW at each level, and you shouldn't, as there is no advantage in doing so; you should do something fast (such as indexing), and just use HRW for the final choice.
Is this really the best we can do, or is there some other way to make HRW O(log(n))?
If you give each site a sufficiently long random id expressed in radix k (perhaps by hashing a non-random id) then you can associate the sites with leaves of a tree which has at most k descendants at each node. There is no need to associate any site with an internal node of the tree.
To work out where to store an item, use HRW to work out from the root of the tree down which way to branch at tree nodes, stopping when you reach a leaf, which is associated with a site. You can do this without having to communicate with any site until you work out which site you want to store the item at - all you need to know is the hashed ids of the sites to construct a tree.
Because sites are associated only with leaves there is no way an internal node of the tree can drop out, except if all of the sites associated with leaves under it drop out, at which point it will become irrelevant.
I don't buy the updated answer. There are two nice properties of HRWs that appear to get lost when you compare the weights of branches instead of all sites.
One is that you can pick the top-n sites instead of just the primary, and these should be randomly distributed. If you're descending into a single tree, the top-n sites will be near each other in the tree. This could be fixed by descending multiple times with different salts but that seems like a lot of extra work.
Two is that it is obvious what happens when a site is added or removed, and only 1/|sites| of the data moves in the case of an add. If you modify the existing tree, it only affects the peer site. In the case of an add, the only data that moves is from the new peer of the added site. In the case of a delete, all the data that was at that site now moves to the former peer. If you instead recompute the tree, all of the data could move depending on the way the tree is constructed.
I think you can use the same "virtual node" approach normally used for consistent hashing. Suppose you have N physical nodes with IDs:
{n1,...,nN}.
Choose V, the number of virtual nodes per physical node, and generate a new list of IDs:
{n1v1,n1v2,...,n1vV
,n2v1,n2v2,...,n2vV
,...
,nNv1,nNv2,...,nNvV}.
Arrange these into the leaves of a fixed but randomized binary tree with labels on the internal nodes. These internal labels could be, for example, a concatenation of the labels of its child nodes.
To choose a physical node to store an object O at, start at the root and choose the branch with the higher hash H(label,O). Repeat the process until you reach a leaf. Store the object at the physical node corresponding to the virtual node at that leaf. This takes O(log(NV)) = O(log(N)+log(V)) = O(log(N)) steps (since V is constant).
If a physical node fails, the objects at that node are rehashed, skipping over subtrees with no active leaves.
One way to implement HRW rendezvous hashing in log time
One way to implement rendezvous hashing in O(log N), where N is the number of cache nodes:
Each file named F is cached in the cache node named C with the largest weight w(F,C), as is normal in rendezvous hashing.
First, we use a nonstandard hash function w() something like this:
w(F,C) = h(F) xor h(C).
where h() is some good hash function.
tree construction
Given some file named F, rather than calculate w(F,C) for every cache node -- which requires O(N) time for each file --
we pre-calculate a binary tree based only on the hashed names h(C) of the cache nodes;
a tree that lets us find the cache node with the maximum w(F,C) value in O(log N) time for each file.
Each leaf of the tree contains the name C of one cache node.
The root (at depth 0) of the tree points to 2 subtrees.
All the leaves where the most significant bit of h(C) is 0 are in the root's left subtree; all the leaves where the most significant bit of h(C) is 1 are in the root's right subtree.
The two children of the root node (at depth 1) deal with the next-most-significant bit of h(C).
And so on, with the interior nodes at depth D dealing with the D'th-most-significant bit of h(C).
With a good hash function, each step down from the root approximately halves the candidate cache nodes in the chosen subtree,
so we end up with a tree of depth roughly log_2 N.
(If we end up with a tree that is "too unbalanced",
somehow get everyone to agree on some different hash function from some universal hashing family and rebuild the tree, before we add any files to the cache, until we get a tree that is "not too unbalanced").
Once the tree has been built, we never need to change it no matter how many file names F we later encounter.
We only change it when we add or remove cache nodes from the system.
filename lookup
For a filename F that happens to hash to h(F) = 0 (all zero bits),
we find the cache node with the highest weight (for that filename) by starting at the root and always taking the right subtree when possible.
If that leads us to an interior node that doesn't have a right subtree, then we take its left subtree.
Continue until we reach a node without a left or right subtree -- i.e., a leaf node that contains the name of the selected cache node C.
When looking up some other file named F, first we hash its name to get h(F), then
we start at the root and go right or left (if possible), depending on whether the next bit in h(F) is 0 or 1 respectively.
Since the tree (by construction) is not "too unbalanced",
traversing the whole tree from the root to the leaf that contains the name of the chosen cache node C requires O(log N) time in the worst case.
We expect that for a typical set of file names,
the hash function h(F) "randomly" chooses left or right at each depth of the tree.
Since the tree (by construction) is not "too unbalanced",
we expect each physical cache node to cache roughly the same number of files (within a multiple of 4 or so).
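The tree construction and filename lookup described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the 32-bit truncated SHA-256 stands in for "some good hash function" h(), the names build_tree, lookup and bit_at are invented for this example, the list of cache nodes must be non-empty, and all h(C) values are assumed to be distinct.

```python
import hashlib
from typing import List, Union

BITS = 32  # assumed hash width for this sketch

# A leaf is a cache node name; an interior node is a (left, right) pair,
# where a missing subtree is None.
Tree = Union[str, tuple]


def h(name: str) -> int:
    """Stand-in for a good hash: the first 32 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "big")


def bit_at(value: int, depth: int) -> int:
    """The depth'th most significant bit of a BITS-wide value."""
    return (value >> (BITS - 1 - depth)) & 1


def build_tree(cache_nodes: List[str], depth: int = 0) -> Tree:
    """Split cache nodes on successive bits of h(C) until one remains."""
    if len(cache_nodes) == 1:
        return cache_nodes[0]
    zeros = [c for c in cache_nodes if bit_at(h(c), depth) == 0]
    ones = [c for c in cache_nodes if bit_at(h(c), depth) == 1]
    left = build_tree(zeros, depth + 1) if zeros else None
    right = build_tree(ones, depth + 1) if ones else None
    return (left, right)


def lookup(tree: Tree, file_name: str) -> str:
    """Find the cache node C that maximizes h(F) xor h(C)."""
    hf = h(file_name)
    depth = 0
    while not isinstance(tree, str):
        left, right = tree
        # A 0 bit in h(F) prefers the right subtree (bit 1 in h(C)),
        # a 1 bit prefers the left subtree; fall back if it is missing.
        if bit_at(hf, depth) == 0:
            preferred, fallback = right, left
        else:
            preferred, fallback = left, right
        tree = preferred if preferred is not None else fallback
        depth += 1
    return tree


caches = [f"cache-{i}" for i in range(8)]
tree = build_tree(caches)
chosen = lookup(tree, "report.pdf")
```

The result of lookup should agree with the linear scan max(caches, key=lambda c: h(F) ^ h(c)), since both pick the largest xor weight; the tree only avoids evaluating every cache node.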
drop out effects
When some physical cache node fails,
everyone deletes the corresponding leaf node from their copy of this tree.
(Everyone also deletes every interior node that then has no leaf descendants).
This doesn't require moving around any files cached on any other cache node -- they still map to the same cache node they always did.
(The right-most leaf node in a tree is still the right-most leaf node in that tree, no matter how many other nodes in that tree are deleted).
For example,
    ....
       \
        |
       / \
      |   |
     /   / \
    |   X   |
   / \     / \
  V   W   Y   Z
With this O(log N) algorithm, when cache node X dies, leaf X is deleted from the tree, and all its files become (hopefully relatively evenly) distributed between Y and Z -- none of the files from X end up at V or W or any other cache node.
All the files that previously went to cache nodes V, W, Y, Z continue to go to those same cache nodes.
rebalancing after dropout
Many cache nodes failing, or new cache nodes being added, or both, may make the tree "too unbalanced".
Picking a new hash function is a big hassle after we've added a bunch of files to the cache, so rather than pick a new hash function like we did when initially constructing the tree, maybe it would be better to somehow rebalance the tree by removing a few nodes, renaming them with some new semi-random names, and then adding them back to the system.
Repeat until the system is no longer "too unbalanced".
(Start with the most unbalanced nodes -- the nodes caching the least amount of data).
comments
p.s.:
I think this may be pretty close to what mcdowella was thinking,
but with more details filled in to clarify that (a) yes, it is log(N) because it's a binary tree that is "not too unbalanced", (b) it doesn't have "replicas", and (c) when one cache node fails, it doesn't require any remapping of files that were not on that cache node.
p.p.s.:
I'm pretty sure that Wikipedia page is wrong to imply that typical implementations of rendezvous hashing occur in O(log N) time, where N is the number of cache nodes.
It seems to me (and I suspect the original designers of the hash as well) that the time it takes to (internally, without communicating) recalculate a hash against every node in the network is going to be insignificant and not worth worrying about compared to the time it takes to fetch data from some remote cache node.
My understanding is that rendezvous hashing is almost always implemented with a simple linear algorithm that uses O(N) time, where N is the number of cache nodes, every time we get a new filename F and want to choose the cache node for that file.
Such a linear algorithm has the advantage that it can use a "better" hash function than the above xor-based w(), so when some physical cache node dies, all the files that were cached on the now-dead node are expected to become evenly distributed among all the remaining nodes.
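For comparison, the simple linear algorithm might look like the following Python sketch. The weight function here (SHA-256 over the node name and file name joined by "|") is an assumption chosen for the example, not a prescribed HRW weight, and the cache node names are made up.

```python
import hashlib
from typing import List


def weight(file_name: str, cache_node: str) -> int:
    """Hypothetical w(F, C): 64 bits of SHA-256 over both names."""
    digest = hashlib.sha256(f"{cache_node}|{file_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def choose_node(file_name: str, cache_nodes: List[str]) -> str:
    """O(N) rendezvous hashing: score every cache node, keep the highest."""
    return max(cache_nodes, key=lambda c: weight(file_name, c))


nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
files = [f"file-{i}" for i in range(200)]
before = {f: choose_node(f, nodes) for f in files}

# Simulate cache-b dying: only files that lived on cache-b should move.
survivors = [n for n in nodes if n != "cache-b"]
after = {f: choose_node(f, survivors) for f in files}
moved = [f for f in files if before[f] != after[f]]
```

Every file in moved was previously on cache-b, and because the weights are independent per node, those files can land on any of the surviving nodes rather than only on a tree neighbour.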

heap and tree data structure implementation difference

So I see that trees are usually implemented as a list where each node is dynamically allocated and each node contains pointers to two of its children.
But a heap is almost always implemented (or so is recommended in textbooks) using an array. Why is that? Is there some underlying assumption about the uses of these two data structures? For example, if you are implementing a priority queue using a min heap, then the number of nodes in the queue is constant and so it can be implemented using an array of fixed size. But when you are talking or teaching about a heap in general, why recommend implementing it using an array? Or, to flip the question a bit, why not recommend learning about trees with an implementation using arrays?
(I assume by heap you mean binary heap; other heaps are almost always linked nodes.)
A binary heap is always a complete tree, and no operation on it moves whole subtrees around or otherwise alters the topology of the tree in any nontrivial way. This is not an assumption, the first is part of the definition of a heap and the second is immediately obvious from the definition of the operations.
First, since the Ahnentafel layout requires reserving space for every internal node (and all leaf nodes except the rightmost ones), an incomplete tree implemented this way would waste space for nodes that don't exist. Conversely, for a complete tree it's the most efficient layout possible, since all space is actually used for node data, and no space is needed for pointers.
Second, moving a subtree in the array would require copying all child elements to their new positions (since the left child's index is always twice the parent's index, the former changes when the latter changes, recursively down to the leafs). When you have nodes linked via pointers, you only need to move a few pointers around regardless of how large the trees below those pointers are. Moving subtrees is a core component of many algorithms of trees, including all kinds of binary search trees. It needs to be lightning fast for those algorithms to be efficient. Binary heap operations however never need to do this so it's a non-issue.
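To make the index arithmetic concrete, here is a small Python sketch of a min-heap stored in the 1-based array layout, where the children of index i sit at 2i and 2i + 1 and slot 0 is unused. The function names are chosen for this example. Note that the only data movement is swapping single elements along one root-to-leaf path; no subtree is ever relocated.

```python
from typing import List, Optional


def sift_down(heap: List[Optional[int]], i: int, size: int) -> None:
    """Restore min-heap order below index i; heap[0] is an unused slot."""
    while True:
        left, right = 2 * i, 2 * i + 1
        smallest = i
        if left <= size and heap[left] < heap[smallest]:
            smallest = left
        if right <= size and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest


def heapify(heap: List[Optional[int]]) -> None:
    """Turn heap[1:] into a min-heap in place."""
    size = len(heap) - 1
    for i in range(size // 2, 0, -1):
        sift_down(heap, i, size)


def pop_min(heap: List[Optional[int]]) -> int:
    """Remove and return the smallest element (the heap must be non-empty)."""
    size = len(heap) - 1
    smallest = heap[1]
    heap[1] = heap[size]  # move the last element to the root
    heap.pop()
    sift_down(heap, 1, size - 1)
    return smallest


heap = [None, 5, 3, 8, 1, 9, 2]
heapify(heap)
drained = [pop_min(heap) for _ in range(6)]
```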

Auto-balancing (or cheaply balanced) 3D datastructure

I am working on a tool that requires a 3D "voxel-based" engine. By that I mean it will involve adding and removing cubes from a grid. In order to manage these cubes I need a data structure that allows for quick insertions and deletes. The problem I've seen with k-d trees and octrees is that it seems like they would frequently need to be recreated (or at least rebalanced) because of these operations.
Before I jumped in I wanted to get opinions on what the best way to go about this would be.
Some more details:
x,y,z position is in integer space
needs to be efficient enough for a real-time application
there is no hard limit on the number of cubes that would be used.
In all likelihood the number will most often be inconsequentially
low (<100), however I would like to have the tool handle as many
cubes as possible
I guess the ultimate question is what is the best way to manage what is essentially 3D point data in a way that can handle frequent insertions and deletes?
(No I'm not making Minecraft)
Octrees are easy to update dynamically. Typically the tree is refined based on a per leaf upper/lower population count:
When a new item is inserted, it is pushed onto the item list for the enclosing leaf node. If the upper population count is exceeded, the leaf is refined.
When an existing item is erased, it is removed from the item list for the enclosing leaf node. If the lower population count is reached, the leaf's siblings are scanned. If all siblings are leaf nodes and their cumulative item count is less than the upper population count, the set of siblings is deleted and the items are pushed onto the parent.
Both operations are local, traversing only the height of the tree, which is O(log(n)) for well distributed point sets.
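A minimal Python sketch of the insertion half of this scheme is shown below. The names OctreeNode and MAX_ITEMS, the power-of-two cube size, and the choice of 4 as the upper population count are all assumptions for the example. Points are assumed to lie inside the root cube, and the erase/merge step is left out to keep the sketch short.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int, int]
MAX_ITEMS = 4  # assumed upper population count per leaf


class OctreeNode:
    def __init__(self, origin: Point, size: int) -> None:
        self.origin = origin  # minimum corner of this cube
        self.size = size      # edge length, assumed to be a power of two
        self.items: List[Point] = []
        self.children: Optional[List["OctreeNode"]] = None

    def _child_for(self, p: Point) -> "OctreeNode":
        """Pick the octant containing p: one bit per axis."""
        half = self.size // 2
        index = 0
        for axis in range(3):
            if p[axis] >= self.origin[axis] + half:
                index |= 1 << axis
        return self.children[index]

    def insert(self, p: Point) -> None:
        node = self
        while node.children is not None:  # descend to the enclosing leaf
            node = node._child_for(p)
        node.items.append(p)
        # Refine only while the cube can still be split.
        if len(node.items) > MAX_ITEMS and node.size > 1:
            node._refine()

    def _refine(self) -> None:
        """Split this leaf into 8 children and push its items down."""
        half = self.size // 2
        ox, oy, oz = self.origin
        self.children = []
        for i in range(8):
            corner = (ox + half * (i & 1),
                      oy + half * ((i >> 1) & 1),
                      oz + half * ((i >> 2) & 1))
            self.children.append(OctreeNode(corner, half))
        items, self.items = self.items, []
        for p in items:
            self._child_for(p).insert(p)

    def all_points(self) -> List[Point]:
        if self.children is None:
            return list(self.items)
        return [p for child in self.children for p in child.all_points()]


root = OctreeNode((0, 0, 0), 16)
points = [(1, 1, 1), (2, 3, 4), (9, 9, 9), (15, 0, 7), (8, 8, 0), (0, 15, 15)]
for point in points:
    root.insert(point)
```

Each insert touches only the nodes on the path from the root to one leaf, plus at most a local split, which is what keeps the update cost tied to the tree height.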
KD-trees, on the other hand, are not easy to update dynamically, since their structure is based on the distribution of the full point set.
There are also a number of other spatial data structures that support dynamic updates - R-trees, Delaunay triangulations to name a few, but it's not clear that they'd offer better performance than an Octree. I'm not aware of any spatial structure that supports better than O(log(n)) dynamic queries.
Hope this helps.

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked up CLRS (Intro to Algorithms), it seemed to make use of arrays for keys and children. My question is: is there any way to reduce this wasted space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When finding a specific key, binary search is used to find the right position. Using binary search keeps the complexity of the search algorithm in a B-tree at O(logn).
If you replace the preallocated array with some kind of linked list, you lose the random access that binary search relies on, unless you use a more complex data structure, such as a skip list, to keep the search at O(logn). But that is totally unnecessary; at that point a skip list by itself is better.
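As a small illustration of why the sorted key array matters, here is a Python sketch of searching a B-tree using the standard bisect module inside each node. The BTreeNode class and the tiny sample tree are assumptions for this example, and insertion, splitting and deletion are not shown.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BTreeNode:
    keys: List[int]  # kept sorted
    children: List["BTreeNode"] = field(default_factory=list)  # empty for a leaf


def search_node(keys: List[int], key: int) -> Tuple[bool, int]:
    """Binary search within one node.

    Returns (found, i): if found, keys[i] == key; otherwise i is the index
    of the child subtree that could contain key.
    """
    i = bisect_left(keys, key)
    return (i < len(keys) and keys[i] == key), i


def btree_search(node: BTreeNode, key: int) -> bool:
    while True:
        found, i = search_node(node.keys, key)
        if found:
            return True
        if not node.children:
            return False
        node = node.children[i]


root = BTreeNode(
    keys=[20, 40],
    children=[BTreeNode([5, 10]), BTreeNode([25, 30]), BTreeNode([45, 50])],
)
hit = btree_search(root, 30)
miss = btree_search(root, 35)
```

bisect_left needs O(1) indexed access to the keys, which is exactly what a singly linked list cannot offer.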

Binary Search Tree for specific intent

We all know there are plenty of self-balancing binary search trees (BST), the most famous being the Red-Black and the AVL. It might be useful to take a look at AA-trees and scapegoat trees too.
I want to do deletions insertions and searches, like any other BST. However, it will be common to delete all values in a given range, or deleting whole subtrees. So:
I want to insert, search, remove values in O(log n) (balanced tree).
I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
It might be useful to delete several values in a row, before balancing the tree
I will most often insert 2 values at once, however this is not a rule (just a tip in case there is a tree data structure that takes this into account)
Is there a variant of AVL or RB that helps me with this? Scapegoat trees look closer to this, but would also need some changes. Can anyone who has experience with them share some thoughts?
More precisely, which balancing procedure and/or removal procedure would help me keep these actions time-efficient?
It is possible to delete a range of values from a BST in O(logn + number of objects).
The easiest way I know is to work with the Deterministic Skip List data structure (you might want to read a bit about this data structure before you go on).
In the deterministic skip list all of the real values are stored in the bottom level, and there are pointers on upper levels to them. Insert, search and remove are done in O(logn).
The range deletion operation can be done according to the following algorithm:
Find the first element in the range - O(logn)
Go forward in the linked list, and remove all elements that are still in the range. If there are elements with pointers to the upper levels - remove them too, until reaching the topmost level (removal from a linked list) - O(number of deleted objects)
Fix the pointers to fit deterministic skip list (2-3 elements between every pointer upward)
The total complexity of the range delete is O(logn + number of objects in the range).
Notice that if you choose to work with a random skip list, you get the same complexity, but on average, and not worst case. The plus is that you don't have to fix the upper level pointers to meet the 2-3 demand.
A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.
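The range deletion can be sketched in Python as below. Note that this uses a randomized skip list, not the deterministic one, so the bounds hold in expectation and the step of fixing up the 2-3 spacing is not needed. The class names, the MAX_LEVEL cap, the coin-flip probability of 0.5 and the inclusive [low, high] range are assumptions for this example.

```python
import random
from typing import List, Optional

MAX_LEVEL = 16  # assumed cap on tower height


class SkipNode:
    def __init__(self, key: Optional[int], level: int) -> None:
        self.key = key
        self.forward: List[Optional["SkipNode"]] = [None] * level


class SkipList:
    def __init__(self, seed: int = 0) -> None:
        self.head = SkipNode(None, MAX_LEVEL)  # sentinel; its key is never compared
        self.rng = random.Random(seed)

    def _predecessors(self, key: int) -> List[SkipNode]:
        """For each level, the last node whose key is strictly less than key."""
        update = [self.head] * MAX_LEVEL
        node = self.head
        for level in range(MAX_LEVEL - 1, -1, -1):
            nxt = node.forward[level]
            while nxt is not None and nxt.key < key:
                node = nxt
                nxt = node.forward[level]
            update[level] = node
        return update

    def insert(self, key: int) -> None:
        update = self._predecessors(key)
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        node = SkipNode(key, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def delete_range(self, low: int, high: int) -> int:
        """Remove all keys k with low <= k <= high; return how many were removed."""
        update = self._predecessors(low)  # find the first element in the range
        node = update[0].forward[0]
        removed = 0
        while node is not None and node.key <= high:
            # Unlink this node from every level its tower reaches.
            for i in range(len(node.forward)):
                update[i].forward[i] = node.forward[i]
            node = node.forward[0]
            removed += 1
        return removed

    def contains(self, key: int) -> bool:
        candidate = self._predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def to_list(self) -> List[int]:
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out


sl = SkipList(seed=1)
for k in [5, 1, 9, 3, 7, 2, 8]:
    sl.insert(k)
removed = sl.delete_range(3, 7)
remaining = sl.to_list()
```

The search for the first element costs expected O(log n), and each removed node is unlinked in time proportional to its tower height, which averages to a constant per node.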
Long ago in the pre-STL days I wrote my own B-Tree (BST) algorithm because I had a rather large data set at the time (roughly 700K items in 2 trees that were interdependent). I found that rebalancing after every 100-200 insertions/deletions was the peak performance I could get at the time based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.
In short, you could apply a modification trigger for the rebalancing, and allow for forced rebalancing when you've completed all your modifications.
The improvement was remarkable. The initial straight load was not complete after 25m (killed the process). Rebalancing as we went also was killed after 15m. The restricted modification loads with a rebalance every 100 mods loaded and ran in less than 3m. Note that during the "run" portion, there were 0-8 modifications to the tree per initial entry. You really need to consider whether you always need to be in-balance when the tree will be modified again in the near term.
Hmm, what about B-trees? They are also balanced, and if you choose a big-order one (it depends on how many items you have), you will save a bunch of object creation/destruction time.
To 2. If you have a B-tree of order 100, you can remove up to 100 items by one function call.
To 3. This feature can be applied to almost any of the trees, just implement a RemoveSome() function that removes N items and does a rebalance. For B-trees, it's a bit trickier, but can be done.
Note: I supposed you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.
It should be easy to implement deleting a node and its sub nodes in an AVL tree if every node stores its height instead of a balance factor. After deleting a node keep rotating until the two child nodes differ by no more than one. Then move up the tree and repeat. The only real difference from a normal deletion will be a while instead of an if for testing the heights.
The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.

Resources