What is a zip tree, and how does it work?

I've heard of a new balanced BST data structure called a zip tree. What is the zip tree? How does it work?

At a high level, a zip tree is a
randomized balanced binary search tree,
that is a way of encoding a skiplist as a BST, and
that uses a pair of operations called zipping and unzipping rather than tree rotations.
The first bullet point - that zip trees are randomized, balanced BSTs - gives a feel for what a zip tree achieves at a high level. It's a type of balanced binary search tree that, like treaps and unlike red/black trees, uses randomization to balance the tree. In that sense, a zip tree isn't guaranteed to be a balanced tree, but rather has a very high probability of being balanced.
The second bullet point - that zip trees are encodings of skiplists - shows where zip trees come from and why, intuitively, they're balanced. You can think of a zip tree as a way of taking the randomized skiplist data structure, which supports all major operations in expected time O(log n), and representing it as a binary search tree. This provides the intuition for where zip trees come from and why we'd expect them to be so fast.
The third bullet point - zip trees use zipping and unzipping rather than tree rotations - accounts for the name of the zip tree and what it feels like to code one up. Zip trees differ from other types of balanced trees (say, red/black trees or AVL trees) in that nodes are moved around the tree not through rotations, but through a pair of operations that convert a larger chain of nodes into two smaller chains or vice-versa.
The rest of this answer dives deeper into where zip trees come from, how they work, and how they're structured.
Review: Skip Lists
To understand where zip trees come from, let's begin with a review of another data structure, the skiplist. A skiplist is a data structure that, like a binary search tree, stores a collection of elements in sorted order. Skiplists, however, aren't tree structures. Rather, a skiplist works by storing elements in sorted order through several layers of linked lists. A sample skiplist is shown here:
As you can see, the elements are represented in sorted order. Each element has an associated height, and is part of a number of linked lists equal to its height. All of the elements of the skiplist participate in the bottom layer. Ideally, roughly half of the nodes will be in the layer above that, roughly a quarter of the nodes will be in the layer above that, roughly an eighth of the nodes will be in the layer above that, etc. (More on how this works later on.)
To do a lookup in a skiplist, we begin in the topmost layer. We walk forward in the skiplist until either (1) we find the element we're looking for, (2) we find an element bigger than the one we're looking for, or (3) we hit the end of the list. In the first case, we uncork the champagne and celebrate because we discovered the item we were searching for and there's nothing more to do. In the second or third case, we've "overshot" the element that we're looking for. But that's nothing to worry about - in fact, that's helpful because it means that what we're looking for must be between the node we hit that "overshot" and the node that comes before it. So we'll go to the previous node, drop down one layer, and pick up our search from there.
For example, here's how we'd do a search for 47:
Here, the blue edges indicate the links followed where we moved forward, and the red edges indicate where we overshot and decided to descend down a layer.
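As a rough illustration of the lookup procedure described above, here's a minimal Python sketch. The names (SkipNode, build_skiplist, contains) and the sample keys and heights are made up for this example, since the original picture isn't reproduced here. It assumes a head sentinel that is as tall as the tallest node.

```python
class SkipNode:
    """One skiplist element; next[i] is the successor in layer i (0 = bottom)."""
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height


def build_skiplist(items):
    """Build a skiplist from (key, height) pairs sorted by key; return the head sentinel."""
    max_height = max(h for _, h in items)
    head = SkipNode(None, max_height)
    last = [head] * max_height          # last node seen so far in each layer
    for key, height in items:
        node = SkipNode(key, height)
        for layer in range(height):
            last[layer].next[layer] = node
            last[layer] = node
    return head


def contains(head, target):
    """Start in the top layer; drop down a layer whenever the next step would overshoot."""
    node = head
    for layer in reversed(range(len(head.next))):
        # Walk forward while the next key is still smaller than the target.
        while node.next[layer] is not None and node.next[layer].key < target:
            node = node.next[layer]
        nxt = node.next[layer]
        if nxt is not None and nxt.key == target:
            return True                 # found it in this layer
        # Otherwise we overshot or hit the end: stay on `node`, descend one layer.
    return False


head = build_skiplist([(10, 1), (20, 3), (30, 1), (47, 2), (50, 1)])
print(contains(head, 47), contains(head, 35))
```

Note how the search never revisits a node to its left: each layer only narrows the range handed down by the layer above it.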
A powerful intuition for how skiplists work - which we'll need later on as we transition to zip trees - is that the topmost layer of the skiplist partitions the remaining elements of the skiplist into different ranges. You can see this here:
Intuitively, a skiplist search will be "fast" if we're able to skip looking at most of the elements. Imagine, for example, that the second-to-last layer of the skiplist only stores every other element of the skiplist. In that case, traversing the second-to-last layer is twice as fast as traversing the bottom layer, so we'd expect a lookup starting in the second-to-last layer to take half as much time as a lookup starting in the bottom layer. Similarly, imagine that the layer above that one only stores every other element from the layer below it. Then searching in that layer will take roughly half as much time as searching the layer below it. More generally, if each layer only stores roughly half the elements of the layer below it, then we could skip past huge amounts of the elements in the skiplist during a search, giving us good performance.
The skiplist accomplishes this by using the following rule: whenever we insert an element into the skiplist, we flip a coin until we get heads. We then set the height of the newly-inserted node to be the number of coins that we ended up tossing. This means it has a 50% chance to stay in its current layer and a 50% chance to move to the layer above it, which means, in aggregate, that roughly half the nodes will only be in the bottom layer, roughly half of what's left will be one layer above that, roughly half of what's left will be one layer above that, etc.
(For those of you with a math background, you could also say that the height of each node in the skiplist is a Geom(1/2) random variable.)
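The coin-flipping rule is short enough to sketch directly. The function name random_height and the optional rng parameter are just choices for this example; any source of fair coin flips would do.

```python
import random


def random_height(rng=random):
    """Flip a fair coin until it comes up heads; return the total number of flips."""
    flips = 1
    while rng.random() < 0.5:   # treat this outcome as "tails": flip again
        flips += 1
    return flips


rng = random.Random(0)
sample = [random_height(rng) for _ in range(10)]
print(sample)
```

About half of the returned heights will be 1, about a quarter will be 2, and so on, which is exactly the layer-by-layer halving described above.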
Here's an example of inserting 42 into the skiplist shown above, using a height of 1:
Deletion from a skiplist is also a fairly simple operation: we simply splice it out of whatever linked lists it happens to be in. That means that if we were to delete the 42 we just inserted from the above list, we'd end up with the same skiplist that we started with.
It can be shown that the expected cost of an insertion, deletion, or lookup in a skiplist is O(log n), based on the fact that the number of items in each list is roughly half the number of items in the one below it. (That means we'd expect to see O(log n) layers, and only take a constant number of steps in each layer.)
From Skiplists to Zip Trees
Now that we've reviewed skiplists, let's talk about where the zip tree comes from.
Let's imagine that you're looking at the skiplist data structure. You really like the expected O(log n) performance of each operation, and you like how conceptually simple it is. There's just one problem - you really don't like linked lists, and the idea of building something with layers upon layers of linked lists doesn't excite you. On the other hand, you really love binary search trees. They've got a really simple structure - each node has just two pointers leaving it, and there's a simple rule about where everything gets placed. This question then naturally arises: could you get all the benefits of a skiplist, except in BST form?
It turns out that there's a really nice way to do this. Let's imagine that you have the skiplist shown here:
Now, imagine you perform a lookup in this skiplist. How would that search work? Well, you'd always begin by scanning across the top layer of the skiplist, moving forward until you found a key that was bigger than the one you were looking for, or until you hit the end of the list and found that there were no more nodes at the top level. From there, you'd then "descend" one level into a sub-skiplist containing only the keys between the last node you visited and the one that overshot.
It's possible to model this exact same search as a BST traversal. Specifically, here's how we might represent the top layer of that skiplist as a BST:
Notice that all these nodes chain to the right, with the idea being that "scanning forward in the skiplist" corresponds to "visiting larger and larger keys." In a BST, moving from one node to a larger node corresponds to moving right, hence the chain of nodes to the right.
Now, each node in a BST can have up to two children, and in the picture shown above each node has either zero children or one child. If we fill in the missing children by marking what ranges they correspond to, we get this.
And hey, wait a minute! It sure looks like the BST is partitioning the space of keys the same way that the skiplist is. That's promising, since it suggests that we're on to something here. Plus, it gives us a way to fill in the rest of the tree: we can recursively convert the subranges of the skiplist into their own BSTs and glue the whole thing together. If we do that, we get this tree encoding the skiplist:
We now have a way of representing a skiplist as a binary search tree. Very cool!
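Here's one way that recursive conversion might look in Python. This is a sketch under the assumption that the skiplist is given as a list of (key, height) pairs sorted by key; the names Node and skiplist_to_bst, and the sample data, are invented for the example. The leftmost tallest element becomes the root, everything before it becomes the left subtree, and everything after it becomes the right subtree.

```python
class Node:
    def __init__(self, key, height):
        self.key = key
        self.height = height    # height of the skiplist node this came from
        self.left = None
        self.right = None


def skiplist_to_bst(items):
    """Convert (key, height) pairs, sorted by key, into the BST encoding of the skiplist."""
    if not items:
        return None
    top = max(height for _, height in items)
    # The first node we would reach in the top layer becomes the root of this range.
    i = next(idx for idx, (_, height) in enumerate(items) if height == top)
    root = Node(*items[i])
    root.left = skiplist_to_bst(items[:i])        # keys before the root
    root.right = skiplist_to_bst(items[i + 1:])   # rest of the top layer, plus later ranges
    return root


tree = skiplist_to_bst([(1, 1), (2, 2), (3, 1), (4, 3), (5, 1), (6, 3), (7, 2)])
print(tree.key, tree.left.key, tree.right.key)
```

The list slicing makes this quadratic in the worst case, which is fine for seeing the shape of the result but not how you'd build the tree in practice.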
Now, could we go the other way around? That is, could we go from a BST to a skiplist? In general, there's no one unique way to do this. After all, when we converted the skiplist to a BST, we did lose some information. Specifically, each node in the skiplist has an associated height, and while each node in our BST has a height as well it's not closely connected to the skiplist node heights. To address this, let's tag each BST node with the height of the skiplist node that it came from. This is shown here:
Now, some nice patterns emerge. For starters, notice that each node's associated number is bigger than its left child's number. That makes sense, since each step to the left corresponds to descending into a subrange of the skiplist, where nodes will have lower heights. Similarly, each node's associated number is greater than or equal to the number of its right child. And that again makes sense - moving to the right either means
continuing forward at the same level that we were already on, in which case the height remains the same, or
hitting the end of a range and descending into a subrange, in which case the height decreases.
Can we say more about the shape of the tree? Sure we can! For example, in a skiplist, each node's height is picked by flipping coins until we get heads, then counting how many total coins we flipped. (Or, as before, it's geometrically distributed with probability 1/2). So if we were to imagine building a BST that corresponded to a skiplist, we'd want the numbers assigned to the nodes to work out the same way.
Putting these three rules together, we get the following, which defines the shape of our tree, the zip tree!
A zip tree is a binary search tree where
Each node has an associated number called its rank. Ranks are assigned randomly to each node by flipping coins until heads is flipped, then counting how many total coins were tossed.
Each node's rank is strictly greater than its left child's rank.
Each node's rank is greater than or equal to its right child's rank.
It's amazing how something like a skiplist can be represented as a BST by writing out such simple rules!
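Those rules translate almost word for word into a checker. The following is a small sketch (Node and is_zip_tree are names chosen for this example) that verifies the BST ordering together with the two rank rules; it can't check that ranks were chosen randomly, of course.

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key, self.rank = key, rank
        self.left, self.right = left, right


def is_zip_tree(node, lo=None, hi=None):
    """Check BST order plus: rank > left child's rank, rank >= right child's rank."""
    if node is None:
        return True
    if lo is not None and node.key <= lo:
        return False
    if hi is not None and node.key >= hi:
        return False
    if node.left is not None and node.left.rank >= node.rank:
        return False            # left child must have a strictly smaller rank
    if node.right is not None and node.right.rank > node.rank:
        return False            # right child may tie, but not exceed
    return (is_zip_tree(node.left, lo, node.key)
            and is_zip_tree(node.right, node.key, hi))


good = Node(30, 3, left=Node(10, 2, left=Node(5, 1), right=Node(18, 2)))
print(is_zip_tree(good))
```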
Inserting Elements: Unzipping
Let's suppose you have a zip tree. How would you insert a new element into it?
We could in principle answer this question by looking purely at the rules given above, but I think it's a lot easier to figure this out by remembering that zip trees are skiplists in disguise. For example, here's the above zip tree, with its associated skiplist:
Now, suppose we want to insert 18 into this zip tree. To see how this might play out, imagine that we decide to give 18 a rank of 2. Rather than looking at the zip tree, let's look at what would happen if we did the insertion into the skiplist. That would give rise to this skiplist:
If we were to take this skiplist and encode it as a zip tree, we'd get the following result:
What's interesting about this is that we can see what the tree needs to look like after the insertion, even if we don't know how to perform the insertion. We can then try to figure out what the insertion logic needs to look like by reverse-engineering it from these "before" and "after" pictures.
Let's think about what change this insertion made to our zip tree. To begin with, let's think back to our intuition for how we encode skiplists as zip trees. Specifically, chains of nodes at the same level in a skiplist with no intervening "higher" elements map to chains of nodes in the zip tree that lean to the right. Inserting an element into the skiplist corresponds to adding some new element into one of the levels, which has the effect of (1) adding in something new into some level of the skiplist, and (2) taking chains of elements in the skiplist that previously were adjacent at some level, then breaking those connections.
For example, when we inserted 18 into the skiplist shown here, we added something new into the blue chain highlighted here, and we broke all of the red chains shown here:
What is that going to translate into in our zip tree? Well, we can highlight the blue link where our item was inserted here, as well as the red links that were cut:
Let's see if we can work out what's going on here. The blue link here is, fortunately, pretty easy to find. Imagine we do a regular BST insertion to add 18 into our tree. As we're doing so, we'll pause when we reach this point:
Notice that we've hit a key with the same rank as us. That means that, if we were to keep moving to the right, we'd trace out this region of the skiplist:
To find the blue edge - the place where we go - we just need to walk down through this chain of nodes until we find one bigger than us. The blue edge - our insertion point - is then given by the edge between that node and the one above it.
We can identify this location in a different way: we've found the blue edge - our insertion point - when we've reached a point where the node to insert (1) has a bigger rank than the node to the left, (2) has a rank that's greater than or equal to the node on the right, and (3) if the node to the right has the same rank, our new item to insert is less than the item to the right. The first two rules ensure that we're inserting into the right level of the skiplist, and the last rule ensures that we insert into the right place in that level of the skiplist.
Now, where are our red edges? Intuitively, these are the edges that were "cut" because 18 has been added into the skiplist. Those would be items that previously were between the two nodes on opposite ends of the blue edge, but which now need to get partitioned into the new ranges defined by the split version of that blue edge.
Fortunately, those edges appear in really nice places. Here's where they map to:
(In this picture, I've placed the new node 18 in the middle of the blue edge that we identified in the skiplist. This causes the result not to remain a BST, but we'll fix that in a minute.)
Notice that these are the exact same edges that we'd encounter if we were to finish doing our regular BST insertion - it's the path traced out by looking for 18! And something really nice happens here. Notice that
each time we move to the right, the node, when cut, goes to the right of 18, and
each time we move to the left, the node, when cut, goes to the left of 18.
In other words, once we find the blue edge where we get inserted, we keep walking as though we were doing our insertion as usual, keeping track of the nodes where we went left and the nodes where we went right. We can then chain together all the nodes where we went left and chain together all the nodes where we went right, gluing the results together under our new node. That's shown here:
This operation is called unzipping, and it's where we get the name "zip tree" from. The name kinda makes sense - we're taking two interleaved structures (the left and right chains) and splitting them apart into two simpler linear chains.
To summarize:
Inserting x into a zip tree works as follows:
Assign a random rank to x by flipping coins and counting how many flips were needed to get heads.
Do a search for x. Stop the search once you reach a node where
the node's left child has a lower rank than x,
the node's right child has a rank less than or equal to x, and
the node's right child, if it has the same rank as x, has a larger key than x.
Perform a unzip. Specifically:
Continue the search for x as before, recording when we move left and when we move right.
Chain all the nodes together where we went left by making each the left child of the previously-visited left-moving node.
Chain all the nodes together where we went right by making each the right child of the previously-visited right-moving node.
Make those two chains the children of the node x.
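The steps above can be sketched in Python roughly as follows. This is one possible rendering, not a reference implementation: it assumes distinct keys, takes the rank as a parameter so the example is deterministic (in real use you'd draw it by coin flips), and uses a recursive unzip for brevity. The names Node, unzip, insert, and shape are invented for the example.

```python
class Node:
    def __init__(self, key, rank):
        self.key, self.rank = key, rank
        self.left = self.right = None


def unzip(node, key):
    """Split the subtree at `node` into (keys < key, keys > key) along the search path."""
    if node is None:
        return None, None
    if node.key < key:                       # we would move right: node joins the left chain
        smaller, larger = unzip(node.right, key)
        node.right = smaller
        return node, larger
    smaller, larger = unzip(node.left, key)  # we would move left: node joins the right chain
    node.left = larger
    return smaller, node


def insert(root, key, rank):
    """Insert key with the given rank; return the (possibly new) root."""
    new = Node(key, rank)
    parent, cur = None, root
    # Keep descending while the current node must stay above the new one:
    # it has a higher rank, or the same rank and a smaller key.
    while cur is not None and (cur.rank > rank or (cur.rank == rank and cur.key < key)):
        parent = cur
        cur = cur.left if key < cur.key else cur.right
    new.left, new.right = unzip(cur, key)
    if parent is None:
        return new
    if key < parent.key:
        parent.left = new
    else:
        parent.right = new
    return root


def shape(node):
    """Nested-tuple view (key, left, right) of the tree, handy for comparing shapes."""
    return None if node is None else (node.key, shape(node.left), shape(node.right))


root = None
for key, rank in [(10, 2), (20, 1), (30, 3), (18, 2), (5, 1), (25, 2)]:
    root = insert(root, key, rank)
print(shape(root))
```

Because the shape of a zip tree is determined entirely by the keys and their ranks, inserting the same (key, rank) pairs in a different order produces the same tree.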
You might notice that this "unzipping" procedure is equivalent to what you'd get if you performed a different operation. You could achieve the same result by inserting x as usual, then using tree rotations to pull x higher and higher in the tree until it came to rest in the right place. This is a perfectly valid alternative strategy for doing insertions, though it's a bit slower because two passes over the tree are required (a top-down pass to insert at a leaf, then a bottom-up pass to do the rotations).
Removing Elements: Zipping
Now that we've seen how to insert elements, how do we remove them?
Let's begin with a helpful observation: if we insert an item into a zip tree and then remove it, we should end up with the exact same tree that we started with. To see why this is, we can point back to a skiplist. If you add and then remove something from a skiplist, then you end up with the same skiplist that you would have had before. So that means that the zip tree needs to end up looking identical to how it started after we add and then remove an element.
To see how to do this, we'd need to perform two steps:
Undo the unzip operation, converting the two chains of nodes formed back into a linear chain of nodes.
Undo the break of the blue edge, restoring the insertion point of x.
Let's begin with how to undo an unzip operation. This, fortunately, isn't too bad. We can identify the chains of nodes that we made with the unzip operation when we inserted x into the zip tree fairly easily - we simply look at the left and right children of x, then move, respectively, purely to the left and purely to the right.
Now, we know that these nodes used to be linked together in a chain. What order do we reassemble them into? As an example, take a look at this part of a zip tree, where we want to remove 53. The chains to the left and right of 53 are highlighted:
If we look at the nodes making up the left and right chains, we can see that there's only one way to reassemble them. The topmost node of the reassembled chain must be 67, since it has rank 3 and will outrank all other items. After that, the next node must be 41, because it's the smaller of the rank-2 elements and elements with the same rank have smaller items on top. By repeating this process, we can reconstruct the chain of nodes, as shown here, simply by using the rules for how zip trees have to be structured:
This operation, which interleaves two chains together into one, is called zipping.
To summarize, here's how a deletion works:
Deleting a node x from a zip tree works as follows:
Find the node x in the tree.
Perform a zip of its left and right subtrees. Specifically:
Maintain "lhs" and "rhs" pointers, initially to the left and right subtrees.
While both those pointers aren't null:
If lhs's rank is greater than or equal to rhs's rank (on a tie, the smaller key goes on top), make lhs's right child rhs, then advance lhs to what used to be lhs's right child.
Otherwise, make rhs's left child lhs, then advance rhs to point to what used to be rhs's left child.
Rewire x's parent to point to the result of the zip operation rather than x.
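Here's a sketch of deletion in Python, written recursively rather than with explicit lhs/rhs pointers because that makes the rewiring easier to get right. The names Node, zip_trees, delete, and shape are invented for the example. Ties in rank go to the left-hand node, in keeping with the rule that equal ranks put the smaller key on top.

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key, self.rank = key, rank
        self.left, self.right = left, right


def zip_trees(lhs, rhs):
    """Merge two zip trees where every key in lhs is smaller than every key in rhs."""
    if lhs is None:
        return rhs
    if rhs is None:
        return lhs
    if lhs.rank >= rhs.rank:                  # tie: the smaller key (lhs) stays on top
        lhs.right = zip_trees(lhs.right, rhs)
        return lhs
    rhs.left = zip_trees(lhs, rhs.left)
    return rhs


def delete(root, key):
    """Remove key if present; return the (possibly new) root."""
    parent, cur = None, root
    while cur is not None and cur.key != key:
        parent = cur
        cur = cur.left if key < cur.key else cur.right
    if cur is None:
        return root                           # key not in the tree
    merged = zip_trees(cur.left, cur.right)
    if parent is None:
        return merged
    if parent.left is cur:
        parent.left = merged
    else:
        parent.right = merged
    return root


def shape(node):
    return None if node is None else (node.key, shape(node.left), shape(node.right))


tree = Node(50, 3, left=Node(40, 2), right=Node(60, 2))
print(shape(delete(tree, 50)))
```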
More to Explore
To recap our main points: we saw how to represent a skiplist as a BST by using the idea of ranks. That gave rise to the zip tree, which uses ranking rules to determine parent/child relationships. Those rules are maintained using the zip and unzip operations, hence the name.
Doing a full analysis of a zip tree is basically done by reasoning by analogy to a skiplist. We can show, for example, that the expected runtime of an insertion or deletion is O(log n) by pointing at the equivalent skiplist and noting that the time complexity of the equivalent operations there is O(log n). And we can similarly show that these aren't just expected time bounds, but bounds that hold with high probability.
There's a question of how to actually store the information needed to maintain a zip tree. One option would be to simply write the rank of each item down in the nodes themselves. That works, though since ranks are very unlikely to exceed O(log n) due to the nature of geometric random variables, devoting a full machine word to each rank would waste a lot of space. Another alternative would be to use a hash function on node addresses to generate a random, uniformly-distributed integer in some range, then find the position of the least-significant 1 bit to simulate our coin tosses. That increases the costs of insertions and deletions due to the overhead of computing the hash codes, but also decreases the space usage.
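To make the hashing idea concrete, here's a small Python sketch. Python doesn't expose stable node addresses, so this example hashes the key instead, which is a substitution on my part; the helper names rank_from_hash and hashed_rank are also just for illustration. The position of the least-significant 1 bit of a uniformly random integer is 1 with probability 1/2, 2 with probability 1/4, and so on, which matches the coin-flip distribution.

```python
import hashlib


def rank_from_hash(h):
    """1-based position of the least-significant 1 bit of a positive integer h."""
    if h <= 0:
        raise ValueError("expected a positive hash value")
    return (h & -h).bit_length()    # h & -h isolates the lowest set bit


def hashed_rank(key):
    """Derive a rank from a key by hashing it (assumption: SHA-256 of repr(key))."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return rank_from_hash(int.from_bytes(digest, "big"))


print(rank_from_hash(1), rank_from_hash(8), rank_from_hash(12))
```

Since the rank can be recomputed from the key on demand, nothing extra needs to be stored in the node.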
Zip trees aren't the first data structure to map skiplists and BSTs together. Dean and Jones developed an alternative presentation of this idea in 2007. There's also another way to exploit this connection. Here, we started with a randomized skiplist, and used it to derive a randomized BST. But we can run this in reverse as well - we can start with a deterministic balanced BST and use that to derive a deterministic skiplist. Munro, Papadakis, and Sedgewick found a way to do this by connecting 2-3-4 trees and skiplists.
And zip trees aren't the only randomized balanced BST. The treap was the first structure to do this, and with a little math you can show that treaps tend to have slightly lower expected heights than zip trees. The tradeoff, though, is that you need more random bits per node than in a zip tree.
Hope this helps!

Related

Dynamically building a balanced BST with values "in the leaves"?

In their book Computational Geometry (2008), de Berg, et al., describe the data structure underlying their range search algorithm as a balanced BST where "leaves of T store the points of P and the internal nodes of T store splitting values to guide the search."
The Wikipedia page on range trees, which cites de Berg, says: "A 1-dimensional range tree on a set of n points is a binary search tree" such that "each node which is not a leaf stores the largest value of its left subtree."
Examples online construct such trees statically, by first sorting the set of points and then recursively pairing up nodes.
Does there exist an algorithm to build a BST of this nature dynamically (i.e., with the ability to insert additional values into the tree)? Where is it described?
It's possible to adapt just about any tree balancing procedure to work with these two examples, just by treating the leaves separately -- make a balanced tree of the internal nodes, and then take care to keep the leaves in order. Each operation, including balancing, will require you to recalculate the "summary statistics" on at most O(log N) nodes. Those are all the nodes that were updated and their ancestors.
This can be a little complicated, though, and doesn't work for the multi-dimensional range tree, because every level is treated differently from the ones above and below, and that makes tree rotations (which most balancing operations require) invalid.
For these kinds of trees, therefore, where different levels are handled differently, it is usually best to just avoid tree rotations by using a low-order B+tree variant like a 2-3 tree. In a tree like this, nodes can be split and merged, but they never have to change height -- you can implement them so that leaves are always leaves and internal nodes are always internal. The height of the tree is only ever changed by adding or removing the root.
Of course, if you use a tree that can have more than 2 children per node, then your search algorithms will need to change, but the changes are typically trivial.

Rope and self-balancing binary tree hybrid? (i.e Sorted set with fast n-th element lookup)

Is there a data structure for a sorted set that allows quick lookup of the n-th (i.e. the least but n-th) item? That is, something like a hybrid between a rope and a red-black tree.
Seems like it should be possible to either keep track of the size of the left subtree and update it through rotations or do something else clever and I'm hoping someone smart has already worked this out.
Seems like it should be possible to either keep track of the size of the left subtree and update it through rotations […]
Yes, this is quite possible; but instead of keeping track of the size of the left subtree, it's a bit simpler to keep track of the size of the complete subtree rooted at a given node. (You can then get the size of its left subtree by examining its left-child's size.) It's not as tricky as you might think, because you can always re-calculate a node's size as long as its children are up-to-date, so you don't need any extra bookkeeping beyond making sure that you recalculate sizes by working your way up the tree.
Note that, in most mutable red-black tree implementations, 'put' and 'delete' stop walking back up the tree once they've restored the invariants, whereas with this approach you need to walk all the way back up the tree in all cases. That'll be a small performance hit, but at least it's not hard to implement. (In purely functional red-black tree implementations, even that isn't a problem, because those always have to walk the full path back up to create the new parent nodes. So you can just put the size-calculation in the constructor — very simple.)
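As a minimal sketch of the subtree-size idea, here's a plain (unbalanced) BST in Python that stores each node's subtree size and uses it to look up the n-th smallest key. Balancing is deliberately left out to keep the example short; with a red-black tree you'd additionally recompute sizes after each rotation. The names Node, size, insert, and select are chosen for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.size = 1               # number of nodes in the subtree rooted here


def size(node):
    return node.size if node is not None else 0


def insert(node, key):
    """Insert key (duplicates ignored) and refresh sizes on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, n):
    """Return the n-th smallest key, counting from 0."""
    while node is not None:
        left = size(node.left)
        if n < left:
            node = node.left
        elif n == left:
            return node.key
        else:
            n -= left + 1           # skip the left subtree and this node
            node = node.right
    raise IndexError("rank out of range")


root = None
for k in [50, 20, 70, 10, 30, 60, 80]:
    root = insert(root, k)
print(select(root, 0), select(root, 3), select(root, 6))
```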
Edited in response to your comment:
I was vaguely hoping this data structure already had a name so I could just find some implementations out there and that there was something clever one could do to minimize the updating but (while I can find plenty of papers on data structures that are variations of balanced binary trees) I can't figure out a good search term to look for papers that let one lookup the nth least element.
The fancy term for the nth smallest value in a collection is order statistic; so a tree structure that enables fast lookup by order statistic is called an order statistic tree. That second link includes some references that may help you — not sure, I haven't looked at them — but regardless, that should give you some good search terms. :-)
Yes, this is fully possible. Self-balancing tree algorithms do not actually need to be search trees, that is simply the typical presentation. The actual requirement is that nodes be ordered in some fashion (which a rope provides).
What is required is to update the tree weight on insert and erase. Rotations do not require a full update, local is enough. For example, a left rotate requires that the weight of the parent be added to the new parent (since that new parent is the old parent's right child it is not necessary to walk down the new parent's right descent tree since that was already the new parent's left descent tree). Similarly, for a right rotate it is necessary to subtract the weight of the new parent only, since the new parent's right descent tree will become the left descent tree of the old parent.
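The local updates described above can be sketched like this, under the assumption that a node's weight is the number of nodes in its left subtree plus one for itself (the rope-style convention). The names Node, count, rotate_left, and rotate_right are made up for the example.

```python
def count(node):
    """Total nodes in a subtree: the node's weight plus everything down its right side."""
    return 0 if node is None else node.weight + count(node.right)


class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.weight = 1 + count(left)   # assumption: weight = left subtree size + 1


def rotate_left(parent):
    """The right child becomes the new parent; only its weight changes."""
    new_parent = parent.right
    parent.right = new_parent.left
    new_parent.left = parent
    new_parent.weight += parent.weight  # old parent and its left subtree now sit on the left
    return new_parent


def rotate_right(parent):
    """The left child becomes the new parent; only the old parent's weight changes."""
    new_parent = parent.left
    parent.left = new_parent.right
    new_parent.right = parent
    parent.weight -= new_parent.weight  # new parent and its left subtree left our left side
    return new_parent


tree = Node(2, Node(1), Node(4, Node(3), Node(5)))
tree = rotate_left(tree)
print(tree.key, tree.weight)
```

Neither rotation needs to walk any subtree, so each update is constant time.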
I suppose it would be possible to create an insert that updates the weight as it does rotations then adds the weight up any remaining ancestors but I didn't bother when I was solving this problem. I simply added the new node's weight all the way up the tree then did rotations as needed. Similarly for erase, I did the fix-up rotations then subtracted the weight of the node being removed before finally unhooking the node from the tree.

How would you keep an ordinary binary tree (not BST) balanced?

I'm aware of ways to keep binary search trees balanced/self-balancing using rotations.
I am not sure if my case needs to be that complicated. I don't need to maintain any sorted order property like with self-balancing BSTs. I just have an ordinary binary tree that I may need to delete nodes or insert nodes. I need to try to maintain balance in the tree. For simplicity, my binary tree is similar to a segment tree, and every time a node is deleted, all the nodes along the path from the root to this node will be affected (in my case, it's just some subtraction of the nodal values). Similarly, every time a node is inserted, all the nodes from the root to the inserted node's final location will be affected (an addition to nodal values this time).
What would be the most straightforward way to keep a tree such as this balanced? It doesn't need to be strictly as height balanced as AVL trees, but something like RB trees or maybe slightly less balanced is acceptable as well.
If a new node does not have to be inserted at a particular spot -- possibly determined by its own value and the values in the tree -- but you are completely free to choose its location, then you could maintain the shape of the tree as a complete tree:
In a complete binary tree every level, except possibly the last, is completely filled, and all nodes in the last level are as far left as possible.
An array is a very efficient data structure for a complete tree, as you can store the nodes in their order in a breadth-first traversal. Because the tree is given to be complete, the array has no gaps. This structure is commonly used for heaps:
Heaps are usually implemented with an array, as follows:
Each element in the array represents a node of the heap, and
The parent / child relationship is defined implicitly by the elements' indices in the array.
Example of a complete binary max-heap with node keys being integers from 1 to 100 and how it would be stored in an array.
In the array, the first index contains the root element. The next two indices of the array contain the root's children. The next four indices contain the four children of the root's two child nodes, and so on. Therefore, given a node at index i, its children are at indices 2i + 1 and 2i + 2, and its parent is at index floor((i-1)/2). This simple indexing scheme makes it efficient to move "up" or "down" the tree.
Operations
In your case, you would define the insert/delete operations as follows:
Insert: append the node to the end of the array. Then perform the mutation needed to its ancestors (as you described in your question)
Delete: replace the node to be deleted with the node that currently sits at the very end of the array, and shorten the array by 1. Make the updates needed that follow from the change at these two locations -- so two paths from root-to-node are impacted.
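Here's a rough Python sketch of these two operations on an array-backed complete tree. Since the question doesn't spell out what the "nodal values" are, this example assumes each node carries its own value plus the sum of all values in its subtree, and that the ancestor update is adding or subtracting from those sums. The class name CompleteTree and its fields are invented for the example.

```python
class CompleteTree:
    def __init__(self):
        self.val = []     # each node's own value, in breadth-first order
        self.total = []   # assumed "nodal value": sum of values in the node's subtree

    def _add_on_path(self, i, delta):
        """Add delta to total[i] and to the total of every ancestor of i."""
        while True:
            self.total[i] += delta
            if i == 0:
                break
            i = (i - 1) // 2            # parent index

    def insert(self, value):
        """Append at the end of the array, then update the root-to-node path."""
        self.val.append(value)
        self.total.append(0)
        self._add_on_path(len(self.val) - 1, value)

    def delete(self, i):
        """Move the last node into slot i; two root-to-node paths are touched."""
        last = len(self.val) - 1
        moved = self.val[last]
        self._add_on_path(last, -moved)             # detach the last node
        self.val.pop()
        self.total.pop()
        if i != last:
            self._add_on_path(i, moved - self.val[i])
            self.val[i] = moved


tree = CompleteTree()
for v in [5, 3, 8, 1, 4, 7]:
    tree.insert(v)
tree.delete(1)
print(tree.val, tree.total)
```

Both operations touch O(log n) nodes, because a complete tree with n nodes has height about log2(n).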
When balancing non-BSTs, the big question to ask is
Can your tree efficiently support rotations?
Some types of binary trees, like k-d trees, have a specific layer-by-layer structure that makes rotations infeasible. Others, like range trees, have auxiliary metadata in each node that's expensive to update after a rotation. But if you can handle rotations, then you can use just about any of the balancing strategies out there. The simplest option might be to model your tree on a treap: put a randomly-chosen weight field into each node, and then, during insertions, rotate your newly-added leaf up until its weight is less than its parent's. To delete, repeatedly rotate the node with its heavier child (so the heavier-on-top ordering is preserved) until it's a leaf, then delete it.
If you cannot support rotations, you'll need a rebalancing strategy that does not require them. Perhaps the easiest option there is to model your tree after a scapegoat tree, which works by lazily detecting a node that's too deep for the tree to be balanced, then rebuilding the smallest imbalanced subtree possible into a perfectly-balanced tree to get everything back into order. Deletions are handled by rebuilding the whole tree once the number of nodes drops by some constant factor.
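The rebuilding step at the heart of that approach can be sketched as follows: flatten the subtree into a list in left-to-right order, then rebuild it from the middle outward. This only shows the rebuild itself, not the logic for detecting which subtree to rebuild, and the names Node, flatten, build, rebuild, and height are chosen for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None


def flatten(node, out):
    """Append the subtree's nodes to `out` in left-to-right (in-order) order."""
    if node is not None:
        flatten(node.left, out)
        out.append(node)
        flatten(node.right, out)


def build(nodes, lo, hi):
    """Relink nodes[lo:hi] into a balanced subtree and return its root."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    root = nodes[mid]
    root.left = build(nodes, lo, mid)
    root.right = build(nodes, mid + 1, hi)
    return root


def rebuild(node):
    """Rebuild the subtree rooted at `node`, preserving left-to-right order."""
    nodes = []
    flatten(node, nodes)
    return build(nodes, 0, len(nodes))


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


# A degenerate right-leaning chain of 7 nodes.
chain = Node(0)
tail = chain
for k in range(1, 7):
    tail.right = Node(k)
    tail = tail.right
print(height(chain), height(rebuild(chain)))
```

No rotations are involved: the nodes are simply relinked, so any per-level metadata can be recomputed during the rebuild.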

Split 2-3 tree into less-than and greater-than given value X

I need to write a function that receives a key x and splits a 2-3 tree into two 2-3 trees. The first tree contains all nodes that are bigger than x, and the second contains all nodes that are smaller. I need to do this in O(log n) time. Thanks in advance for any ideas.
edited
I thought about finding key x in the tree. And after split its two sub-trees(bigger or lesser if they exist) into 2 trees, and after begin to go up and every time to check sub-trees which I've not checked yet and to join to one of the trees. My problem is that all leaves must be at the same level.
Move from the root down to your key, and split each node on the way so that one half points at the nodes larger than the key and the other half at the rest. Make the larger half a part of your larger tree, say by having the leftmost node one level higher point at it. Don't fix the tree yet; do that at the end. Once you reach the key, you will have your two trees. Then you just need to fix both trees along the path you used (note that the same path exists in both trees).
Assuming you have covered 2-3-4 trees in the lecture already, here is a hint: see whether you can apply the same insertion algorithm for 2-3 trees also. In particular, make insertions always start in the leaf, and then restructure the tree appropriately. When done, determine the complexity of the algorithm you got.

Is there an efficient way of determining whether a leaf node is reachable from another arbitrary node in a Directed Acyclic Graph?

Wikipedia: Directed Acyclic Graph
Not sure if leaf node is still proper terminology since it's not really a tree (each node can have multiple children and also multiple parents), and I'm actually trying to find all the root nodes (which is really just a matter of semantics: if you reverse the direction of all the edges, they'd be leaf nodes).
Right now we're just traversing the entire graph (that's reachable from the specified node), but that's turning out to be somewhat expensive, so I'm wondering if there's a better algorithm for doing this. One thing I'm thinking is that we keep track of nodes that have been visited already (while traversing a different path) and don't recheck those.
Are there any other algorithmic optimizations?
We also thought about keeping a list of root nodes that this node is a descendant of, but it seems like maintaining such a list would be fairly expensive as well if we need to check if it changes every time a node is added, moved, or removed.
Edit:
This is more than just finding a single node, but rather finding ALL nodes that are endpoints.
Also there is no master list of nodes. Each node has a list of its children and its parents. (Well, that's not completely true, but pulling millions of nodes from the DB ahead of time is prohibitively expensive and would likely cause an OutOfMemory exception.)
Edit2:
May or may not change possible solutions, but the graph is bottom-heavy in that there's at most a few dozen root nodes (what I'm trying to find) and some millions (possibly tens or hundreds of millions) leaf nodes (where I'm starting from).
There are a few methods that each may be faster depending on your structure, but in general what you're going to want is a traversal.
A depth first search goes through each possible route, keeping track of nodes that have already been visited. It's a recursive function, because at each node you have to branch and try each of its child nodes. There's no faster method if you don't know which way to look for the object; you just have to try each way! You definitely need to keep track of where you have already been, because it would be wasteful otherwise. A full traversal should take time on the order of the number of nodes plus the number of edges.
A breadth first search is similar but visits each child of the node before "moving on" and as such builds up layers of distance from the chosen root. This can be faster if the destination is expected to be close to the root node. It would be slower if it is expected to be all the way down a path, because it forces you to traverse every possible edge.
You're right about maybe keeping a list of known root nodes. The tradeoff there is that you basically have to do the search whenever you alter the graph. If you are altering the graph rarely this is acceptable, but if you alter the graph more frequently than you need to generate this information, then of course it is too costly.
EDIT: Info Update.
It sounds like we are actually looking for a path between two arbitrary nodes; the root/leaf semantic keeps getting switched. The depth first search (DFS) starts at one node and then recurses into each unvisited child, breaking off if it finds the target node. Due to the way recursion evaluates, this will traverse all the way down the 'left' path, then enumerate nodes at this distance before ever getting to the 'right' path. This is time costly and inefficient if the target node is potentially the first child on the right. Breadth first walks in steps, covering all children before moving forward. Because your graph is bottom heavy like a tree, both will have approximately the same execution time.
When the graph is bottom heavy you might be interested in a reverse traversal. Start at the target node and walk upwards, because there are relatively fewer nodes in this direction. So long as the nodes in general have fewer parents than children, this direction will be much faster. You can also combine the approaches, stepping one up and one down, then comparing lists of nodes, and meeting somewhere in the middle. (This combination might seem the fastest if you ignore that twice as much work is done at each step.)
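A reverse traversal of this kind could be sketched as follows. This assumes there is some way to look up a node's parents; `parents_of` is a hypothetical callable standing in for that lookup (for example a database query), and `find_roots` is a name made up for the example.

```python
def find_roots(start, parents_of):
    """Walk upward from start and return every ancestor that has no parents."""
    roots = set()
    visited = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        parents = parents_of(node)  # hypothetical lookup (assumption)
        if not parents:
            roots.add(node)
        for p in parents:
            if p not in visited:  # never expand the same node twice
                visited.add(p)
                stack.append(p)
    return roots
```

Only the ancestors of the starting node are ever loaded, so the cost depends on the size of the upper part of the graph, not on the millions of leaves.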
However, since you said that your graph is stored as a list of lists of children, you have no real way of traversing the graph backwards. A node does not know what its parents are. This is a problem. To fix it you have to get a node to know what its parents are by adding that data on graph update, or by creating a duplicate of the whole structure (which you have said is too large). It will need the whole structure to be rewritten, which sounds probably out of the question due to it being a large database at this point.
There's a lot of work to do.
http://en.wikipedia.org/wiki/Graph_(data_structure)
Just color (keep track of) visited nodes.
Sample in Python:
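def reachable(nodes, edges, start, end):
    # color[n] becomes True once node n has been visited
    color = {}
    for n in nodes:
        color[n] = False
    q = [start]
    while q:
        n = q.pop()
        if color[n]:
            continue
        color[n] = True
        # edges maps a node to its adjacent nodes; nodes without
        # outgoing edges may be missing from it
        for adj in edges.get(n, ()):
            q.append(adj)
    return color[end]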
For a vertex x you want to compute a bit array f(x), where each bit corresponds to a root vertex Ri, and 1 (resp. 0) means "x can (resp. can't) be reached from root vertex Ri".
You could partition the graph into one "upper" set U containing all your target roots R and such that if x in U then all parents of x are in U. For example the set of all vertices at distance <=D from the closest Ri.
Keep U not too big, and precompute f for each vertex x of U.
Then, for a query vertex y: if y is in U, you already have the result. Otherwise recursively perform the query for all parents of y, caching the value f(x) for each visited vertex x (in a map for example), so you won't compute a value twice. The value of f(y) is the bitwise OR of the value of its parents.
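The query step above might be sketched like this, using a Python integer as the bit array. `make_query`, `parents_of`, and `precomputed` are illustrative names, not part of any library: `parents_of` is assumed to return the parents of a vertex, and `precomputed` is assumed to map every vertex of U (including the roots themselves) to its value of f.

```python
def make_query(parents_of, precomputed):
    """Return a function f(y) giving the bitmask of roots that reach y."""
    cache = dict(precomputed)  # start from the values known for U

    def f(y):
        if y in cache:
            return cache[y]
        mask = 0
        for p in parents_of(y):
            mask |= f(p)  # bitwise OR of the parents' values
        cache[y] = mask  # so no value is computed twice
        return mask

    return f
```

For very deep graphs the recursion could exceed Python's default recursion limit, in which case an explicit stack would be the safer choice.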

Resources