Deleting a node in a BK Tree - algorithm

I have seen many different implementations of BK Trees in many different languages, and literally none of them seems to include a way to remove nodes from the tree.
Even the original article that introduced BK Trees does not provide any meaningful insight into node deletion; the authors merely suggest marking the node to be deleted so that it is ignored:
The deletion of a key in Structures 1 [the BK Tree] and 2 follows a process similar to that above, with special consideration for the case in which the key to be deleted is the representative x° [root key]. In this case, the key cannot simply be deleted, as it is essential for the structure information. Instead an extra bit must be used for each key which denotes whether the key actually corresponds to a record or not. The search algorithm is modified correspondingly to ignore keys which do not correspond to records. This involves testing the extra bit in the Update procedure.
While it may be theoretically possible to properly delete a node in a BK Tree, is it possible to do so in linear/sublinear time?

While it may be theoretically possible to properly delete a node in a BK Tree, is it possible to do so in linear/sublinear time?
If you want to physically remove it from a BK-tree, then I can't think of a way to do this in linear time for all cases. Consider two scenarios in which a node is removed. Please note that I do not account for the time spent computing the Levenshtein distance, because that cost does not depend on the number of words, although it does require some processing time too.
Remove non-root node
Find a parent of the node in the tree.
Save node's child nodes.
Nullify parent's reference to the node.
Re-add each child node as if it were a new node.
Here, even if step 1 can be done in O(1), steps 2 and 4 are far more expensive. Inserting a single node is O(h), where h is the height of the tree. To make matters worse, this has to be done for each child node of the original node, so the whole removal is O(k*h), where k is the number of child nodes. A rough sketch of these steps follows.
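To make the steps above concrete, here is a minimal Python sketch of non-root removal. It assumes a hypothetical BKTree object with a root attribute, and nodes carrying word plus a children dict keyed by edge distance; distance is whatever metric the tree uses (e.g. Levenshtein). It is an illustration of the idea, not a drop-in implementation.

def reattach(tree, subtree_root, distance):
    # Hang a whole detached subtree back onto the tree, descending exactly
    # like a normal insert until a free edge slot is found (O(h) per subtree).
    if tree.root is None:
        tree.root = subtree_root
        return
    node = tree.root
    while True:
        d = distance(subtree_root.word, node.word)
        if d in node.children:
            node = node.children[d]        # slot taken: keep descending
        else:
            node.children[d] = subtree_root
            return

def remove(tree, word, distance):
    # Step 1: find the node holding `word` and remember its parent.
    parent, node = None, tree.root
    while node is not None and node.word != word:
        parent, node = node, node.children.get(distance(word, node.word))
    if node is None:
        return False                        # word is not in the tree
    # Step 2: save the node's child subtrees.
    orphans = list(node.children.values())
    # Step 3: nullify the parent's reference to the node.
    if parent is None:
        tree.root = None                    # removing the root: see the next case
    else:
        del parent.children[distance(word, parent.word)]
    # Step 4: re-add each child subtree as if it were a new node.
    for orphan in orphans:
        reattach(tree, orphan, distance)
    return True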
Remove root node
Rebuild the tree from scratch without using the previous root node.
Rebuilding the tree takes at least O(n) even in the best case, and O(h*n) otherwise.
Alternative solution
That's why it's better not to delete a node physically, but to keep it in the tree and just mark it as deleted. This way it is still used, as before, when inserting new nodes, but is excluded from the suggestion results for a misspelled word. The marking itself can be done in O(1).
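A minimal sketch of this soft-delete alternative, under the same hypothetical node layout as above plus a per-node deleted flag (all names here are assumptions of the sketch):

def mark_deleted(tree, word, distance):
    # Find the node as usual and flip a flag; the node stays in place so the
    # structural information it carries is preserved.
    node = tree.root
    while node is not None and node.word != word:
        node = node.children.get(distance(word, node.word))
    if node is not None:
        node.deleted = True

def search(tree, query, tolerance, distance):
    # Standard BK-tree lookup; tombstoned nodes are still traversed
    # (they keep the tree connected) but excluded from the results.
    results, stack = [], [tree.root] if tree.root else []
    while stack:
        node = stack.pop()
        d = distance(query, node.word)
        if d <= tolerance and not node.deleted:
            results.append(node.word)
        for edge, child in node.children.items():
            if d - tolerance <= edge <= d + tolerance:   # triangle-inequality pruning
                stack.append(child)
    return results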

Related

B-Tree deletion in a single pass

Is it possible to remove an element from a B-Tree in a single pass?
Wikipedia says "Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring"
but doesn't say anything about how it is done.
Google only gives me the process of removing an element and then having to restructure the tree.
Cormen also doesn't say anything about it.
It's possible in a variant of the B+ tree called the PO-B+ tree. In this "preparatory operations B+ tree" the number of keys in a node may be between n-1 and 2n+1, rather than between n and 2n as in the usual B+ tree (quoting the paper). For the delete operation (called PO-delete in the paper) you simply merge (the paper says "catenate") every node (except the root) that can be merged, or take a key from a neighbor, while moving toward the leaf. For the PO-insert operation you split the nodes (including the root) on the way down. The description is given in the paper.
This preemptive restructuring only makes sense if the tree is used in a multi-threaded environment, as it reduces locking and increases concurrency. It does not pay off if the tree is accessed by only one actor.
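For the classic B-tree case that the Wikipedia quote refers to, the usual trick is to make sure, before descending into a child, that the child can afford to lose a key. Below is a rough Python sketch of that "fix before descending" step (CLRS-style, minimum degree T); the Node layout with keys, children and leaf is an assumption of this sketch, not code from the paper.

T = 3  # minimum degree: every non-root node keeps at least T-1 keys

def ensure_child_is_safe(parent, i):
    # Before descending into parent.children[i], make sure that child has at
    # least T keys, by borrowing from a sibling or merging with one. This is
    # what guarantees the deletion below never has to walk back up the tree.
    child = parent.children[i]
    if len(child.keys) >= T:
        return child                                  # already safe

    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None

    if left is not None and len(left.keys) >= T:      # borrow through the parent
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if not left.leaf:
            child.children.insert(0, left.children.pop())
        return child

    if right is not None and len(right.keys) >= T:    # borrow from the other side
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if not right.leaf:
            child.children.append(right.children.pop(0))
        return child

    # Neither sibling can lend a key: merge the child, a sibling, and the
    # separator key between them into a single node.
    if right is not None:
        child.keys.append(parent.keys.pop(i))
        child.keys.extend(right.keys)
        child.children.extend(right.children)
        parent.children.pop(i + 1)
        return child
    left.keys.append(parent.keys.pop(i - 1))
    left.keys.extend(child.keys)
    left.children.extend(child.children)
    parent.children.pop(i)
    return left

A single-pass delete would call this at every level while walking down toward the key, which is also how the root can lose its last key and the height can shrink by one without any upward pass.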

A B-Tree - How do I know when the depth of it changes with removal of keys

Let k and 2k be keys in a B-tree, B.
Assume that the depth of B is reduced if the key k is deleted.
Is it necessarily the case that if we delete the key 2k instead, the depth of B will be reduced as well?
I'm having a hard time visualizing and solving this; can someone please show me how I should think about it?
Assuming a classical B-tree that constrains the number of keys per node: a precondition for a node merge involving the root - and thus a reduction in height - is that the root have exactly one key and exactly two children, both of which have exactly the minimum allowable number of keys. The state of affairs at deeper levels of the tree does not matter.
If the top of the tree looks as described, then the height reduction can be triggered by deleting the root's only key or one of the keys in its two children. That deletion could be direct, or it could be the result of a change rippling up after a deletion deeper in the tree.
In any case, there are many configurations in which deleting key 2k will not trigger a height reduction under the same circumstances: the key may reside in a 'safe' node (one with more than the minimum number of keys) or have a 'safe' parent, there may be 'safe' siblings somewhere along the path so that borrowing becomes possible, and so on.
Visualisation resources on the web are discussed in another topic here:
Are there any B-tree programs or sites that show visually how a B-tree works

HRW rendezvous hashing in log time?

The Wikipedia page for Rendezvous hashing (Highest Random Weight "HRW") makes the following claim:
While it might first appear that the HRW algorithm runs in O(n) time, this is not the case. The sites can be organized hierarchically, and HRW applied at each level as one descends the hierarchy, leading to O(log n) running time, as in.[7]
I got a copy of the referenced paper, "Hash-Based Virtual Hierarchies for Scalable Location Service in Mobile Ad-hoc Networks." However the hierarchy referenced in their paper seems to be very specific to their application domain. As far as I can discern, there is no clear indication of how to generalize the method. The Wikipedia remark makes it seem like log is the general case.
I looked at a few general HRW implementations, and none of them seemed to support anything better than linear time. I gave it some thought, but I don't see any way to organize sites hierarchically without parent nodes causing inefficient remapping when they drop out, which would largely defeat the main advantage of HRW.
Does anybody know how to do this? Alternatively, is Wikipedia incorrect about there being a general way to implement this in log time?
Edit: Investigating mcdowella's approach:
OK, I think I see how this could work. But you need a little more than you've specified.
If you just do what you've described, you end up in a situation where each leaf probably has either zero or one nodes in it, and there's significant variance in how many nodes end up in the leaf-most subtrees. If you swap the per-level HRW for an ordinary search tree, you get exactly the same effect: essentially an implementation of consistent hashing, along with its flaw of unequal loading between buckets. Computing the combined weights, the defining feature of HRW, adds nothing; you're better off just doing a search at each level, since it saves doing the hashes and can be implemented without looping over each radix value.
It's fixable though: you just need to be using HRW to choose from many alternatives at the final level. That is, you need all of the leaf nodes to be in large buckets, comparable to the number of replicas you'd have in consistent hashing. These large buckets should be approximately equally-loaded compared to each other, and then you're using HRW to choose the specific site. Since the bucket sizes are fixed, this is an O(n) algorithm, and we get all of the key HRW properties.
Honestly though, I think this is pretty questionable. It isn't so much an implementation of HRW, as it is just combining HRW with consistent hashing. I guess there's nothing wrong with that, and it might even be better than the usual technique of using replicas, in some cases. But I think it's misleading to state that HRW is log(n), if this is actually what the author meant.
Additionally, the original description is also questionable. You don't need to apply HRW at each level, and you shouldn't, as there is no advantage in doing so; you should do something fast (such as indexing), and just use HRW for the final choice.
Is this really the best we can do, or is there some other way to make HRW O(log(n))?
If you give each site a sufficiently long random id expressed in radix k (perhaps by hashing a non-random id), then you can associate the sites with the leaves of a tree that has at most k children at each node. There is no need to associate any site with an internal node of the tree.
To work out where to store an item, use HRW at each tree node, starting from the root, to decide which way to branch, stopping when you reach a leaf, which is associated with a site. You can do this without communicating with any site until you have worked out which site you want to store the item at - all you need to know is the hashed ids of the sites to construct the tree.
Because sites are associated only with leaves there is no way an internal node of the tree can drop out, except if all of the sites associated with leaves under it drop out, at which point it will become irrelevant.
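A small Python sketch of this idea, with hypothetical names: site ids are hashed to fixed-length hex strings, arranged into an implicit radix tree by their digits, and HRW picks the branch at each node. For brevity the trie is built on the fly per call here; a real implementation would build it once and reuse it, so that each lookup costs O(k log_k n) = O(log n) for constant radix k.

import hashlib

def hrw_weight(item: str, label: str) -> int:
    # Hypothetical HRW weight: hash of the (item, branch label) pair.
    return int.from_bytes(hashlib.sha256(f"{item}|{label}".encode()).digest(), "big")

def site_hash(site_id: str) -> str:
    # A sufficiently long pseudo-random id in radix 16, as the answer suggests.
    return hashlib.sha256(site_id.encode()).hexdigest()

def build_trie(hashed_ids, depth=0):
    # Leaves hold one hashed site id; internal nodes map a hex digit to a subtree.
    if len(hashed_ids) == 1:
        return hashed_ids[0]
    groups = {}
    for hid in hashed_ids:
        groups.setdefault(hid[depth], []).append(hid)
    return {digit: build_trie(ids, depth + 1) for digit, ids in groups.items()}

def choose_site(item, node, prefix=""):
    # Descend from the root, using HRW at each node to decide which way to branch.
    if isinstance(node, str):
        return node                                   # reached a leaf: this site wins
    digit = max(node, key=lambda d: hrw_weight(item, prefix + d))
    return choose_site(item, node[digit], prefix + digit)

sites = ["site-a", "site-b", "site-c", "site-d"]
by_hash = {site_hash(s): s for s in sites}
trie = build_trie(list(by_hash))
print(by_hash[choose_site("object-123", trie)])       # the chosen site's original name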
I don't buy the updated answer. There are two nice properties of HRW that appear to get lost when you compare the weights of branches instead of all sites.
One is that you can pick the top-n sites instead of just the primary, and these should be randomly distributed. If you're descending into a single tree, the top-n sites will be near each other in the tree. This could be fixed by descending multiple times with different salts, but that seems like a lot of extra work.
Two is that it is obvious what happens when a site is added or removed, and only 1/|sites| of the data moves in the case of an add. If you modify the existing tree, a change only affects the peer site: in the case of an add, the only data that moves comes from the new peer of the added site; in the case of a delete, all the data that was at that site moves to its former peer. If you instead recompute the tree, all of the data could move, depending on how the tree is constructed.
I think you can use the same "virtual node" approach normally used for consistent hashing. Suppose you have N physical nodes with IDs:
{n1,...,nN}.
Choose V, the number of virtual nodes per physical node, and generate a new list of IDs:
{n1v1, n1v2, ..., n1vV,
 n2v1, n2v2, ..., n2vV,
 ...,
 nNv1, nNv2, ..., nNvV}.
Arrange these into the leaves of a fixed but randomized binary tree with labels on the internal nodes. These internal labels could be, for example, a concatenation of the labels of its child nodes.
To choose a physical node to store an object O at, start at the root and choose the branch with the higher hash H(label,O). Repeat the process until you reach a leaf. Store the object at the physical node corresponding to the virtual node at that leaf. This takes O(log(NV)) = O(log(N)+log(V)) = O(log(N)) steps (since V is constant).
If a physical node fails, the objects at that node are rehashed, skipping over subtrees with no active leaves.
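A possible Python sketch of this scheme. The n{i}v{j} naming, the label-concatenation rule and the hash comparison H(label, O) come from the answer; the specific tree-building code and hash function are assumptions of the sketch.

import hashlib
import random

def H(label: str, obj: str) -> int:
    # Hash used to compare the two branches for a given object.
    return int.from_bytes(hashlib.sha256(f"{label}#{obj}".encode()).digest(), "big")

class Tree:
    # Either a leaf holding one virtual-node id, or an internal node whose
    # label is the concatenation of its children's labels.
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def build_tree(virtual_ids, seed=0):
    # Arrange the virtual nodes into a fixed but randomized binary tree.
    level = [Tree(v) for v in virtual_ids]
    random.Random(seed).shuffle(level)
    while len(level) > 1:
        paired = [Tree(a.label + b.label, a, b)
                  for a, b in zip(level[0::2], level[1::2])]
        if len(level) % 2:
            paired.append(level[-1])          # odd leftover is promoted unchanged
        level = paired
    return level[0]

def choose_physical_node(root, obj):
    # Start at the root and repeatedly take the branch with the higher hash.
    node = root
    while node.left is not None:
        node = node.left if H(node.left.label, obj) >= H(node.right.label, obj) else node.right
    return node.label.split("v")[0]           # leaf "n3v7" -> physical node "n3"

# N = 4 physical nodes, V = 8 virtual nodes each, named as in the answer.
virtual_ids = [f"n{i}v{j}" for i in range(1, 5) for j in range(1, 9)]
root = build_tree(virtual_ids)
print(choose_physical_node(root, "some-object-key"))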
One way to implement HRW rendezvous hashing in log time
One way to implement rendezvous hashing in O(log N), where N is the number of cache nodes:
Each file named F is cached in the cache node named C with the largest weight w(F,C), as is normal in rendezvous hashing.
First, we use a nonstandard hash function w() something like this:
w(F,C) = h(F) xor h(C),
where h() is some good hash function.
tree construction
Given some file named F, rather than calculate w(F,C) for every cache node -- which requires O(N) time for each file -- we pre-calculate a binary tree based only on the hashed names h(C) of the cache nodes: a tree that lets us find the cache node with the maximum w(F,C) value in O(log N) time for each file.
Each leaf of the tree contains the name C of one cache node. The root (at depth 0) of the tree points to 2 subtrees: all the leaves where the most significant bit of h(C) is 0 are in the root's left subtree; all the leaves where the most significant bit of h(C) is 1 are in the root's right subtree. The two children of the root node (at depth 1) deal with the next-most-significant bit of h(C), and so on, with the interior nodes at depth D dealing with the D'th-most-significant bit of h(C). With a good hash function, each step down from the root approximately halves the candidate cache nodes in the chosen subtree, so we end up with a tree of depth roughly log2(N).
(If we end up with a tree that is "too unbalanced", we somehow get everyone to agree on some different hash function from some universal hashing family and rebuild the tree -- before we add any files to the cache -- repeating until we get a tree that is "not too unbalanced".)
Once the tree has been built, we never need to change it, no matter how many file names F we later encounter. We only change it when we add or remove cache nodes from the system.
filename lookup
For a filename F that happens to hash to h(F) = 0 (all zero bits), we find the cache node with the highest weight (for that filename) by starting at the root and always taking the right subtree when possible. If that leads us to an interior node that doesn't have a right subtree, then we take its left subtree. Continue until we reach a node without a left or right subtree -- i.e., a leaf node that contains the name of the selected cache node C.
When looking up some other file named F, first we hash its name to get h(F); then we start at the root and go right or left, respectively (if possible), depending on whether the next bit of h(F) is 0 or 1. Since the tree (by construction) is not "too unbalanced", traversing it from the root to the leaf that contains the name of the chosen cache node C requires O(log N) time in the worst case. We expect that for a typical set of file names, the hash function h(F) "randomly" chooses left or right at each depth of the tree, and since the tree (by construction) is not "too unbalanced", we expect each physical cache node to cache roughly the same number of files (within a multiple of 4 or so).
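Here is a rough Python sketch of both the tree construction and the lookup described above. The concrete hash h(), the bit width, and the tuple-based tree representation are assumptions of the sketch, and hash collisions between cache-node names are ignored.

import hashlib

BITS = 32  # how many bits of h() the tree uses (an arbitrary choice for this sketch)

def h(name: str) -> int:
    # "Some good hash function", truncated to BITS bits.
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (1 << BITS)

def bit(x: int, depth: int) -> int:
    # The depth'th most-significant of x's BITS bits.
    return (x >> (BITS - 1 - depth)) & 1

def build(cache_nodes, depth=0):
    # Pre-compute the binary tree from the hashed cache-node names h(C).
    # An internal node is a (left, right) pair; a leaf is just the name C.
    if len(cache_nodes) == 1:
        return cache_nodes[0]
    zeros = [c for c in cache_nodes if bit(h(c), depth) == 0]
    ones = [c for c in cache_nodes if bit(h(c), depth) == 1]
    return (build(zeros, depth + 1) if zeros else None,
            build(ones, depth + 1) if ones else None)

def lookup(tree, filename):
    # Find the cache node C maximizing w(F,C) = h(F) xor h(C): when the next
    # bit of h(F) is 0, prefer the right ("1") subtree, otherwise the left
    # ("0") subtree; fall back to the other side if the preferred subtree is
    # missing (e.g. because its cache nodes dropped out of the system).
    hf, depth = h(filename), 0
    while isinstance(tree, tuple):
        left, right = tree
        preferred, fallback = (right, left) if bit(hf, depth) == 0 else (left, right)
        tree = preferred if preferred is not None else fallback
        depth += 1
    return tree

tree = build(["cacheV", "cacheW", "cacheX", "cacheY", "cacheZ"])
print(lookup(tree, "some/file/name.txt"))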
drop out effects
When some physical cache node fails, everyone deletes the corresponding leaf node from their copy of this tree. (Everyone also deletes every interior node that then has no leaf descendants.) This doesn't require moving around any files cached on any other cache node -- they still map to the same cache node they always did. (The right-most leaf node in a tree is still the right-most leaf node in that tree, no matter how many other nodes in that tree are deleted.)
For example,
  ....
      \
       |
      / \
     |   |
    /   / \
   |   X   |
  / \     / \
 V   W   Y   Z
With this O(log N) algorithm, when cache node X dies, leaf X is deleted from the tree, and all its files become (hopefully relatively evenly) distributed between Y and Z -- none of the files from X end up at V or W or any other cache node.
All the files that previously went to cache nodes V, W, Y, Z continue to go to those same cache nodes.
rebalancing after dropout
Many cache nodes failing, or many new cache nodes being added, or both, may make the tree "too unbalanced".
Picking a new hash function is a big hassle after we've added a bunch of files to the cache, so rather than picking a new hash function like we did when initially constructing the tree, maybe it would be better to somehow rebalance the tree by removing a few cache nodes, renaming them with some new semi-random names, and then adding them back to the system.
Repeat until the system is no longer "too unbalanced".
(Start with the most unbalanced nodes -- the nodes caching the least amount of data.)
comments
p.s.:
I think this may be pretty close to what mcdowella was thinking,
but with more details filled in to clarify that (a) yes, it is log(N) because it's a binary tree that is "not too unbalanced", (b) it doesn't have "replicas", and (c) when one cache node fails, it doesn't require any remapping of files that were not on that cache node.
p.p.s.:
I'm pretty sure that Wikipedia page is wrong to imply that typical implementations of rendezvous hashing occur in O(log N) time, where N is the number of cache nodes.
It seems to me (and, I suspect, to the original designers of the hash as well) that the time it takes to (internally, without communicating) recalculate a hash against every node in the network is going to be insignificant and not worth worrying about, compared to the time it takes to fetch data from some remote cache node.
My understanding is that rendezvous hashing is almost always implemented with a simple linear algorithm that uses O(N) time, where N is the number of cache nodes, every time we get a new filename F and want to choose the cache node for that file.
Such a linear algorithm has the advantage that it can use a "better" hash function than the above xor-based w(), so when some physical cache node dies, all the files that were cached on the now-dead node are expected to become evenly distributed among all the remaining nodes.
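For comparison, that usual linear version is just a maximum over all cache nodes. A tiny sketch (the weight function here is an arbitrary stand-in for whatever "better" hash one prefers):

import hashlib

def weight(filename: str, cache_node: str) -> int:
    # Any good hash of the (filename, cache node) pair works as the HRW weight.
    return int.from_bytes(hashlib.sha256(f"{filename}+{cache_node}".encode()).digest(), "big")

def choose_cache_node(filename, cache_nodes):
    # Plain O(N) rendezvous hashing: score every cache node, keep the winner.
    return max(cache_nodes, key=lambda c: weight(filename, c))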

Deletion from B tree rule

Well, I'm studying for a test and I'm a little bit confused by the following.
The following image is a B-tree with t=3 so each node can have at most 2t-1 keys and at least t-1 keys.
I'm being asked to delete key=3.
I can't understand why I need to join the root with its sons in this case. I know the delete algorithm is defensive: it starts at the root and fixes each node on the way down so that it never needs to go back to an ancestor.
But which rule would be broken if I didn't join the root with its sons?
[Image: the original B-tree]
[Image: the B-tree after deleting key 3]
As for me, I would just delete key 3 and be done with it.
It would not break any of the rules; the algorithm simply performs every possible node merge while looking up the given key. This is necessary to ensure that there is no need to traverse the tree upwards after the deletion.
Also, the height of the tree is reduced, which will speed up later lookups.
So this behaviour is an algorithmic decision to implement the B-tree efficiently.

Fair deletion of nodes in Binary Search Tree

The idea of deleting a node in a BST is:
If the node has no child, delete it and set its parent's pointer to it to null.
If the node has one child, replace the node with its child by updating the node's parent's pointer.
If the node has two children, find the node's predecessor and replace the node with it, also updating the predecessor's parent's pointer so that it points to the predecessor's only possible child (which can only be a left child).
The last case can also be done using the successor instead of the predecessor!
It's said that if we use the predecessor in some cases and the successor in others (giving them equal priority), we can get better empirical performance.
Now the question is: how is that done? Based on what strategy? And how does it affect performance? (I guess by performance they mean time complexity.)
What I think is that we have to choose the predecessor or the successor so as to keep the tree more balanced, but I don't know how to choose which one to use!
One solution is to choose one of them at random (fair randomness), but isn't it better to base the strategy on the tree structure? The question is WHEN to choose WHICH.
The thing is that this is a fundamental problem: finding the correct removal algorithm for a BST. People have been trying to solve it for 50 years (just like in-place merge) and haven't found anything better than the usual algorithm (removal via predecessor/successor). So, what is wrong with the classic algorithm? Actually, this kind of removal unbalances the tree: after many random add/remove operations you end up with an unbalanced tree of height around sqrt(n), and it doesn't matter which variant you chose (remove the successor, the predecessor, or choose randomly between the two); the result is the same.
So, what should you choose? I'm guessing that a random choice (successor or predecessor) will postpone the unbalancing of your tree. But if you want a perfectly balanced tree, you have to use red-black trees or something like that.
As you said, it's a question of balance, so in general the method that disturbs the balance the least is preferable. You can keep some metrics to measure the level of balance (e.g., the difference between the maximal and minimal leaf heights, the average height, etc.), but I'm not sure the overhead is worth it. Also, there are self-balancing data structures (red-black trees, AVL trees, etc.) that mitigate this problem by rebalancing after each deletion. If you want to use a basic BST, I suppose the best strategy, without a priori knowledge of the tree structure and the deletion sequence, would be to toggle between the two methods on each deletion.
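As an illustration of the toggle idea, here is a small Python sketch of BST deletion that alternates between the in-order predecessor and successor for the two-child case. The node layout and the module-level flag are assumptions of this sketch, not a standard API.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

use_predecessor = True  # flipped on every two-child deletion

def delete(root, key):
    # Delete `key` from the subtree rooted at `root`; returns the new subtree root.
    global use_predecessor
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None:                 # zero or one child: splice the node out
            return root.right
        if root.right is None:
            return root.left
        take_pred = use_predecessor           # two children: alternate pred/succ
        use_predecessor = not use_predecessor
        if take_pred:
            repl = root.left
            while repl.right is not None:     # in-order predecessor: max of left subtree
                repl = repl.right
            root.key = repl.key
            root.left = delete(root.left, repl.key)
        else:
            repl = root.right
            while repl.left is not None:      # in-order successor: min of right subtree
                repl = repl.left
            root.key = repl.key
            root.right = delete(root.right, repl.key)
    return root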

Resources