As a student, I've been trying to implement a B+ tree in C myself. Insertion works fine, but deletion is holding me up. One of my questions is this:
Is it OK to leave a key in an internal node after the corresponding key in a leaf node has been deleted?
This may happen when the internal node is not the leaf's parent.
Is my description clear enough? Does anyone have similar experience?
The question you should ask yourself when working with a data structure is, "What are the invariants?" For a B+ tree, some of the invariants are:
Records are stored in the leaf nodes,
Leaf nodes must be at least half full.
So if you decide that B+ trees allow you to keep keys that no longer correspond to records, that's fine. Just make sure that your insertion and search algorithms still work given your particular set of invariants.
In general, it is somewhat bizarre to encounter a key in any kind of tree that does not correspond to one of the records. I'd also expect the cost of correcting it in a B+ tree with large fanout to be fairly small.
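To make this concrete, here is a minimal sketch (in Python, with a hypothetical node layout) of a lookup that uses separator keys only for routing, so a stale separator left behind by a leaf deletion is harmless:

```python
import bisect

# Hypothetical layout: internal nodes hold sorted separator keys plus child
# pointers; leaves hold sorted (key, record) pairs.
class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys (may be stale)
        self.children = children  # len(children) == len(keys) + 1

class Leaf:
    def __init__(self, keys, records):
        self.keys = keys
        self.records = records

def search(node, key):
    while isinstance(node, Internal):
        # Separators only route the search; one equal to `key` proves nothing.
        node = node.children[bisect.bisect_right(node.keys, key)]
    # Only the leaf level decides whether the record actually exists.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]
    return None
```

As long as insertion keeps the separators ordering the leaves correctly, this search works whether or not every separator still has a matching record.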
Given a B+ tree with a branching factor of g and s keys contained, what would be the maximum number of duplicates allowed for a single key (let's call it s_i)? And how do we calculate that number?
My first idea was that each level can hold one instance of s_i, so we should maximise the depth and take that as our answer; however, I'm not sure about this.
I have searched online, but it seems no one has asked this question before. This is my first time asking a question here, so any feedback is welcome.
Many thanks.
A proper B-tree cannot contain duplicate 'keys' because the values in question would not uniquely identify their associated data and hence they would not be keys.
The reason is that B-trees do not distinguish between nodes for routing searches (internal nodes) and nodes for storing data (leaf nodes) as some other data structures do. All nodes in a classic B-tree are data-bearing, and the internal nodes just happen to also route search traffic.
B+ trees, on the other hand, store all data in leaf nodes and use the nodes above the leaf level (a.k.a. 'index layer') only for routing. The values in internal nodes - a.k.a. 'separator keys' - only have to guide searches; they do not have to uniquely identify any data records. They do not even have to correspond to any actually existing key values (i.e. ones that have associated data). That often makes it possible to shorten the separator keys drastically, as long as they keep separating the same things. For example, "F" is just as effective at separating "Arthur Dent" from "Zaphod Beeblebrox" as "Ford Prefect" is.
One consequence is that one and the same value could potentially occur at each and every level of the B+ tree without any ill effect, since only the one and only occurrence at the leaf level actually works as a data key; the ones in internal nodes serve only to guide searches.
Another consequence is that internal nodes in B+ trees can usually hold orders of magnitude more keys than internal nodes in B-trees (which are diluted by record data and do not allow shortening of key values). This increases I/O and cache efficiency, and it usually makes B+ trees more shallow than B-trees containing the same data.
As for calculating the height: outside of homework assignments you actually need two node capacity values to describe a B+ tree, one for the maximum number of separator keys in an internal node and one for the maximum number of keys (data records) in a leaf node. Basically, you divide the number of records by the leaf node capacity to determine the number of 'outputs' that the index layer needs to have, and then feed that into the usual B-tree height formula in order to determine the height of the index layer.
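A back-of-the-envelope sketch of that calculation (the capacities below are made-up example numbers, not anything from the question):

```python
import math

def index_height(num_records, leaf_capacity, fanout):
    # 'Outputs' the index layer must route to = number of leaf nodes.
    leaves = math.ceil(num_records / leaf_capacity)
    # Smallest number of internal levels whose fanout covers all leaves.
    height = 0
    while fanout ** height < leaves:
        height += 1
    return height

# e.g. 1,000,000 records, 100 records per leaf, internal fanout 200:
# 10,000 leaves -> 2 internal levels (200**2 = 40,000 >= 10,000).
print(index_height(1_000_000, 100, 200))  # -> 2
```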
I understand the idea behind using Merkle tree to identify inconsistencies in data, as suggested by articles like
Key Concepts: Using Merkle trees to detect inconsistencies in data
Merkle Tree | Brilliant Math & Science Wiki
Essentially, we use a recursive algorithm to traverse down from the root we want to verify, following the nodes whose stored hash values differ from the server's (trusted) hash values, all the way down to the inconsistent leaf/data block.
If only one such block (leaf) is corrupted, we follow a single path down to the leaf, which takes O(log n) queries.
However, in the case of multiple inconsistent data blocks/leaves, we need up to O(n) queries. In the extreme case, all data blocks are corrupted, and our algorithm will need to send every single node to the server (authenticator). In the real world this becomes costly because of the network.
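For reference, here is a toy sketch of that basic traversal (the node encoding and the trusted_hash_of query function are made up for illustration; local internal hashes are assumed to be recomputed from local data):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def find_corrupt_leaves(local, trusted_hash_of, pos=()):
    """local is ('leaf', data) or ('node', hash, left, right);
    trusted_hash_of(pos) asks the server for the trusted hash at pos."""
    if local[0] == 'leaf':
        if h(local[1]) != trusted_hash_of(pos):
            yield pos                      # corrupted data block
        return
    _, node_hash, left, right = local
    if node_hash == trusted_hash_of(pos):
        return                             # whole subtree consistent: prune
    yield from find_corrupt_leaves(left, trusted_hash_of, pos + (0,))
    yield from find_corrupt_leaves(right, trusted_hash_of, pos + (1,))
```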
So my question is: is there any known improvement over the basic traverse-from-root algorithm? A possible improvement I could think of is to query the nodes at the middle level. For example, in the tree below, we send the server the two nodes at the second level ('64' and '192'), and for any node that comes back inconsistent, we recursively go to the middle level of that sub-tree - something like a binary search based on height.
This increases our best case time from O(1) to O(sqrt(n)), and probably reduces our worst case time to some extent (I have not calculated how much).
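Continuing the sketch above, the level-probing idea might look roughly like this (again only a rough sketch, with the same made-up node encoding):

```python
def nodes_at_depth(node, depth, pos=()):
    # Collect (subtree, position) pairs `depth` levels below `node`.
    if depth == 0 or node[0] == 'leaf':
        yield node, pos
        return
    _, _, left, right = node
    yield from nodes_at_depth(left, depth - 1, pos + (0,))
    yield from nodes_at_depth(right, depth - 1, pos + (1,))

def probe_find_corrupt(local, trusted_hash_of, height, pos=()):
    if local[0] == 'leaf':
        if h(local[1]) != trusted_hash_of(pos):
            yield pos
        return
    mid = max(1, height // 2)  # probe the middle level of this subtree
    for sub, sub_pos in nodes_at_depth(local, mid, pos):
        local_hash = h(sub[1]) if sub[0] == 'leaf' else sub[1]
        if local_hash != trusted_hash_of(sub_pos):
            yield from probe_find_corrupt(sub, trusted_hash_of,
                                          height - mid, sub_pos)
```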
I wonder if there's a better approach than this? I've tried to search for relevant articles on Google Scholar, but it looks like most of the algorithm-focused papers are concerned with the Merkle tree traversal problem, which is different from the problem above.
Thanks in advance!
I have seen many different implementations of BK Trees in many different languages, and literally none of them seem to include a way to remove nodes from the tree.
Even the original article where BK Trees were first introduced does not provide a meaningful insight about node deletion, as the authors merely suggest to mark the node to be deleted so that it is ignored:
The deletion of a key in Structures 1 [the BK Tree] and 2 follows a process similar to that above, with special consideration for the case in which the key to be deleted is the representative x° [root key]. In this case, the key cannot simply be deleted, as it is essential for the structure information. Instead an extra bit must be used for each key which denotes whether the key actually corresponds to a record or not. The search algorithm is modified correspondingly to ignore keys which do not correspond to records. This involves testing the extra bit in the Update procedure.
While it may be theoretically possible to properly delete a node in a BK Tree, is it possible to do so in linear/sublinear time?
While it may be theoretically possible to properly delete a node in a BK Tree, is it possible to do so in linear/sublinear time?
If you want to physically remove a node from a BK-tree, then I can't think of a way to do it in linear time in all cases. Consider two scenarios in which a node is removed. Note that I do not account for the time complexity of calculating the Levenshtein distance, because that operation doesn't depend on the number of words, although it requires some processing time too.
Remove non-root node
Find the parent of the node in the tree.
Save the node's child nodes.
Nullify the parent's reference to the node.
Re-add each child node as if it were a new node.
Here, even if step 1 can be done in O(1), steps 2 and 4 are far more expensive. Inserting a single node is O(h), where h is the height of the tree. To make matters worse, this has to be done for each child node of the original node, so the total is O(k*h), where k is the number of child nodes.
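A sketch of that procedure in Python (the Node layout and the per-word re-insertion in step 4 are one possible reading, not the canonical implementation; dist is any metric, e.g. Levenshtein distance):

```python
class Node:
    def __init__(self, word):
        self.word = word
        self.children = {}  # distance -> child Node

def all_words(node):
    yield node.word
    for child in node.children.values():
        yield from all_words(child)

def insert(root, word, dist):
    node = root
    while True:
        d = dist(word, node.word)
        if d == 0:
            return                         # word already present
        child = node.children.get(d)
        if child is None:
            node.children[d] = Node(word)  # one O(h) walk per insert
            return
        node = child

def remove_nonroot(root, parent, node, dist):
    # Steps 1-3: detach `node` while keeping its subtrees.
    del parent.children[dist(node.word, parent.word)]
    # Step 4: re-insert every word from the detached subtrees,
    # which is where the O(k*h) cost comes from.
    for child in node.children.values():
        for w in all_words(child):
            insert(root, w, dist)
```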
Remove root node
Rebuild the tree from scratch without using the previous root node.
Rebuilding a tree will be at least O(n) in the best case and O(h*n) otherwise.
Alternative solution
That's why it's better not to delete a node physically, but to keep it in the tree and just mark it as deleted. This way it will be used, as before, when inserting new nodes, but will be excluded from the suggestion results for a misspelled word. This can be done in O(1).
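Continuing the sketch above, the tombstone variant might look like this (the field and function names are, again, made up):

```python
def mark_deleted(node):
    node.deleted = True  # O(1) logical deletion

def search(root, query, max_dist, dist):
    results, stack = [], [root]
    while stack:
        node = stack.pop()
        d = dist(query, node.word)
        # Tombstoned nodes still route the search but never appear in results.
        if d <= max_dist and not getattr(node, 'deleted', False):
            results.append(node.word)
        for k, child in node.children.items():
            if d - max_dist <= k <= d + max_dist:  # triangle-inequality prune
                stack.append(child)
    return results
```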
The idea of deleting a node in a BST is:
If the node has no children, delete it and set the parent's pointer to it to null.
If the node has one child, replace the node with its child by updating the parent's pointer.
If the node has two children, replace the node's key with its predecessor's key, then update the predecessor's parent's pointer so that it points to the predecessor's only possible child (which can only be a left child).
The last case can also be done using the successor instead of the predecessor!
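For concreteness, here is a sketch of deletion with a fair coin flip between predecessor and successor in the two-child case (my own illustration of the strategy discussed below, assuming distinct keys):

```python
import random

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    elif root.left is None:               # zero or one child
        return root.right
    elif root.right is None:
        return root.left
    elif random.random() < 0.5:           # two children: flip a fair coin
        pred = root.left                  # predecessor = max of left subtree
        while pred.right is not None:
            pred = pred.right
        root.key = pred.key
        root.left = delete(root.left, pred.key)
    else:
        succ = root.right                 # successor = min of right subtree
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root
```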
It's said that if we use the predecessor in some cases and the successor in others (giving them equal priority), we get better empirical performance.
Now the question is: how is this done? Based on what strategy? And how does it affect performance? (I guess by performance they mean time complexity.)
What I think is that we should choose the predecessor or the successor so as to keep the tree more balanced, but I don't know how to decide which one to use!
One solution is to choose one of them at random (fair randomness), but isn't it better to base the strategy on the tree's structure? The question is WHEN to choose WHICH.
The thing is that this is a fundamental problem: finding a correct removal algorithm for BSTs. People have been trying to solve it for 50 years (just like in-place merge), and they haven't found anything better than the usual algorithm (removal via predecessor/successor). So, what is wrong with the classic algorithm? Actually, this kind of removal unbalances the tree. After a long sequence of random add/remove operations you'll end up with an unbalanced tree of height O(sqrt(n)). And it doesn't matter what you chose - removing the successor or the predecessor (or choosing randomly between the two) - the result is the same.
So, what should you choose? I'm guessing that a random choice between successor and predecessor will postpone the unbalancing of your tree. But if you want a perfectly balanced tree, you have to use red-black trees or something like that.
As you said, it's a question of balance, so in general the method that disturbs the balance the least is preferable. You could maintain some metrics to measure the degree of balance (e.g., the difference between the maximal and minimal leaf heights, the average height, etc.), but I'm not sure whether the overhead is worth it. Also, there are self-balancing data structures (red-black trees, AVL trees, etc.) that mitigate this problem by rebalancing after each deletion. If you want to use a basic BST, I suppose the best strategy without a priori knowledge of the tree structure and the deletion sequence would be to alternate between the two methods on each deletion.
Edited after Alex Taggart's remark below.
I am using a zipper to easily traverse and edit a tree which can grow to many thousands of nodes. Each node is incomplete when it is first created. Data is going to be added/removed all the time in random positions, leaf nodes are going to be replaced by branches, etc.
The tree can be very unbalanced.
Fast random access to a node is also important.
One implementation would be to traverse the tree using a zipper and create a hash table of the nodes indexed by key. Needless to say, this would be very inefficient, as:
two copies of each node need to be created, and
any changes need to be consistently mirrored between the two data structures (tree and hash map).
In short, is there a time/space-efficient way to combine the ease of traversing/updating with a zipper and the fast access of a hash table in Clojure?
Clojure's data structures are persistent and use structural sharing. This means that operations like adding, removing or accumulating are not as inefficient as you describe. The memory cost will be minimal since you are not duplicating what's already there.
By default Clojure's data structures are immutable. The nodes in your tree-like structure will thus not update themselves unless you use some sort of reference type (like a Var). I don't know enough about your specific use case to advise on the best way to access nodes. One way to access nodes in a nested structure is the get-in function, where you supply the path to the node to return its value.
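A rough Python analogy of what get-in does (the Clojure original works on its persistent maps and vectors; this helper is just an illustration):

```python
from functools import reduce

def get_in(data, path, default=None):
    # Walk a nested dict/list structure along `path`, like Clojure's get-in.
    try:
        return reduce(lambda acc, key: acc[key], path, data)
    except (KeyError, IndexError, TypeError):
        return default

tree = {"a": {"b": [10, 20, 30]}}
print(get_in(tree, ["a", "b", 2]))  # -> 30
```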
Hope this helps solve your problem.