Disadvantages of top-down node splitting on insertion into B+ tree

For a B+ tree insertion why would you traverse down the tree then back upwards splitting the parents?
Wikipedia suggests this method of insertion:
1. Perform a search to determine what bucket the new record should go into.
2. If the bucket is not full (at most b - 1 entries after the insertion), add the record.
3. Otherwise, split the bucket.
   - Allocate a new leaf and move half the bucket's elements to the new bucket.
   - Insert the new leaf's smallest key and address into the parent.
   - If the parent is full, split it too.
     - Add the middle key to the parent node.
   - Repeat until a parent is found that need not split.
4. If the root splits, create a new root which has one key and two pointers.
Why would you traverse down the tree and then go back up performing the splits? Why not split the nodes as you encounter them on the way down?
To me, the proposed method performs twice the work and requires more bookkeeping as well.
Can anyone explain why this is the preferred method for insertion as opposed to splitting on the way down and what the disadvantages are for inserting during the traversal?

You have to backtrack up the tree because you don't actually know whether a split is required at the lowest level until you get there.
It's all there in the phrase "If the bucket is not full, ...".
You should also be aware that it's nowhere near twice the work. Since you're remembering all sorts of stuff on the way down (node pointers, indexes within the node, and so on), there's not as much calculation or searching on the way back up.
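The bookkeeping the answer refers to can be as small as a stack of (node, child-index) pairs recorded on the way down. Here is a rough Python sketch of that bottom-up approach, with an assumed fixed fan-out and key-only nodes (none of these details come from the original post):

```python
import bisect

MAX_KEYS = 3  # hypothetical fan-out: a node overflows past 3 keys

class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

def insert(root, key):
    # Descend, remembering the path (node pointer + child index) as we go.
    path, node = [], root
    while not node.leaf:
        i = bisect.bisect_left(node.keys, key)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, key)
    # Walk back up, splitting only the nodes that actually overflowed.
    while len(node.keys) > MAX_KEYS:
        mid = len(node.keys) // 2
        up, right = node.keys[mid], Node(node.leaf)
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        if not node.leaf:
            right.children = node.children[mid + 1:]
            node.children = node.children[:mid + 1]
        if path:
            parent, i = path.pop()
            parent.keys.insert(i, up)
            parent.children.insert(i + 1, right)
            node = parent
        else:  # the root itself split: grow the tree by one level
            new_root = Node(leaf=False)
            new_root.keys, new_root.children = [up], [node, right]
            return new_root
    return root
```

Since the path is already in hand, the upward pass is just pops and list inserts; no key comparisons or searching are repeated, which is why it is far less than twice the work.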

B-tree insertion: during the descent through the tree, why do we split every node with 2t-1 elements?

In the B-tree insertion algorithm, I see that in order to handle the case where we need to insert an element into a leaf with 2t-1 elements, we have to run the split algorithm on the tree. Something I don't understand is why, during the descent through the tree (to the target point), the algorithm splits every node with 2t-1 elements, even though that seems useless to me. For example:
[example diagram]
I understand that there is a case in which several nodes above the leaf have 2t-1 elements, and if we want to move the median up into them we face a problem. But why not handle that case specifically, instead of splitting every time?
Correct me if I say something wrong.
We split the full nodes on the way down to the target position because we don't know if we will need to "go back up." You can do it the way you are thinking, where we go down to the target node, split it, and then insert the median of the split into the parent, recursively splitting nodes as needed. But this requires us to go from the root, down to the target, and back up, potentially all the way to the root again. This might be undesirable, e.g. if accessing the nodes twice would be too expensive. In that case, it may be better to go in one pass straight down, where you split any full nodes to anticipate the need for more space.
For a demonstration, you can try inserting 10 into the trees in the middle and on the bottom of your drawing. The tree on the bottom, unsplit, needs to be split all the way to the root in the same way as the middle tree, because the two-pass algorithm didn't leave any space. In the middle tree, inserting 10 still causes a split, but it doesn't extend all the way up because the top two layers of the tree are very spacious.
There is an important caveat, though. Let t be the minimum number of children per node. For the two pass algorithm, the maximum number of children a node can have needs to be at least u = 2t - 1. If it is less, like 2t - 2, then splitting a full node (2t - 3 elements), even with the additional element to insert, will not be able to make two non-deficient nodes. The one pass algorithm requires a higher maximum, u = 2t. This is because the two-pass algorithm always has an element on hand to cancel exactly one deficiency. The one-pass algorithm does not have this ability, as it sometimes splits nodes unnecessarily, so it can't stick the element it's holding into one of the deficiencies. It might not belong there.
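For comparison, here is a rough Python sketch of the one-pass, split-on-the-way-down approach described above, in the style of CLRS (the node layout and helper names are my own assumptions):

```python
import bisect

T = 2  # minimum degree: every node holds between T-1 and 2T-1 keys

class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

def split_child(parent, i):
    # Split the full child (2T-1 keys) around its median, which moves up
    # into the parent; the parent is guaranteed non-full at this point.
    child = parent.children[i]
    right = Node(child.leaf)
    median = child.keys[T - 1]
    right.keys, child.keys = child.keys[T:], child.keys[:T - 1]
    if not child.leaf:
        right.children, child.children = child.children[T:], child.children[:T]
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, right)

def insert(root, key):
    if len(root.keys) == 2 * T - 1:      # a full root is split before descending
        new_root = Node(leaf=False)
        new_root.children = [root]
        split_child(new_root, 0)
        root = new_root
    node = root
    while not node.leaf:
        i = bisect.bisect_left(node.keys, key)
        if len(node.children[i].keys) == 2 * T - 1:
            split_child(node, i)         # pre-split: one pass, no backtracking
            if key > node.keys[i]:
                i += 1
        node = node.children[i]
    bisect.insort(node.keys, key)
    return root
```

Because every full node is split as it is passed, a promoted median always has room in its parent, so the algorithm never has to revisit a node.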
I've implemented B-trees several times, and have never split nodes on the way down.
Usually I implement insert recursively, such that node->insert(key, data) can return a new key to insert into the parent. The parent calls insert on the child node, and if the child splits it returns a new key to the parent. If the parent splits then it returns a key to its parent, and so on.
I've found that the insert implementation can stay pretty clean this way.
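That recursive shape might look roughly like this in Python (the fan-out and node layout are assumptions, not details from the answer):

```python
import bisect

MAX_KEYS = 3  # hypothetical fan-out

class Node:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

    def insert(self, key):
        """Insert key below self; return (median, new_right) if we split, else None."""
        if self.leaf:
            bisect.insort(self.keys, key)
        else:
            i = bisect.bisect_left(self.keys, key)
            promoted = self.children[i].insert(key)
            if promoted:                      # the child split: absorb its median
                median, right = promoted
                self.keys.insert(i, median)
                self.children.insert(i + 1, right)
        if len(self.keys) <= MAX_KEYS:
            return None
        mid = len(self.keys) // 2             # we overflowed: split ourselves
        right = Node(self.leaf)
        median, right.keys = self.keys[mid], self.keys[mid + 1:]
        self.keys = self.keys[:mid]
        if not self.leaf:
            right.children = self.children[mid + 1:]
            self.children = self.children[:mid + 1]
        return median, right

def tree_insert(root, key):
    promoted = root.insert(key)
    if promoted:                              # the root split: grow a new root
        median, right = promoted
        new_root = Node(leaf=False)
        new_root.keys, new_root.children = [median], [root, right]
        return new_root
    return root
```

The return value is None when no split happened, so the common case unwinds the recursion without doing any work in the parents.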

Implementing the Rope data structure using binary search trees (splay trees)

In a standard implementation of the Rope data structure using splay trees, the nodes would be ordered according to a rank statistic measuring the position of each one from the start of the string, so the keys normally found in a binary search tree would be irrelevant, would they not?
I ask because the keys shown in the graphic below (thanks Wikipedia!) are letters, which would presumably become non-unique once the number of nodes exceeded the length of the chosen alphabet. Wouldn't it be better to use integers or avoid using keys altogether?
Separately, can anyone point me to a good implementation of the logic to recompute rank statistics after each operation?
Presumably, if the index for a split falls within the substring attached to a particular node, say, between "Hel" and "llo_" on the node E above, you would remove the substring from E, split it and reattach it as two children of E. Correct?
Finally, after a certain number of such operations, the tree could, I suppose, end up with as many leaves as letters. What would be the best way to keep track of that and prune the tree (by combining substrings) as necessary?
Thanks!
For what it's worth, you can implement a Rope using Splay Trees by attaching a substring to each node of the binary search tree (not just to the leaf nodes as shown above).
The rank of each node is its size plus the size of its left subtree. But when recomputing ranks during splay operations, you need to remember to walk down the node.left.right branch, too.
If each node records a reference to the substring it represents (rather than the actual substring itself), everything runs faster. That way, when a split operation falls within an existing node, you just need to modify the node's attributes to reflect the right part of the substring you want to split, then add another node to represent the left part and merge it with the left subtree.
Done as above, each node records (in addition to its left, right and parent attributes etc.) its rank, size (in characters) and the location of the first character it represents in the string you're trying to modify. That way, you never actually modify the initial string: you just do your operations on bits of the tree and reproduce the final string when you're ready by walking it in order.
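A minimal Python sketch of that bookkeeping (the sample string, node shape and helper names are hypothetical, and splaying itself is omitted; after a rotation you would call update() on the rotated nodes bottom-up):

```python
S = "Hello_my_name_is_Simon"   # the original string; it is never modified

class RopeNode:
    def __init__(self, start, size, left=None, right=None):
        self.start, self.size = start, size   # this node covers S[start:start+size]
        self.left, self.right = left, right
        self.update()

    def update(self):
        # rank  = characters in the left subtree plus this node's own size
        # total = characters covered by the whole subtree
        left_total = self.left.total if self.left else 0
        right_total = self.right.total if self.right else 0
        self.rank = left_total + self.size
        self.total = self.rank + right_total

def char_at(node, i):
    """0-based character lookup driven by ranks, not by stored keys."""
    left_total = node.left.total if node.left else 0
    if i < left_total:
        return char_at(node.left, i)
    if i < node.rank:
        return S[node.start + (i - left_total)]
    return char_at(node.right, i - node.rank)

def report(node):
    """Reproduce the represented string by an in-order walk of the tree."""
    if node is None:
        return ""
    return report(node.left) + S[node.start:node.start + node.size] + report(node.right)
```

For instance, RopeNode(6, 3, RopeNode(0, 6), RopeNode(9, 5)) represents the three pieces "Hello_", "my_" and "name_" stitched together, while S itself stays untouched.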

Most performant way to find all the leaf nodes in a tree data structure

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!
Memory layout is essential for optimal retrieval, so the child lists should be contiguous arrays rather than linked lists, and the nodes should be placed one after another in retrieval order.
The more static your tree is, the better the layout can be.
All-in-one layout
All nodes in one array, totally ordered.
Pro:
- memory can be streamed for maximal throughput (hardware prefetch)
- no unneeded page lookups
- normal lookups can be made
- no extra memory spent on linked lists
- internal nodes use an offset to find each child relative to themselves
Con:
- inserting / deleting can be cumbersome
- insert / delete is O(N)
- an insert might trigger a resize of the array, leading to a costly copy
Two-array layout
- one array for internal nodes
- one array for leaves
- internal nodes point to the leaves
Pro:
- leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly interested only in the leaves)
- no unneeded page lookups
- indirect lookups can be made
Con:
- if the leaves are kept ordered, insert / delete can be cumbersome
- if the leaves are unordered, insertion is easy: just add at the end
- deleting unordered leaves is a problem if no tombstones are allowed, as the last leaf would have to be moved back and the internal nodes fixed up (via a further indirection this can also be solved; see slot-map)
- resizing either array might lead to a large copy, though a smaller one than in the all-in-one layout, as the two arrays can be resized independently
Array of arrays (dynamically sized; C++ vector of vectors)
- a contiguous array references the children of each node
Pro:
- running through each child list is fast
- each child array may be resized independently
Con:
- while this removes much of the extra work of linked-list children, the individual lists are dispersed among all the other data, making lookups take extra time
- an insert might cause a resize and copy of an array
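As a toy illustration of the two-array layout described above (the node shape and data are invented):

```python
# All leaf values live contiguously; internal nodes store index ranges into it.
leaf_values = [10, 11, 12, 20, 21, 30]

internal_nodes = [
    {"leaf_span": None, "children": [1, 2, 3]},   # root -> internal nodes 1..3
    {"leaf_span": (0, 3), "children": []},        # owns leaf_values[0:3]
    {"leaf_span": (3, 5), "children": []},        # owns leaf_values[3:5]
    {"leaf_span": (5, 6), "children": []},        # owns leaf_values[5:6]
]

def all_leaves():
    # Streaming every leaf is one pass over a single contiguous array,
    # without touching the internal nodes at all.
    return list(leaf_values)

def leaves_under(i):
    # Gathering the leaves below one internal node is a slice per subtree.
    node = internal_nodes[i]
    if node["leaf_span"]:
        lo, hi = node["leaf_span"]
        return leaf_values[lo:hi]
    out = []
    for c in node["children"]:
        out.extend(leaves_under(c))
    return out
```

In a lower-level language the two lists would be flat arrays or vectors, which is where the streaming benefit actually comes from.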
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things; the branch nodes along the way add only a constant-factor overhead.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.
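A minimal traversal along those lines (the dict-based node shape is an assumption; the idea carries over directly to JavaScript objects):

```python
def leaves(node):
    """Iterative depth-first collection of leaf values; O(n) over the whole tree."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        if n["children"]:
            stack.extend(reversed(n["children"]))  # reversed keeps left-to-right order
        else:
            out.append(n["value"])
    return out
```

With a depth of 5 and roughly 10 children per node, the whole tree is on the order of 10^5 nodes, so a single linear pass like this is already cheap; no special structure is needed unless you query leaves far more often than you modify the tree.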

Insertion In 2-3-4 tree

Consider the following 2-3-4 tree (i.e., a B-tree with a minimum degree of two) in which each data item is a letter. The usual alphabetical ordering of letters is used in constructing the tree.
What is the result of inserting G in the above tree?
I am getting the answer as
But the answer in solution key is
Can anyone explain how to get the answer provided by the solution key?
As long as the invariants are not violated, the operation is technically valid. The insertion algorithm in CLRS splits on the way down, so it would split the root like you did.
However, another implementation might observe that the second child has room while the first is full. That means a "rotation" can be done that leaves the root's key count unaffected: push L down into the second child (prepending it) and pull I up into L's previous place in the root. Now the first child has only two entries and you can insert into it.
Animated insertion using the CLRS method you used

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked it up in CLRS (Introduction to Algorithms), it seemed to use arrays for keys and children. My question is: is there any way to reduce this wasted space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
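To put rough numbers on the comparison (the 8-byte pointer size and the singly-linked layout are illustrative assumptions):

```python
PTR = 8  # assumed pointer size in bytes

def array_bytes(capacity):
    # An array of pointers pays for every slot, filled or not.
    return capacity * PTR

def list_bytes(used):
    # Each singly-linked link stores one element pointer plus one next pointer.
    return used * (PTR + PTR)

# A 10-slot array always costs 80 bytes; a linked list overtakes it
# as soon as it holds more than 5 elements (and that is before any
# per-object allocator or runtime overhead is counted).
```

This is the arithmetic behind the half-full rule: once a node's array is at least half occupied, it is more compact than the equivalent list.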
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When searching for a specific key, binary search is used to find the right position. Binary search within the nodes keeps the complexity of the B-tree search algorithm at O(log n).
If you replace the preallocated array with some kind of linked list, you lose the random access that binary search relies on. You could keep O(log n) search with some more complex structure, like a skip list, but that is unnecessary: a skip list by itself would already be better.
