a-b trees: reason for the requirement a ≤ (b+1)/2

The definition of a-b trees says
2 ≤ a ≤ (b+1)/2
Each internal node except the root has at least a children and at most b children.
The root has at most b children.
Now, can anyone explain the reason for the requirement a ≤ (b+1)/2?

Well, a ≤ (b+1)/2 is equivalent to 2a ≤ b+1, or 2a-1 ≤ b.
The interval from a to 2a-1 is the smallest interval for which the rebalancing algorithm used for (a,b) trees actually works. Picking a value of b greater than this minimum affects the space utilization of the nodes and the cost of the rebalancing part of the delete/insert operations, but the algorithm will still work.
Let's say, for example, a = k with k ≥ 2, and we pick a value of b less than 2a-1, say 2a-2 = 2k-2.
The insert rebalancing part says:
If the node you are inserting into already has b keys, so that they would become b+1:
    If the overflowing node is the tree root:
        create two new nodes, copy half of the root's entries into each,
        and put into the root pointers to these two new nodes, together with the key that separates them.
    Else:
        create a new node,
        move half of the entries of the overflowing node into the new node,
        and insert a pointer to the new node into the upper neighbor.
Now, if b = 2a-2, overflow occurs when you need to add one entry to a node that already has 2a-2 entries, leaving you with 2a-1 entries in total. If you split that node, one of the two halves will be left with fewer than a entries, creating an underflow, which is a problem. If b is the minimum possible value, 2a-1, then the overflowing node has 2a entries in total, which you can split into two nodes that each have exactly the minimum, a.
Edit:
For a more concrete example, suppose you could build an (a,b) tree with a=3 and b=4, i.e. b less than 2a-1 = 5. If you want to store exactly 5 values in this tree, how are you going to do it? You can't put all the values in one node, because 5 is more than 4. But you can't split them into two nodes either, because one of the two nodes would have 3 entries and the other would underflow (only 2 entries, which is prohibited).
Another simple example: if a=2, then the minimum value of b is 3. If b were 2, every node would have to hold exactly 2 entries at all times, which is not maintainable.
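
To make the arithmetic concrete, here is a minimal Python sketch of the feasibility check; it is illustrative only (the function name is made up), not part of an actual (a,b)-tree implementation:

    def can_split_overflow(a: int, b: int) -> bool:
        """An overflowing node holds b + 1 entries; splitting it must leave
        each half with at least a entries, so the smaller half decides."""
        smaller_half = (b + 1) // 2
        return smaller_half >= a

    # b = 2a - 1, the minimum legal b: overflow gives 2a entries,
    # which split into two nodes of exactly a entries each.
    assert can_split_overflow(3, 5)
    # b = 2a - 2, below the minimum: overflow gives 2a - 1 entries,
    # so one half would underflow (the a=3, b=4 example above).
    assert not can_split_overflow(3, 4)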

When exactly should the root split in a B-tree?

I learned B-trees recently, and from what I understand a node can have a minimum of t-1 keys and a maximum of 2t-1 keys, given minimum degree t. The exception is the root, which can have as few as 1 key.
Here is the example from CLRS, 3rd edition, Fig. 18.7 (page 498), where t=3:
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In the d) example, when L is inserted, why is the root split when it doesn't violate the B-tree properties at that moment (it has 5 keys, which is the maximum allowed)?
Why isn't inserting L into [J K L] without splitting [G M P T X] considered?
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node y (having 2t − 1 keys) around its median key y.key_t into two nodes having only t − 1 keys each. The median key moves up into y's parent to identify the dividing point between the two new trees. But if y's parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node y, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid having to split nodes while backtracking out of the recursion.
This algorithm is further described on page 495, with pseudocode.
This explains why, at the insertion of L, the root node is split immediately, before any descent into the tree is made.
Alternative algorithms would not do this, and would delay the split until the point where it is inevitable.
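
For illustration, here is a minimal Python sketch of the single-pass variant (a simplified, iterative rendition of CLRS's B-TREE-INSERT and B-TREE-SPLIT-CHILD; the Node class and names are my own, and duplicate keys are not handled):

    class Node:
        def __init__(self, leaf=True):
            self.keys = []        # at most 2t - 1 keys, kept sorted
            self.children = []    # at most 2t children
            self.leaf = leaf

    def split_child(parent, i, t):
        """Split the full child parent.children[i] (2t - 1 keys) around
        its median key, which moves up into the parent."""
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]       # t - 1 keys go right
        full.keys = full.keys[:t - 1]    # t - 1 keys stay left
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def insert(root, key, t):
        """Single pass down the tree; returns the (possibly new) root."""
        if len(root.keys) == 2 * t - 1:   # full root: split it right away,
            new_root = Node(leaf=False)   # before any descent (as with L above)
            new_root.children.append(root)
            split_child(new_root, 0, t)
            root = new_root
        node = root
        while not node.leaf:
            i = sum(1 for k in node.keys if k < key)   # child to descend into
            if len(node.children[i].keys) == 2 * t - 1:
                split_child(node, i, t)                # preemptive split
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        node.keys.insert(sum(1 for k in node.keys if k < key), key)
        return root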

RB tree with sum

I have some questions about augmenting data structures:
Let S = {k1, ..., kn} be a set of numbers. Design an efficient data structure for S that supports the following two operations: Insert(S, k), which inserts the number k into S (you can assume that k is not contained in S yet), and TotalGreater(S, a), which returns the sum of all keys ki ∈ S that are larger than a, that is, Σ_{ki ∈ S, ki > a} ki.
Argue the running time of both operations and give pseudo-code for TotalGreater(S, a) (do not give pseudo-code for Insert(S, k)).
I don't understand how to do this. I was thinking of adding an extra field to each RB-tree node called sum, but then it doesn't work, because sometimes I need only the sum of the left subtree and sometimes I need the sum of the right subtree too.
So I was thinking of adding two fields called leftSum and rightSum, and if the current node is greater than the given value, adding the cached sum of the subnodes to the running total.
Can someone please help me with this?
You can just add a field sum to each node, holding the sum of the keys in the subtree rooted at that node. When searching down the tree for the value a, two things can happen at each node on the path: you go left or you go right. Every time you go left, you add the current node's key plus the sum of its right child to the running total. Every time you go right, you add nothing.
There are two conditions for termination. 1) We find a node containing exactly the value a, in which case we add the sum of its right subtree to the total. 2) We reach a leaf, in which case we add its key if it is larger than a, or nothing if it is smaller.
As Jordi describes, the keyword here is: augmented red-black tree.
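
As a sketch of that idea, here is what the sum augmentation and TotalGreater could look like in Python, assuming a plain (unbalanced) BST for brevity; in a real solution the same sum field would live in each red-black tree node and be maintained through rotations. All names are illustrative:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None
            self.sum = key              # sum of all keys in this subtree

    def subtree_sum(node):
        return node.sum if node else 0

    def insert(node, key):
        """Plain BST insert that keeps the sum field up to date (O(h))."""
        if node is None:
            return Node(key)
        if key < node.key:
            node.left = insert(node.left, key)
        else:
            node.right = insert(node.right, key)
        node.sum += key
        return node

    def total_greater(node, a):
        """Sum of all keys strictly greater than a (O(h))."""
        total = 0
        while node is not None:
            if node.key > a:
                # node and its whole right subtree are > a; look left for more
                total += node.key + subtree_sum(node.right)
                node = node.left
            else:
                # node and its left subtree are <= a; only the right side matters
                node = node.right
        return total

    root = None
    for k in [5, 1, 8, 3, 9]:
        root = insert(root, k)
    assert total_greater(root, 4) == 5 + 8 + 9   # 22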

Why do B-trees and B+-trees keep each non-leaf node between half-full and completely full?

I've just learned about B-trees and B+-trees in DBMS.
I don't understand why a non-leaf node in the tree has between ⌈n/2⌉ and n children, when n is fixed for a particular tree.
Why is that? And what is the advantage of it?
Thanks!
This is the feature that makes the B-tree and B+-tree balanced; thanks to it, we can easily compute the complexity of operations on the tree and bound it by O(log n) [where n is the number of elements in the data set].
If a node could have more than B children, we could create a tree of depth 2: a root, with all other nodes being leaves hanging from it. Searching for an element would then be O(n), and not the desired O(log n).
If a node could have fewer than B/2 children, we could create a tree which is actually a linked list [n nodes, each with 1 child], with height n, and a search operation would again be O(n) instead of O(log n).
Small correction: every non-leaf node except the root has between B/2 and B children; the root alone is allowed to have fewer than B/2 children.
The basic assumption of this structure is a fixed block size; this is why each internal block has n slots for indexing its children.
When there is a need to add a child to a block that is full (has exactly n children), the block is split into two blocks, which then replace the original block in its parent's index. The number of children in each of the two blocks is obviously n div 2 (assuming n is even). This is where the lower limit comes from (see the sketch below).
If the parent is full too, the operation repeats, potentially up to the root itself.
The split operation, and allowing blocks to be only n/2 full, lets most insertions/deletions cause only local changes instead of rebalancing huge parts of the tree.
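
A tiny Python sketch of that split step (illustrative only, B-tree style, where the median key moves up into the parent's index): a full, sorted block splits into two blocks that are about half full, which is exactly where the half-full lower bound comes from:

    def split_block(keys):
        """Split a full, sorted block; the median moves up to the parent."""
        mid = len(keys) // 2
        return keys[:mid], keys[mid], keys[mid + 1:]

    left, median, right = split_block([1, 2, 3, 4, 5, 6, 7])   # a full block
    assert (left, median, right) == ([1, 2, 3], 4, [5, 6, 7])  # two half-full blocks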

Indexing count of buckets

So, here is my little problem.
Let's say I have a list of buckets a0 ... an, which respectively contain L ≤ c0 ... cn < H items. I can decide on the L and H limits, and I could even update them dynamically, though I don't think it would help much.
The order of the buckets matters. I can't go and swap them around.
Now, I'd like to index these buckets so that:
I know the total count of items
I can look up the ith element
I can add/remove items from any bucket and update the index efficiently
Seems easy, right? Seeing these criteria, I immediately thought of a Fenwick tree. That's what they are meant for, really.
However, when you think about it, a few more use cases creep in:
if a bucket count drops below L, the bucket must disappear (don't worry about the items yet)
if a bucket count reaches H, then a new bucket must be created because this one is full
I haven't figured out how to edit a Fenwick tree efficiently: removing/adding a node without rebuilding the whole tree...
Of course, we could set L = 0, so that removing buckets would become unnecessary; however, adding new buckets cannot really be avoided.
So here is the question:
Do you know either a better structure for this index, or how to update a Fenwick tree?
The primary concern is efficiency, and because I do plan to implement it, cache/memory considerations are worth worrying about.
Background:
I am trying to come up with a structure somewhat similar to B-trees and ranked skip lists, but with a localized index. The problem with those two structures is that the index is kept along with the data, which is inefficient cache-wise (i.e., you need to fetch multiple pages from memory). Database implementations suggest that keeping the index isolated from the actual data is more cache-friendly, and thus more efficient.
I have understood your problem as follows:
Each bucket has an internal order, and the buckets themselves have an order, so all the elements have some overall ordering, and you need the ith element in that ordering.
To solve that:
What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes, and the value of each internal node is the sum of the values of its immediate children. Keeping n a power of 2 will make it simple (you can always pad with zero-size buckets at the end), and the tree will be a complete tree.
Corresponding to each bucket, you maintain a pointer to its leaf node.
E.g., say the bucket sizes are 2, 1, 4, 8.
The tree will look like:

        15
       /  \
      3    12
     / \   / \
    2   1 4   8
If you want the total count, read the value of the root node.
If you want to modify some xk (i.e., change the corresponding bucket size), you walk up the tree from that leaf, following parent pointers and updating the values.
For instance, if you add 4 items to the second bucket, the tree becomes (the nodes marked with * are the ones that changed):

        19*
       /  \
      7*   12
     / \   / \
    2   5* 4  8
If you want to find the ith element, you walk down the tree, effectively doing a binary search: at each node you already have the left and right children's counts. If i is greater than the left child's value, you subtract the left child's value from i and recurse into the right subtree. If i is less than or equal to the left child's value, you recurse into the left subtree.
Say you wanted to find the 9th element in the tree above:
Since the left child of the root is 7 and 7 < 9, you subtract 7 from 9 (to get 2) and go right.
Since 2 < 4 (the left child of 12), you go left.
You are now at the leaf node corresponding to the third bucket, and you need to pick the second element in that bucket.
If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root, making the existing tree its left child, and adding a new subtree of all zero-size buckets except the one you added (which will be the leftmost leaf of that new subtree). This gives amortized O(1) time for adding a new bucket. The caveat is that you can only add a bucket at the end, not anywhere in the middle.
Getting the total count is O(1).
Updating a single bucket and looking up an item are O(log n).
Adding a new bucket is amortized O(1).
Space usage is O(n).
Instead of a binary tree, you can probably do the same with a B-Tree.
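
For concreteness, here is a minimal Python sketch of this cumulative value tree, stored as an implicit array (node v has children 2v and 2v+1) with the bucket sizes padded out to a power of two; the class name is made up, and bucket appending via doubling is left out for brevity:

    class CumulativeTree:
        def __init__(self, bucket_sizes):
            n = 1
            while n < len(bucket_sizes):
                n *= 2                          # pad with zero-size buckets
            self.n = n
            self.tree = [0] * (2 * n)
            self.tree[n:n + len(bucket_sizes)] = bucket_sizes
            for v in range(n - 1, 0, -1):       # each node = sum of its children
                self.tree[v] = self.tree[2 * v] + self.tree[2 * v + 1]

        def total(self):
            return self.tree[1]                 # the root holds the total count

        def update(self, k, delta):
            """Change bucket k's size by delta, walking up via parent links."""
            v = self.n + k
            while v >= 1:
                self.tree[v] += delta
                v //= 2

        def find_ith(self, i):
            """1-based: return (bucket, offset) of the i-th element overall."""
            assert 1 <= i <= self.total()
            v = 1
            while v < self.n:                   # binary search down the tree
                if i > self.tree[2 * v]:        # skip the whole left subtree
                    i -= self.tree[2 * v]
                    v = 2 * v + 1
                else:
                    v = 2 * v
            return v - self.n, i                # bucket index, rank inside it

    t = CumulativeTree([2, 1, 4, 8])
    t.update(1, 4)                              # second bucket: 1 -> 5
    assert t.total() == 19
    assert t.find_ith(9) == (2, 2)              # 2nd element of the third bucket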
I still hope for more answers; however, here is what I could come up with so far, following Moron's suggestion.
Apparently my little Fenwick tree idea cannot easily be adapted: it's easy to append new buckets at the end of a Fenwick tree, but not in the middle, so it's kind of a lost cause.
We're left with 2 data structures: Binary Indexed Trees (ironically, the very name Fenwick used to describe his structure) and Ranked Skip Lists.
Typically, these do not separate the data from the index; however, we can get this behavior by:
- using indirection: the element held by a node is a pointer to the bucket, not the bucket itself
- using pool allocation, so that the index elements, even though allocated independently from one another, are still close in memory, which helps the cache
I tend to prefer Skip Lists to Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.
These structures would allow getting to the ith element in O(log N); I don't know if better asymptotic performance is possible.
Another interesting implementation detail: I have a pointer to an element, but other elements might have been inserted/removed in the meantime, so how do I know the rank of my element now?
It's possible if each bucket points back to the node that owns it, but this means that either the node should never move, or it should update the bucket's pointer whenever it is moved around.

How many elements can be held in a B-tree of order n?

Is it 2n? Just checking.
Terminology
The order of a B-tree is inconsistently defined in the literature.
(see for example the terminology section of Wikipedia's article on B-Trees)
Some authors consider it to be the minimum number of keys a non-leaf node may hold, while others consider it to be the maximum number of child nodes a non-leaf node may hold (which is one more than the maximum number of keys such a node could hold).
Yet many others skirt around the ambiguity by assuming a fixed-length key (and fixed-size nodes), which makes the minimum and maximum the same; the two definitions of the order then produce values that differ by 1 (as said, the number of keys is always one less than the number of children).
I define depth as the number of nodes found in the search path to a leaf record, and inclusive of the root node and the leaf node. In that sense, a very shallow tree with only a root node pointing directly to leaf nodes has depth 2. If that tree were to grow and require an intermediate level of non-leaf nodes, its depth would be 3 etc.
How many elements can be held in a B-Tree of order n?
Assuming fixed length keys, and assuming that "order" n is defined as the maximum number of child nodes, the answer is:
(Average Number of elements that fit in one Leaf-node) * n ^ (depth - 1)
How do I figure?...:
The data (the "elements") is only held in leaf nodes, so the number of elements held is the average number of elements that fit in one node, times the number of leaf nodes.
The number of leaf nodes is itself driven by the number of children that fit in a non-leaf node (the order). For example, the non-leaf node just above a leaf node points to n (the order) leaf nodes; the non-leaf node above that one points to n similar nodes, and so on, hence the "to the power of (depth - 1)".
Note that the formula above generally holds using averages (of keys held in a non-leaf node, and of elements held in a leaf node) rather than assuming fixed key length and fixed record length: trees will typically have a node size that is commensurate with the key and record sizes, hence holding a number of keys or records big enough that the effective number held in any node varies relatively little compared with the average.
Example:
A tree of depth 4 (a root node, two levels of non-leaf nodes, and one level [obviously] of leaf nodes), of order 12 (non-leaf nodes can hold up to 11 keys, hence point to 12 nodes below them), and such that leaf nodes can contain 5 elements each, would:
- have its root node point to 12 nodes below it
- have each of these nodes point to 12 nodes below them (hence there will be 12 * 12 nodes in layer "3", assuming the root is layer 1 etc.; this numbering, by the way, is also ambiguously defined...)
- have each node in layer 3 point to 12 leaf nodes (hence there will be 12 * 12 * 12 leaf nodes)
- have each leaf node hold 5 elements (in this example case)
Hence, such a tree will hold:

Number of elements in said tree = 5 * 12 * 12 * 12
                                = 5 * (12 ^ 3)
                                = 5 * (12 ^ (depth - 1))
                                = 8640
Recognize the formula on the 3rd line.
What is generally remarkable about B-trees, and what makes for their popularity, is that a relatively shallow tree (one with a limited number of "hops" between the root and the sought record) can hold a relatively high number of records, since this number is multiplied by the order at each level.
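
As a quick sanity check of the formula, under the same assumptions (order n = maximum number of children, elements held only in leaf nodes; the function name is illustrative):

    def btree_capacity(elements_per_leaf, order, depth):
        """(Average elements per leaf node) * order ^ (depth - 1)."""
        return elements_per_leaf * order ** (depth - 1)

    assert btree_capacity(5, 12, 4) == 8640   # the worked example above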
My book says that the order of a B-tree is the maximum number of pointers that can be stored in a node (p. 348). The number of "keys" is one less than the order, so each node of a B-tree of order n can hold n-1 elements.
The book is "File Structures", second edition, by Michael J. Folk.
If your formula for the number of elements doesn't include an exponentiation somewhere, you've done it wrong.
A binary tree of order 5 can hold 2^0 + 2^1 + 2^2 + 2^3 + 2^4 elements, so 31 (which is 2^order - 1).
Edit:
I appear to have gotten order and depth / length mixed up. What on earth is the order of a binary tree? You appear to discuss B-trees as if they don't, by the very nature of their definition, hold a maximum of two child elements per element.
Let the order of a B-tree be m; this means the maximum number of keys that can be stored in a node is m-1. After that, the node splits.
For example, if the order is 3, then at most 2 keys fit in a node; on the arrival of the 3rd key, the node splits, following the property of a binary search tree / self-balancing tree.
