How many elements can be held in a B-tree of order n? - b-tree

Is it 2n? Just checking.

Terminology
The Order of a B-Tree is inconsistently defined in the literature.
(see for example the terminology section of Wikipedia's article on B-Trees)
Some authors consider it to be the minimum number of keys a non-leaf node may hold, while others consider it to be the maximum number of children nodes a non-leaf node may hold (which is one more than the maximum number of keys such a node could hold).
Yet many others skirt around the ambiguity by assuming a fixed length key (and fixed sized nodes), which makes the minimum and maximum the same, hence the two definitions of the order produce values that differ by 1 (as said the number of keys is always one less than the number of children.)
I define depth as the number of nodes found in the search path to a leaf record, and inclusive of the root node and the leaf node. In that sense, a very shallow tree with only a root node pointing directly to leaf nodes has depth 2. If that tree were to grow and require an intermediate level of non-leaf nodes, its depth would be 3 etc.
How many elements can be held in a B-Tree of order n?
Assuming fixed length keys, and assuming that "order" n is defined as the maximum number of child nodes, the answer is:
(Average Number of elements that fit in one Leaf-node) * n ^ (depth - 1)
How do I figure?...:
The data (the "elements") is only held in leaf nodes. So the number of elements held is the average number of elements that fit in one leaf node, times the number of leaf nodes.
The number of leaf nodes is itself driven by the number of children that fit in a non-leaf node (the order). For example the non-leaf node just above a leaf node, points to n (the order) leaf-nodes. Then, the non-leaf node above this non-leaf node points to n similar nodes etc, hence "to the power of (depth -1)".
Note that the formula above generally holds when using averages (of keys held in a non-leaf node, and of elements held in a leaf node) rather than assuming fixed key length and fixed record length: trees will typically have a node size that is commensurate with the key and record sizes, so each node holds a number of keys or records that is big enough that the effective number of keys or records held in any node varies relatively little compared with the average.
Example:
A tree of depth 4 (a root node, two levels of non-leaf nodes and one level [obviously] of leaf nodes) and of order 12 (non-leaf nodes can hold up to 11 keys, hence point to 12 nodes below them) and such that leaf nodes can contain 5 elements each, would:
- have its root node point to 12 nodes below it
- each node below it points to 12 nodes below it (hence there will be 12 * 12 nodes in layer "3", assuming the root is layer 1 etc.; this numbering, by the way, is also ambiguously defined...)
- each node in "layer 3" will point to 12 leaf nodes (hence there will be 12 * 12 * 12 leaf nodes)
- each leaf node has 5 elements (in this example case)
Hence.. such a tree will hold...
Nb Of Elements in said tree = 5 * 12 * 12 * 12
= 5 * (12 ^ 3)
= 5 * (12 ^ (depth - 1))
= 8640
Recognize the formula on the 3rd line.
What is generally remarkable about B-Trees, and what makes them popular, is that a relatively shallow tree (one with a limited number of "hops" between the root and the sought record) can hold a relatively high number of records. This number is multiplied by the order at each level.
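The formula above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the answer (order = maximum number of children, every node full, data held only in leaves); the function name btree_capacity and its parameter names are made up for illustration.

```python
def btree_capacity(order, depth, per_leaf):
    """Elements held when every node is full and data lives only in leaves.

    order    -- maximum number of children of a non-leaf node
    depth    -- nodes on the path from the root to a leaf, both included
    per_leaf -- elements that fit in one leaf node
    """
    # Each non-leaf level multiplies the node count by `order`,
    # and there are (depth - 1) non-leaf levels above the leaves.
    leaf_nodes = order ** (depth - 1)
    return per_leaf * leaf_nodes


print(btree_capacity(order=12, depth=4, per_leaf=5))  # 8640
```

Adding one more level to the same tree (depth 5) multiplies the result by 12 again, which is the point made above about shallow trees.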

My book says that the order of a B-tree is the maximum number of pointers that can be stored in a node. (p. 348) The number of "keys" is one less than the order. So a single node of a B-tree of order n can hold n-1 elements.
The book is "File Structures", second edition, by Michael J. Folk.

If your formula for the number of elements doesn't include an exponentiation somewhere, you've done it wrong.
A binary tree of order 5 can hold 2^0 + 2^1 + 2^2 + 2^3 + 2^4 elements, so 31 .. (which is 2^order - 1).
Edit:
I appear to have gotten order and depth / length mixed up. What on earth is the order of a binary tree? You appear to discuss B-trees as if they don't, by the very nature of their definition, hold a maximum of two child elements per element.

If the order of a B-tree is m, then the maximum number of keys that can be inserted into a node is m-1. After that, the node splits.
For example, if the order is 3, a node can hold at most 2 keys; on arrival of a 3rd element the node splits, following the properties of a self-balancing search tree.

Related

when exactly should root split in a B Tree

I learned B trees recently and from what I understand a node can have minimum t-1 keys and maximum 2t-1 keys given minimum degree t. Exception being root can have even 1 key.
Here is the example from CLRS 3rd edition Fig 18.7 (Page 498) where t=3
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In example d), when L is inserted, why is the root split when it doesn't violate the B-tree properties at that moment? (It has 5 keys, which is the maximum allowed.)
Why isn't inserting L into [J K L] without splitting [G M P T X] considered?
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node 𝑦 (having 2𝑡 − 1 keys) around its median key 𝑦.key_𝑡 into two nodes having only 𝑡 − 1 keys each. The median key moves up into 𝑦’s parent to identify the dividing point between the two new trees. But if 𝑦’s parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node 𝑦, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid having to split nodes while backtracking out of recursion.
This algorithm is further described on page 495 with pseudo code.
This explains why at the insertion of L the root node is split immediately before any recursive call is made.
Alternative algorithms would not do this, and would delay the split up to the point when it is inevitable.
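To make the "single pass down the tree" behaviour concrete, here is a minimal Python sketch of that variant. It is not the CLRS pseudocode itself: the class and attribute names (Node, BTree, root_splits) are invented for illustration, the descent is written as a loop instead of recursion, and t = 2 is used in the usage example to keep the tree small.

```python
import bisect


class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


class BTree:
    def __init__(self, t):
        self.t = t                # minimum degree: at most 2t - 1 keys per node
        self.root = Node()
        self.root_splits = 0      # bookkeeping for the demo only

    def _split_child(self, parent, i):
        """Split the full child parent.children[i] around its median key."""
        t = self.t
        y = parent.children[i]
        z = Node(leaf=y.leaf)
        median = y.keys[t - 1]
        z.keys = y.keys[t:]       # upper t - 1 keys go to the new node
        y.keys = y.keys[:t - 1]   # lower t - 1 keys stay
        if not y.leaf:
            z.children = y.children[t:]
            y.children = y.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, z)

    def insert(self, key):
        # A full root is split before descending, whether or not
        # the target leaf would have had room for the key.
        if len(self.root.keys) == 2 * self.t - 1:
            new_root = Node(leaf=False)
            new_root.children.append(self.root)
            self.root = new_root
            self._split_child(new_root, 0)
            self.root_splits += 1
        node = self.root
        while not node.leaf:
            i = bisect.bisect_right(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)   # split every full node on the way down
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, key)        # the leaf is guaranteed not to be full


tree = BTree(t=2)
for k in range(1, 9):
    tree.insert(k)
print(tree.root.keys)   # [2, 4, 6]: the root is full, the last leaf [7, 8] is not
tree.insert(9)
print(tree.root.keys)   # [4]: the root was split even though [7, 8] had room
```

The last insertion mirrors the situation in the question: the key would have fit into its leaf, but the full root is split first because the algorithm never looks ahead.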

How many comparisons a call to removeMin() will make in max heap of 7-ary tree?

Assume that a max heap with 10^6 elements is stored in a complete 7-ary tree. Approximately how many comparisons will a call to removeMin() make?
5000
50
10^6
500
5
My solution: The number of comparisons should be at most the number of leaf nodes, because in a max heap the minimum can be found at any of the leaf nodes, but that value is not among the options above. Another approach was to take the square of (log of 10^6 to the base 7), which gives 50, but this only works when we are sure that the search for the minimum follows a single branch down the tree, which is not the case in a max heap.
I hope that you can help.
There's no "natural" way to remove the minimum value from a max heap. You simply have to look at all the leaf nodes to figure out which one happens to be the minimum.
The question then is how many leaf nodes there are. Intuitively, we'd expect the number of leaves in the heap to be pretty close to the total number of nodes. Take it to the limit - if you have a 1,000,000-ary heap, you'd have one node in the top layer and all remaining 999,999 elements in the next layer. Even in the smallest case where the heap is a binary heap, you'd expect roughly half the elements to be in the bottom layer.
More specifically, let's do some math! How many leaves will a 7-ary heap with n nodes have? Well, each node in the tree will either
be a leaf, or
have seven children,
with one possible exception: since the bottommost row might not be full, there might be one node with fewer than seven children. Since that's just a one-off, we can ignore that last node when we're dealing with millions of elements. A quick proof by induction can be used to show that any tree where each node has either no children or seven children has six times as many leaf nodes as internal nodes, plus one (prove this!), so we'd expect about (6/7)ths of the nodes to be leaves, for a total of roughly 857,000 leaves to check.
As a result, the best answer here would be roughly 10^6 comparisons.
The minimum element can be any of the leaves of a max heap of any type, and there's no order among the leaves. All elements from A[10^6/7 + 1] onwards (where A is the array storing the heap) are leaf nodes and need to be checked. This means 857,142 comparisons just to find the minimum. After that there is no simple way to 'remove' the minimum without introducing a gap that cannot be filled by simply shifting the leaves.
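The leaf count can be double-checked with a short Python sketch. It assumes the usual 0-indexed array layout of a d-ary heap, where the children of node i sit at indices d*i + 1 through d*i + d; the function name leaf_count is made up for illustration.

```python
def leaf_count(n, d):
    """Number of leaves in a complete d-ary heap with n nodes (0-indexed array)."""
    # Node i has at least one child exactly when d*i + 1 < n,
    # so the internal nodes are indices 0 .. ceil((n - 1) / d) - 1.
    internal = (n + d - 2) // d   # integer form of ceil((n - 1) / d)
    return n - internal


leaves = leaf_count(10**6, 7)
print(leaves)        # 857143 leaves
print(leaves - 1)    # 857142 comparisons to find the smallest of them
```

Finding the smallest of k values takes k - 1 comparisons, which is where the count of comparisons comes from.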
This is a misprint. Maybe the teacher wanted to ask removeMax, for which the answer is close to 50 -- see below:
There are 7 comparisons per level done by the heapify since each node has 7 children. If h is the height of the heap then that's 7*h comparisons.
Rough analysis: (here ~ means approximately)
h ~ log_7(10^6) = 7.1, hence total comparisons 7*7.1 ~ 50
More accurate analysis:
A 7-ary heap with h full levels would have 1 + 7 + 7^2 + ... + 7^(h-1) = 10^6 elements
The left side is a geometric series that sums to (7^h - 1)/6, so (7^h - 1)/6 = 10^6
=> 7^h = 6*10^6 + 1
=> h = log_7(6*10^6 + 1) = 8 (approximately), hence 7*8 = 56; of the options, 50 is still the closest.
*A is the array that stores the heap.
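The removeMax estimate can also be computed without logarithms. The sketch below assumes a 0-indexed array where the parent of node i is (i - 1) // d, and counts 7 comparisons per level as in the answer above; the function names are hypothetical.

```python
def heap_height(n, d):
    """Edges on the path from the last node up to the root of a d-ary heap."""
    i, height = n - 1, 0
    while i > 0:
        i = (i - 1) // d   # move to the parent
        height += 1
    return height


def remove_max_comparisons(n, d):
    """Rough upper bound: d comparisons at each level the sift-down passes."""
    return d * heap_height(n, d)


print(heap_height(10**6, 7))             # 8
print(remove_max_comparisons(10**6, 7))  # 56
```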

Intuitive idea behind the build heap function

I want to know about the following statement in build heap function
for i=A.length/2 downto 1
Was this step deduced by trial and error to find the parents of the leaves, or did the person who developed this algorithm have something else in mind? Below is the build heap function:
Build_Max_Heap(A)
    A.heap_size = A.length
    for i = floor(A.length/2) downto 1
        Max_Heapify(A, i)
The nodes in the second half of the array are leaves (explanation follows in the next paragraph) and are thus trivially 1-node max heaps already, hence Max_Heapify need not be called for them.
Even if you call Max_Heapify for those nodes, there is no harm in terms of time complexity, as the leaf nodes have no nodes below them and Max_Heapify would return immediately anyway.
Mathematically, it's easy to establish that the second half of the array consists of leaves from the way the summation of a geometric progression works. Recall that a heap is a complete binary tree, which means it has all nodes filled in each level, except possibly the last level, which is filled partially from left to right. For the sake of simplicity, let's assume the number of nodes in our heap is 2^N - 1. Clearly this tree has N levels with all levels filled.
First level has 2^0 nodes
Second level has 2^1 nodes
Third level has 2^2 nodes
Level N-1 has 2^(N-2) nodes
Level N has 2^(N-1) nodes
Sum of the count of nodes in all levels except the last = 2^0 + 2^1 + ... + 2^(N-2)
= 2^(N-1) - 1
This is the number of nodes in the last level, off by one. That is, the total number of nodes in all levels except the last is almost the same as the number of nodes in just the last level, which directly implies that the last level must hold about half of all the nodes in the heap. By this observation, we get A.length/2.
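For reference, here is a small runnable Python version of the same idea. It is a sketch, not a transcription of the pseudocode: Python lists are 0-indexed, so the loop starts at len(a) // 2 - 1 instead of A.length/2, and max_heapify is written iteratively.

```python
def max_heapify(a, i, size):
    """Sift a[i] down until the subtree rooted at i is a max heap."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < size and a[left] > a[largest]:
            largest = left
        if right < size and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest


def build_max_heap(a):
    # Indices len(a) // 2 .. len(a) - 1 have no children (2*i + 1 >= len(a)),
    # so they are already 1-node heaps and can be skipped.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))


data = [3, 1, 4, 1, 5, 9, 2, 6]
build_max_heap(data)
print(data[0])  # 9, the maximum ends up at the root
```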

Why in B-tree and B+_tree store from half-full to complete-full in each non-leaf node

I've just learned about B-trees and B+-trees in DBMS.
I don't understand why a non-leaf node in the tree has between [n/2] and n children, when n is fixed for a particular tree.
Why is that, and what is the advantage of it?
Thanks !
This is the feature that makes the B+ and B-tree balanced, and due to it, we can easily compute the complexity of ops on the tree and bound it to O(logn) [where n is the number of elements in the data set].
If a node could have more than B sons, we could create a tree with depth 2: a root, with all other nodes being leaves hanging directly from the root. Searching for an element would then be O(n), and not the desired O(logn).
If a node could have fewer than B/2 sons, we could create a tree which is actually a linked list [n nodes, each with 1 son], with height n - and a search op would again be O(n) instead of O(logn).
Small correction: every non-leaf node, except the root, has B/2 to B children. The root alone is allowed to have fewer than B/2 sons.
The basic assumption of this structure is to have a fixed block size, this is why each internal block has n slots for indexing its children.
When there is a need to add a child to a block that is full (has exactly n children), the block is split into two blocks, which then replace the original block in its parent's index. The number of children in each of the two blocks is obviously n div 2 (assuming n is even). This is what the lower limit comes from.
If the parent is full, the operation repeats, potentially up to the root itself.
The split operation and allowing for n/2-filled blocks allows for most of the insertions/deletions to only cause local changes instead of re-balancing huge parts of the tree.

min/max number of records on a B+Tree?

I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records, what is the minimum (and maximum) number of levels B can have?
I can have as many/little keys on each page. I can also have as many/little number of pages.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor to tree depth is the "fan out" of each node in the tree.
Normally, in a simple binary tree, the fan out is 2: 2 nodes as children for each node in the tree.
But with a B+Tree, typically they have a fan out much larger.
One factor that comes in to play is the size of the node on disk.
For example, if you have a 4K page size, and, say, 4000 bytes of free space (not including any other pointers or other meta data related to the node), and let's say that a pointer to any other node in the tree is a 4 byte integer. If your B+Tree is in fact storing 4 byte integers, then the combined size (4 bytes of pointer information + 4 bytes of key information) = 8 bytes. 4000 free bytes / 8 bytes == 500 possible children.
That gives you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out that the tree is capable of. If you allow variable length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
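The arithmetic in this contrived case can be written out as a short Python sketch. The helper names and the 4000/4/4 byte figures are just the assumptions of the example above, not properties of any real storage engine.

```python
def fan_out(free_bytes, key_bytes, pointer_bytes):
    """How many (key, pointer) pairs fit in the free space of one page."""
    return free_bytes // (key_bytes + pointer_bytes)


def reachable_records(fan, levels):
    """Records referenced by `levels` full index levels with the given fan out."""
    return fan ** levels


fan = fan_out(free_bytes=4000, key_bytes=4, pointer_bytes=4)
print(fan)                        # 500
print(reachable_records(fan, 1))  # 500
print(reachable_records(fan, 2))  # 250000
print(1 + fan)                    # 501 index pages for the two-level case
```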
It depends on the arity of the tree. You have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is
Best case log_4(1000) = 5
Worst case log_{4/2}(1000) = 10
The arity is m and the number of records is n.
The best and worst case depends on the no. of children each node can have. For the best case, we consider the case, when each node has the maximum number of children (i.e. m for an m-ary tree) with each node having m-1 keys. So,
1st level(or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n = m^H - 1
which is equivalent to H=log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1(root) has at least 1 entry (and minimum 2 children)
2nd level has at least 2*(d-1) entries (where d=ceil(m/2) is the minimum number of children each internal node (except root) can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n = 2*d^(H-1) - 1, which is equivalent to H = log_d((n+1)/2) + 1
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to log_d((1000+1)/2) + 1
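As a quick check of both bounds, the sketch below evaluates them with integer arithmetic. It assumes the counting used in this answer (a full tree of height H holds m^H - 1 entries, and a minimally filled one holds 2*d^(H-1) - 1 with d = ceil(m/2)); the function names are made up, and m = 5 is picked only as an example value.

```python
def best_case_height(n, m):
    """Smallest H such that a completely full tree holds n entries: m^H - 1 >= n."""
    h = 0
    while m ** h - 1 < n:
        h += 1
    return h


def worst_case_height(n, m):
    """Largest H such that a minimally filled tree fits: 2*d^(H-1) - 1 <= n."""
    if m < 3:
        raise ValueError("this sketch needs m >= 3 so that d = ceil(m/2) >= 2")
    d = -(-m // 2)   # ceil(m / 2)
    h = 1
    # 2*d**h - 1 is the minimum number of entries in a tree of height h + 1
    while 2 * d ** h - 1 <= n:
        h += 1
    return h


print(best_case_height(1000, 5))   # 5
print(worst_case_height(1000, 5))  # 6
```

For m = 5 the closed forms give log_5(1001), about 4.3, rounded up to 5, and log_3(500.5) + 1, about 6.7, rounded down to 6, which agrees with the loop results.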

Resources