min/max number of records on a B+Tree? - data-structures

I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records: what is the minimum (and maximum) number of levels B can have?
I can have as many or as few keys on each page as I like. I can also have as many or as few pages as I like.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)

I don't have the math handy, but...
Basically, the primary factor in tree depth is the "fan out" of each node in the tree.
In a simple binary tree, the fan out is 2: each node in the tree has 2 nodes as children.
A B+Tree, however, typically has a much larger fan out.
One factor that comes into play is the size of the node on disk.
For example, say you have a 4K page size with 4000 bytes of free space (not including any other pointers or other metadata related to the node), and let's say that a pointer to any other node in the tree is a 4-byte integer. If your B+Tree is in fact storing 4-byte integers as keys, then the combined size (4 bytes of pointer information + 4 bytes of key information) is 8 bytes. 4000 free bytes / 8 bytes = 500 possible children.
That gives you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out that the tree is capable of. If you allow variable-length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
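The arithmetic above can be checked with a small sketch. The function names are made up for illustration, and the 4000 usable bytes, 4-byte keys and 4-byte pointers are simply the numbers from the contrived example, not properties of any real storage engine.

```python
def fan_out(usable_bytes, key_bytes, pointer_bytes):
    """Number of (key, pointer) pairs that fit in one node."""
    return usable_bytes // (key_bytes + pointer_bytes)


def reachable_records(fan, levels):
    """Records that a tree with `levels` index levels can reference."""
    return fan ** levels


def levels_needed(records, fan):
    """Smallest number of levels whose capacity covers `records`."""
    levels = 1
    while fan ** levels < records:
        levels += 1
    return levels


fan = fan_out(usable_bytes=4000, key_bytes=4, pointer_bytes=4)
print(fan)                        # 500
print(reachable_records(fan, 2))  # 250000
print(levels_needed(1000, fan))   # 2
```

With this fan out, 1,000 records already need a second level, since a single node references at most 500.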

It depends on the arity of the tree. You have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is
Best case log_4(1000) ≈ 5
Worst case log_{4/2}(1000) ≈ 10
The arity is m and the number of records is n.

The best and worst case depends on the no. of children each node can have. For the best case, we consider the case, when each node has the maximum number of children (i.e. m for an m-ary tree) with each node having m-1 keys. So,
1st level(or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n=m^H-1
which is equivalent to H=log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1(root) has at least 1 entry (and minimum 2 children)
2nd level has at least 2*(d-1) entries (where d=ceil(m/2) is the minimum number of children each internal node (except root) can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is n = 1 + 2*(d-1)*(1 + d + ... + d^(H-2)) = 2*d^(H-1) - 1, which is equivalent to H = log_d((n+1)/2) + 1
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to log_d((1000+1)/2) + 1
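As a rough check of the level-by-level counts above, the sketch below sums the entries per level with integer arithmetic instead of evaluating logarithms. Summing the worst-case levels gives 2*d^(H-1) - 1 entries for H levels. The function names are hypothetical, and m = 4 is only an example arity.

```python
def max_entries(m, height):
    """Entries in a tree of `height` levels when every node is full."""
    # Level h (1-based) has m^(h-1) nodes with m-1 keys each.
    return sum(m ** (h - 1) * (m - 1) for h in range(1, height + 1))


def min_entries(m, height):
    """Entries in a tree of `height` levels when every node is minimally filled."""
    d = -(-m // 2)  # ceil(m / 2): minimum children of a non-root internal node
    total = 1       # the root holds at least one key
    for h in range(2, height + 1):
        total += 2 * d ** (h - 2) * (d - 1)
    return total


def best_case_height(n, m):
    """Fewest levels that can hold n entries."""
    h = 1
    while max_entries(m, h) < n:
        h += 1
    return h


def worst_case_height(n, m):
    """Most levels a tree with n entries can have."""
    if m < 3:
        raise ValueError("m must be at least 3 so that d = ceil(m/2) >= 2")
    h = 1
    while min_entries(m, h + 1) <= n:
        h += 1
    return h


print(best_case_height(1000, 4))   # 5
print(worst_case_height(1000, 4))  # 9
```

Under this counting the heights are whole numbers of levels, so the worst case for m = 4 comes out as 9: a minimally filled tree with 10 levels would already need 1023 entries.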

Related

How many comparisons a call to removeMin() will make in max heap of 7-ary tree?

Assume that a max heap with 10^6 elements is stored in a complete 7-ary tree. Approximately how many comparisons will a call to removeMin() make?
5000
50
10^6
500
5
My solution: the number of comparisons should be at most the number of leaf nodes, because in a max heap the minimum can be at any of the leaf nodes, but that value is not among the options above. Another approach was to take the square of (log of 10^6 to the base 7), which gives 50, but this only works when we are sure that the minimum element lies along a single branch of the tree, which is not the case in a max heap.
I hope that you can help.
There's no "natural" way to remove the minimum value from a max heap. You simply have to look at all the leaf nodes to figure out which one happens to be the minimum.
The question then is how many leaf nodes there are. Intuitively, we'd expect the number of leaves in the heap to be pretty close to the total number of nodes. Take it to the limit - if you have a 1,000,000-ary heap, you'd have one node in the top layer and all remaining 999,999 elements in the next layer. Even in the smallest case where the heap is a binary heap, you'd expect roughly half the elements to be in the bottom layer.
More specifically, let's do some math! How many leaves will a 7-ary heap with n nodes have? Well, each node in the tree will either
be a leaf, or
have seven children,
with one possible exception: since the bottommost row might not be full, there might be one node with fewer than seven children. Since that's just a one-off, we can ignore that last node when we're dealing with millions of elements. A quick proof by induction can be used to show that any tree where each node either has no children or seven children will have six times as many leaf nodes as internal nodes, plus one (prove this!), so we'd expect that about (6/7)ths of the nodes will be leaves, for a total of roughly 857,000 leaves to check.
As a result, the best answer here would be roughly 10^6 comparisons.
The min element can be any of the leaves of a max heap of any type, and there's no order there. All elements from A[10^6/7 + 1] onwards (where A is the array storing the heap) are leaf nodes and need to be checked. This means 857,142 comparisons just to find the minimum. After that there is no simple way to 'remove' the minimum without introducing a gap that cannot be filled by simply shifting the leaves.
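The leaf count can also be computed directly from the array layout. The sketch below assumes a 0-based array in which the parent of index i is (i - 1) // k; the function names are illustrative only.

```python
def leaf_count(n, k):
    """Leaves in a complete k-ary heap of n nodes stored in a 0-based array."""
    if n <= 1:
        return n
    last_internal = (n - 2) // k  # parent of the last node (index n - 1)
    return n - (last_internal + 1)


def min_of_max_heap(heap, k):
    """Find the minimum of a k-ary max heap by scanning only its leaves."""
    first_leaf = len(heap) - leaf_count(len(heap), k)
    return min(heap[first_leaf:])


print(leaf_count(10**6, 7))                        # 857143
print(min_of_max_heap([9, 7, 8, 1, 5, 3, 6], 2))   # 1
```

Finding the smallest of 857,143 leaves takes 857,142 comparisons, which is on the order of 10^6.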
This is a misprint. Maybe the teacher wanted to ask removeMax, for which the answer is close to 50 -- see below:
There are 7 comparisons per level done by the heapify since each node has 7 children. If h is the height of the heap then that's 7*h comparisons.
Rough analysis: (here ~ means approximately)
h ~ log_7(10^6) = 7.1, hence total comparisons 7*7.1 ~ 50
More accurate analysis:
A 7-ary heap with h full levels would have 1 + 7 + 7^2 + ... + 7^(h-1) = 10^6 elements
On the left side is a geometric series that sums to (7^h - 1)/6 = 10^6
=> 7^h = 6*10^6 + 1
=> h = log_7(6*10^6 + 1) = 8 (approximately), hence 7*8 = 56; from the options, 50 is still the closest.
*A is the array storing the heap.
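Both estimates for removeMax can be reproduced numerically. This is only a sanity check of the arithmetic, assuming 7 comparisons per level as stated in the answer.

```python
import math

n = 10**6  # number of elements
k = 7      # children per node

rough_height = math.log(n, k)                # about 7.1
exact_height = math.log((k - 1) * n + 1, k)  # about 8.02, from the geometric series

print(round(k * rough_height))   # 50
print(k * round(exact_height))   # 56
```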

Why node in heap data structure in many types has only two children?

From the wiki
The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.
I can't understand why, in many types, a node in the heap has at most two children. Why aren't three children, four children, and so on common? Thanks~
It's not true that most types of heap have at most two children per node, but it is true that the binary heap -- which does have at most two children per node -- is the most commonly implemented type. It's the most commonly implemented type because it is simple, cache-friendly, and memory-efficient.
The data structures used for binary heaps could be used with a different number of children per node. The common operations in an x-ary heap would still take O(log N) time, if we consider x to be constant. To decide on the best x, however, we have to let it vary, and in that case common operations take O(x * log N / log x) time.
To determine the most efficient number of children per node, we can choose x to minimize the factor x/log x.
If you plot that, you can see that the best number of children per node is actually 3 (the minimum is at x=e, but we require an integer), but the difference between 2 and 3 is not significant, and the code is simpler using 2 children per node, so that is the common practice.
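The factor x/log x is easy to tabulate. The sketch below uses the natural logarithm (the base only scales all values by a constant), and `relative_cost` is a name chosen here for illustration.

```python
import math


def relative_cost(x):
    """The factor x / log x for an x-ary heap."""
    return x / math.log(x)


for x in range(2, 9):
    print(x, round(relative_cost(x), 3))

best = min(range(2, 9), key=relative_cost)
print(best)  # 3
```

The values for x = 2 and x = 4 coincide (about 2.885), while x = 3 gives about 2.731, so the gain from a ternary heap is only around 5%.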

Breadth first search branching factor and depth

I want to know how the time and memory are calculated. I tried to use O(b^d),
but it does not give me the same values.
The most important calculation is in the second column: the number of nodes in a complete tree. The formula for that is presented in the answer to "What is the total number of nodes in a full k-ary tree, in terms of the number of leaves?". Just replace k by 10, as your table talks about "branching factor b = 10":
N = (10^(d+1) - 1) / 9
For some reason the table you present does not count the root node: with the root node included, the count for a tree with depth 2 would be 111, not 110. But that is just a detail.
The time in seconds is calculated as the number of nodes (i.e. the value in column 2) divided by 100,000, as indicated in the footnote ("100,000 nodes/second"). It is quite trivial to translate a big number of seconds to minutes, hours, days, years, etc.
The footnote further mentions the assumption that the memory consumption is "1000 bytes/node", so it is a matter of multiplying the number of nodes (value in second column) by 1000. The table then actually uses the JEDEC memory standards for storage, where a kilobyte is not exactly 1000 bytes, but 1024 bytes. So you need to divide the number of bytes by that factor to get the number of kilobytes, then again for getting the number of megabytes, etc. See for instance "How to convert byte size into human readable format in java?".
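The calculation can be sketched as follows. The rate of 100,000 nodes/second and 1000 bytes/node are the footnote values quoted above; the function names and the unit list are assumptions made for this example, and the node count here includes the root.

```python
NODES_PER_SECOND = 100_000  # footnote value quoted above
BYTES_PER_NODE = 1000       # footnote value quoted above


def nodes_in_complete_tree(b, d):
    """1 + b + b^2 + ... + b^d, root included."""
    return (b ** (d + 1) - 1) // (b - 1)


def human_readable(num_bytes):
    """Format a byte count using 1024-based units."""
    units = ["bytes", "kilobytes", "megabytes", "gigabytes",
             "terabytes", "petabytes", "exabytes"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


for depth in (2, 4, 6, 8):
    n = nodes_in_complete_tree(10, depth)
    seconds = n / NODES_PER_SECOND
    print(depth, n, f"{seconds} s", human_readable(n * BYTES_PER_NODE))
```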

Sequentially Constructing Full B-Trees

If I have a sorted set of data, which I want to store on disk in a way that is optimal for both reading sequentially and doing random lookups on, it seems that a B-Tree (or one of the variants) is a good choice ... presuming this data-set does not all fit in RAM.
The question is: can a full B-Tree be constructed from a sorted set of data without doing any page splits, so that the sorted data can be written to disk sequentially?
Constructing a "B+ tree" to those specifications is simple.
Choose your branching factor k.
Write the sorted data to a file. This is the leaf level.
To construct the next highest level, scan the current level and write out every kth item.
Stop when the current level has k items or fewer.
Example with k = 2:
0 1|2 3|4 5|6 7|8 9
0 2 |4 6 |8
0 4 |8
0 8
Now let's look for 5. Use binary search to find the last number less than or equal to 5 in the top level, or 0. Look at the interval in the next lowest level corresponding to 0:
0 4
Now 4:
4 6
Now 4 again:
4 5
Found it. In general, the jth item corresponds to items jk through (j+1)k-1 at the next level. You can also scan the leaf level linearly.
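The construction and lookup described above can be sketched in memory with plain lists standing in for the on-disk levels. The names `build_levels` and `lookup` are made up for this example, and a real implementation would write each level to a file instead.

```python
from bisect import bisect_right


def build_levels(sorted_items, k):
    """Return [leaf level, ..., top level]; each level keeps every kth item."""
    levels = [list(sorted_items)]
    while len(levels[-1]) > k:
        levels.append(levels[-1][::k])
    return levels


def lookup(levels, k, target):
    """Return the leaf index of target, or None if it is absent."""
    lo, hi = 0, len(levels[-1])
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        # Last item <= target within the current interval [lo, hi).
        j = bisect_right(level, target, lo, hi) - 1
        if j < lo:
            return None  # target is smaller than everything stored
        if depth == 0:
            return j if level[j] == target else None
        # Item j covers items j*k .. (j+1)*k - 1 at the next lower level.
        lo = j * k
        hi = min((j + 1) * k, len(levels[depth - 1]))


levels = build_levels(range(10), 2)
for level in reversed(levels):
    print(level)
print(lookup(levels, 2, 5))   # 5
print(lookup(levels, 2, 10))  # None
```

For k = 2 this reproduces the four levels of the example, and the search for 5 visits the intervals 0 4, then 4 6, then 4 5.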
We can make a B-tree in one pass, but it may not be the optimal storage method. Depending on how often you make sequential queries vs random access ones, it may be better to store it in sequence and use binary search to service a random access query.
That said: assume that each node in your B-tree holds (m - 1) keys (m > 2, the binary case is a bit different). We want all the leaves on the same level and all the internal nodes to have at least (m - 1) / 2 keys. We know that a full B-tree of height k has (m^k - 1) keys. Assume that we have n keys total to store. Let k be the smallest integer such that m^k - 1 > n. Now if 2 m^(k - 1) - 1 < n we can completely fill up the inner nodes, and distribute the rest of the keys evenly to the leaf nodes, each leaf node getting either the floor or ceiling of (n + 1 - m^(k - 1))/m^(k - 1) keys. If we cannot do that then we know that we have enough to fill all of the nodes at depth k - 1 at least halfway and store one key in each of the leaves.
Once we have decided the shape of our tree, we need only do an inorder traversal of the tree sequentially dropping keys into position as we go.
Optimal meaning that an inorder traversal of the data will always be seeking forward through the file (or mmaped region), and a random lookup is done in a minimal number of seeks.

How many elements can be held in a B-tree of order n?

Is it 2n? Just checking.
Terminology
The Order of a B-Tree is inconsistently defined in the literature.
(see for example the terminology section of Wikipedia's article on B-Trees)
Some authors consider it to be the minimum number of keys a non-leaf node may hold, while others consider it to be the maximum number of children nodes a non-leaf node may hold (which is one more than the maximum number of keys such a node could hold).
Yet many others skirt around the ambiguity by assuming a fixed length key (and fixed sized nodes), which makes the minimum and maximum the same, hence the two definitions of the order produce values that differ by 1 (as said the number of keys is always one less than the number of children.)
I define depth as the number of nodes found in the search path to a leaf record, and inclusive of the root node and the leaf node. In that sense, a very shallow tree with only a root node pointing directly to leaf nodes has depth 2. If that tree were to grow and require an intermediate level of non-leaf nodes, its depth would be 3 etc.
How many elements can be held in a B-Tree of order n?
Assuming fixed length keys, and assuming that "order" n is defined as the maximum number of child nodes, the answer is:
(Average Number of elements that fit in one Leaf-node) * n ^ (depth - 1)
How do I figure?...:
The data (the "elements") is only held in leaf nodes. So the number of elements held is the average number of elements that fit in one node, times the number of leaf nodes.
The number of leaf nodes is itself driven by the number of children that fit in a non-leaf node (the order). For example the non-leaf node just above a leaf node, points to n (the order) leaf-nodes. Then, the non-leaf node above this non-leaf node points to n similar nodes etc, hence "to the power of (depth -1)".
Note that the formula above generally holds using the averages (of keys held in a non-leaf node, and of elements held in a leaf node) rather than assuming fixed key length and fixed record length: trees will typically have a node size that is commensurate with the key and record sizes, hence holding a number of keys or records that is big enough that the effective number of keys or records held in any leaf will vary relatively little compared with the average.
Example:
A tree of depth 4 (a root node, two levels of non-leaf nodes and one level [obviously] of leaf nodes) and of order 12 (non-leaf nodes can hold up to 11 keys, hence point to 12 nodes below them) and such that leaf nodes can contain 5 elements each, would:
- have its root node point to 12 nodes below it
- each node below it points to 12 nodes below it (hence there will be 12 * 12 nodes in "layer 3", assuming the root is layer 1 etc.; this numbering, by the way, is also ambiguously defined...)
- each node in "layer 3" will point to 12 leaf nodes (hence there will be 12 * 12 * 12 leaf nodes)
- each leaf node has 5 elements (in this example case)
Hence.. such a tree will hold...
Nb Of Elements in said tree = 5 * 12 * 12 * 12
= 5 * (12 ^ 3)
= 5 * (12 ^ (depth - 1))
= 8640
Recognize the formula on the 3rd line.
What is generally remarkable about B-Trees, and what makes for their popularity, is that a relatively shallow tree (one with a limited number of "hops" between the root and the sought record) can hold a relatively high number of records. This number is multiplied by the order at each level.
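The formula from this answer fits in one line of code. The function and parameter names are chosen here for illustration, with "order" taken to mean the maximum number of children, as assumed above.

```python
def btree_capacity(order, depth, elements_per_leaf):
    """Elements held when every non-leaf node points to `order` children."""
    return elements_per_leaf * order ** (depth - 1)


print(btree_capacity(order=12, depth=4, elements_per_leaf=5))  # 8640
print(btree_capacity(order=12, depth=5, elements_per_leaf=5))  # 103680
```

Adding a single level multiplies the capacity by the order, which is why shallow trees can hold so many records.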
My book says that the order of a B-tree is the maximum number of pointers that can be stored in a node. (p. 348) The number of "keys" is one less than the order. So each node of a B-tree of order n can hold n-1 elements.
The book is "File Structures", second edition, by Michael J. Folk.
If your formula for the number of elements doesn't include an exponentiation somewhere, you've done it wrong.
A binary tree of order 5 can hold 2^0 + 2^1 + 2^2 + 2^3 + 2^4 elements, so 31 .. (which is 2^order - 1).
Edit:
I appear to have gotten order and depth / length mixed up. What on earth is the order of a binary tree? You appear to discuss B-trees as if they don't, by the very nature of their definition, hold a maximum of two child elements per element.
If the order of a B-tree is m, then the maximum number of keys that can be inserted into a single node is m-1. After that, the node splits.
For example, if the order is 3, a node can hold at most 2 keys; on arrival of the 3rd element, the node splits, following the rules of a self-balancing search tree.

Resources