Why in B-tree and B+_tree store from half-full to complete-full in each non-leaf node - algorithm

I've just learn B-tree and B+-tree in DBMS.
I don't understand why a non-leaf node in tree has between [n/2] and n children, when n is fix for particular tree.
Why is that? and advantage of that?
Thanks !

This is the feature that makes the B+ and B-tree balanced, and due to it, we can easily compute the complexity of ops on the tree and bound it to O(logn) [where n is the number of elements in the data set].
If a node could have more then B sons, we could create a tree with depth 2: a root, and all other nodes will be leaves, from the root. searching for an element will be then O(n), and not the desired O(logn).
If a node could have less then B/2 sons, we could create a tree which is actually a linked list [n nodes, each with 1 son], with height n - and a search op will again be O(n) instead of O(logn)
Small currection: every non-leaf node - except the root, has B/2 to B children. the root alone is allowed to have less then B/2 sons.

The basic assumption of this structure is to have a fixed block size, this is why each internal block has n slots for indexing its children.
When there is a need to add a child to a block that is full (has exactly n children), the block is split into two blocks, which then replace the original block in its parent's index. The number of children in each of the two blocks is obviously n div 2 (assuming n is even). This is what the lower limit comes from.
If the parent is full, the operation repeats, potentially up to the root itself.
The split operation and allowing for n/2-filled blocks allows for most of the insertions/deletions to only cause local changes instead of re-balancing huge parts of the tree.

Related

when exactly should root split in a B Tree

I learned B trees recently and from what I understand a node can have minimum t-1 keys and maximum 2t-1 keys given minimum degree t. Exception being root can have even 1 key.
Here is the example from CLRS 3rd edition Fig 18.7 (Page 498) where t=3
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In the d) example when L is inserted why is the root splitted when it doesn't violate the B tree properties at the moment (It has 5 keys which is maximum allowed).
Why isn't inserting L into [J K L] without splitting [G M P T X] considered.
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node 𝑦 (having 2𝑡 − 1 keys) around its median key 𝑦:key𝑡 into two nodes having only 𝑡 − 1 keys each. The median key moves up into 𝑦’s parent to identify the dividing point between the two new trees. But if 𝑦’s parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node 𝑦, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid to have to split nodes while backtracking out of recursion.
This algorithm is further described on page 495 with pseudo code.
This explains why at the insertion of L the root node is split immediately before any recursive call is made.
Alternative algorithms would not do this, and would delay the split up to the point when it is inevitable.

Why is the number of sub-trees gained from a range tree query is O(log(n))?

I'm trying to figure out this data structure, but I don't understand how can we
tell there are O(log(n)) subtrees that represents the answer to a query?
Here is a picture for illustration:
Thanks!
If we make the assumption that the above is a purely functional binary tree [wiki], so where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds, than we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. So we can, without any bounds return a tree in O(1), given that tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
the cutted version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node has a value is smaller than the threshold, we return a "cutted" version of the right subtree; and
if the node has a value greater than the threshold, we return an inode that has the same right subtree, and as left subchild the cutted version of the left subchild.
So in pseudo-code it looks like:
def treelarger(some_node, min):
if some_tree is None:
return None
if some_node.value > min:
return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
else:
return treelarger(some_node.right, min)
This algorithm thus runs in O(h) with h the height of the tree, since for each case (except the first one), we recurse to one (not both) of the children, and it ends in case we have a node without children (or at least does not has a subtree in the direction we need to cut the subtree).
We thus do not make a complete copy of the tree. We reuse a lot of nodes in the old tree. We only construct a new "surface" but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct, at most, O(h) new nodes. We can optimize the above such that, given the cutted version of one of the subtrees is the same, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:
def treesmaller(some_node, max):
if some_tree is None:
return None
if some_node.value < min:
return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
else:
return treesmaller(some_node.left, max)
so roughly speaking there are two differences:
we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.
Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
Hope this helps!

Why does a Binary Heap has to be a Complete Binary Tree?

The heap property says:
If A is a parent node of B then the key of node A is ordered with
respect to the key of node B with the same ordering applying across
the heap. Either the keys of parent nodes are always greater than or
equal to those of the children and the highest key is in the root node
(this kind of heap is called max heap) or the keys of parent nodes are
less than or equal to those of the children and the lowest key is in
the root node (min heap).
But why in this wiki, the Binary Heap has to be a Complete Binary Tree? The Heap Property doesn't imply that in my impression.
According to the wikipedia article you provided, a binary heap must conform to both the heap property (as you discussed) and the shape property (which mandates that it is a complete binary tree). Without the shape property, one would lose the runtime advantage that the data structure provides (i.e. the completeness ensures that there is a well defined way to determine the new root when an element is removed, etc.)
Every item in the array has a position in the binary tree, and this position is calculated from the array index. The positioning formula ensures that the tree is 'tightly packed'.
For example, this binary tree here:
is represented by the array
[1, 2, 3, 17, 19, 36, 7, 25, 100].
Notice that the array is ordered as if you're starting at the top of the tree, then reading each row from left-to-right.
If you add another item to this array, it will represent the slot below the 19 and to the right of the 100. If this new number is less than 19, then values will have to be swapped around, but nonetheless, that is the slot that will be filled by the 10th item of the array.
Another way to look at it: try constructing a binary heap which isn't a complete binary tree. You literally cannot.
You can only guarantee O(log(n)) insertion and (root) deletion if the tree is complete. Here's why:
If the tree is not complete, then it may be unbalanced and in the worst case, simply a linked list, requiring O(n) to find a leaf, and O(n) for insertion and deletion. With the shape requirement of completeness, you are guaranteed O(log(n)) operations since it takes constant time to find a leaf (last in array), and you are guaranteed that the tree is no deeper than log2(N), meaning the "bubble up" (used in insertion) and "sink down" (used in deletion) will require at most log2(N) modifications (swaps) of data in the heap.
This being said, you don't absolutely have to have a complete binary tree, but you just loose these runtime guarantees. In addition, as others have mentioned, having a complete binary tree makes it easy to store the tree in array format forgoing object reference representation.
The point that 'complete' makes is that in a heap all interior (not leaf) nodes have two children, except where there are no children left -- all the interior nodes are 'complete'. As you add to the heap, the lowest level of nodes is filled (with childless leaf nodes), from the left, before a new level is started. As you remove nodes from the heap, the right-most leaf at the lowest level is removed (and pushed back in at the top). The heap is also perfectly balanced (hurrah!).
A binary heap can be looked at as a binary tree, but the nodes do not have child pointers, and insertion (push) and deletion (pop or from inside the heap) are quite different to those procedures for an actual binary tree.
This is a direct consequence of the way in which the heap is organised. The heap is held as a vector with no gaps between the nodes. The parent of the i'th item in the heap is item (i - 1) / 2 (assuming a binary heap, and assuming the top of the heap is item 0). The left child of the i'th item is (i * 2) + 1, and the right child one greater than that. When there are n nodes in the heap, a node has no left child if (i * 2) + 1 exceeds n, and no right child if (i * 2) + 2 does.
The heap is a beautiful thing. It's one flaw is that you do need a vector large enough for all entries... unlike a real binary tree, you cannot allocate a node at a time. So if you have a heap for an indefinite number of items, you have to be ready to extend the underlying vector as and when needed -- or run some fragmented structure which can be addressed as if it was a vector.
FWIW: when stepping down the heap, I find it convenient to step to the right child -- (i + 1) * 2 -- if that is < n then both children are present, if it is == n only the left child is present, otherwise there are no children.
By maintaining binary heap as a complete binary gives multiple advantages such as
1.heap is complete binary tree so height of heap is minimum possible i.e log(size of tree). And insertion, build heap operation depends on height. So if height is minimum then their time complexity will be reduced.
2.All the items of complete binary tree stored in contiguous manner in array so random access is possible and it also provide cache friendliness.
In order for a Binary Tree to be considered a heap two it must meet two criteria. 1) It must have the heap property. 2) it must be a complete tree.
It is possible for a structure to have either of these properties and not have the other, but we would not call such a data structure a heap. You are right that the heap property does not entail the shape property. They are separate constraints.
The underlying structure of a heap is an array where every node is an index in an array so if the tree is not complete that means that one of the index is kept empty which is not possible beause it is coded in such a way that each node is an index .I have given a link below so that u can see how the heap structure is built
http://www.sanfoundry.com/java-program-implement-min-heap/
Hope it helps
I find that all answers so far either do not address the question or are, essentially, saying "because the definition says so" or use a similar circular argument. They are surely true but (to me) not very informative.
To me it became immediately obvious that the heap must be a complete tree when I remembered that you insert a new element not at the root (as you do in a binary search tree) but, rather, at the bottom right.
Thus, in a heap, a new element propagates from the bottom up - it is "moved up" within the tree till it finds a suitable place.
In a binary search tree a newly inserted element moves the other way round - it is inserted at the root and it "moves down" till it finds its place.
The fact that each new element in a heap starts as the bottom right node means that the heap is going to be a complete tree at all times.

How to efficiently check whether it's height balanced for a massively skewed binary search tree?

I was reading this answer on how to check if a BST is height balanced, and really hooked by the bonus question:
Suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?
What would be a good strategy here?
I am thinking to do a level order traversal and track the depth, if a leaf is found and current node depth is bigger than the leaf node depth + 2, then it's not balanced. But how to combine this with height checking?
Edit: below is the implementation in the linked answer
IsHeightBalanced(tree)
return (tree is empty) or
(IsHeightBalanced(tree.left) and
IsHeightBalanced(tree.right) and
abs(Height(tree.left) - Height(tree.right)) <= 1)
To review briefly: a tree is defined as being either null or a root node with pointers .left to a left child and .right to a right child, where each child is in turn a tree, the root node appears in neither child, and no node appears in both children. The depth of a node is the number of pointers that must be followed to reach it from the root node. The height of a tree is -1 if it's null or else the maximum depth of a node that appears in it. A leaf is a node whose children are null.
First let me note the two distinct definitions of "balanced" proposed by answerers of the linked question.
EL-balanced A tree is EL-balanced if and only if, for every node v, |height(v.left) - height(v.right)| <= 1.
This is the balance condition for AVL trees.
DF-balanced A tree is DF-balanced if and only if, for every pair of leaves v, w, we have |depth(v) - depth(w)| <= 1. As DF points out, DF-balance for a node implies DF-balance for all of its descendants.
DF-balance is used for no algorithm known to me, though the balance condition for binary heaps is very similar, requiring additionally that the deeper leaves be as far left as possible.
I'm going to outline three approaches to testing balance.
Size bounds for balanced trees
Expand the recursive function to have an extra parameter, maxDepth. For each recursive call, pass maxDepth - 1, so that maxDepth roughly tracks how much stack space is left. If maxDepth reaches 0, report the tree as unbalanced (e.g., by returning "infinity" for the height), since no balanced tree that fits in main memory could possibly be that tall.
This approach relies on an a priori size bound on main memory, which is available in practice if not in all theoretical models, and the fact that no subtrees are shared. (PROTIP: unless you're very careful, your subtrees will be shared at some point during development.) We also need height bounds on balanced trees of at most a given size.
EL-balanced Via mutual induction, we prove a lower bound, L(h), on the number of nodes belonging to an EL-balanced tree of a given height h.
The base cases are
L(-1) = 0
L(0) = 1,
more or less by definition. The inductive case is trickier. An EL-balanced tree of height h > 0 is a node with an EL-balanced child of height h - 1 and another EL-balanced child of height either h - 1 or h - 2. This means that
L(h) = 1 + L(h - 1) + min(L(h - 2), L(h - 1)).
Add 1 to both sides and rearrange.
L(h) + 1 = L(h - 1) + 1 + min(L(h - 2) + 1, L(h - 1) + 1).
A little while later (spoiler), we find that
L(h) <= phi^(h + 2)/sqrt(5),
where phi = (1 + sqrt(5))/2 ~ 1.618.
maxDepth then should be set to the floor of the base-phi logarithm of the maximum number of nodes, plus a small constant that depends on fenceposty things.
DF-balanced Rather than write out an induction proof, I'm going to appeal to your intuition that the worst case is a complete binary tree with one extra leaf on the bottom. Then the proper setting for maxDepth is the base-2 logarithm of the maximum number of nodes, plus a small constant.
Iterative deepening depth-first search
This is the theoretician's version of the answer above. Because, for some reason, we don't know how much RAM our computer has (and with logarithmic space usage, it's not as though we need a tight bound), we again include the maxDepth parameter, but this time, we use it to truncate the tree implicitly below the specified depth. If the height of the tree comes back below the bound, then we know that the algorithm ran successfully. Alternatively, if the truncated tree is unbalanced, then so is the whole tree. The problem case is when the truncated tree is balanced but with height equal to maxDepth. Then we increase maxDepth and retry.
The simplest retry strategy is to increase maxDepth by 1 every time. Since balanced trees with n nodes have height O(log n), the running time is O(n log n). In fact, for DF-balanced trees, the running time is also O(n), since, except for the last couple traversals, the size of the truncated tree increases by a factor of 2 each time, leading to a geometric series.
Another strategy, doubling maxDepth each time, gives an O(n) running time for EL-balanced trees, since the largest tree of height h, with 2^(h + 1) - 1 nodes, is much smaller than the smallest tree of height 2h, with approximately (phi^2)^h nodes. The downside of doubling is that we may use twice as much stack space. With increase-by-1, however, in the family of minimum-size EL-balanced trees we constructed implicitly in defining L(h), the number of nodes at depth h - k in the tree of height h is polynomial of degree k. Accordingly, the last few scans will incur some superlinear term.
Temporarily mutating pointers
If there are parent pointers, then it's easy to traverse depth-first in place, because the parent pointers can be used to derive the relevant information on the stack in an efficient manner. If we don't have parent pointers but can mutate the tree temporarily, then, for descent into a child, we can cannibalize the pointer to that child to store temporarily the node's parent. The problem is determining on the way up whether we came from a left or a right child. If we can sneak a bit (say because pointers are 2-byte aligned, or because there's a spare bit in the balance factor, or because we're copying the tree for stop-and-copy garbage collection and can determine which arena we're in), then that's one way. Another test assumes that the tree is a binary search tree. It turns out that we don't need additional assumptions, however: Explain Morris inorder tree traversal without using stacks or recursion .
The one fly in the ointment is that this approach only works, as far as I know, on DF-balance, since there's no space on the stack to put the partial results for EL-balance.

Data structure supporting Add and Partial-Sum

Let A[1..n] be an array of real numbers. Design an algorithm to perform any sequence of the following operations:
Add(i,y) -- Add the value y to the ith number.
Partial-sum(i) -- Return the sum of the first i numbers, i.e.
There are no insertions or deletions; the only change is to the values of the numbers. Each operation should take O(logn) steps. You may use one additional array of size n as a work space.
How to design a data structure for above algorithm?
Construct a balanced binary tree with n leaves; stick the elements along the bottom of the tree in their original order.
Augment each node in the tree with "sum of leaves of subtree"; a tree has #leaves-1 nodes so this takes O(n) setup time (which we have).
Querying a partial-sum goes like this: Descend the tree towards the query (leaf) node, but whenever you descend right, add the subtree-sum on the left plus the element you just visited, since those elements are in the sum.
Modifying a value goes like this: Find the query (left) node. Calculate the difference you added. Travel to the root of the tree; as you travel to the root, update each node you visit by adding in the difference (you may need to visit adjacent nodes, depending if you're storing "sum of leaves of subtree" or "sum of left-subtree plus myself" or some variant); the main idea is that you appropriately update all the augmented branch data that needs updating, and that data will be on the root path or adjacent to it.
The two operations take O(log(n)) time (that's the height of a tree), and you do O(1) work at each node.
You can probably use any search tree (e.g. a self-balancing binary search tree might allow for insertions, others for quicker access) but I haven't thought that one through.
You may use Fenwick Tree
See this question

Resources