B+ Tree creation - data-structures

I am trying to understand B+ trees. I have done some reading about it.
One thing I am confused about. For creation of tree some articles give no. of keys=n, no. of pointers=1+n and some increase them by 1.
For example I have to make a B+ tree of order 3 with
6,2,9,16,12,17,21,18
Here the root shall have 3 numbers and 4 pointers OR 4 numbers and 5 pointers.

The order measures the branching factor, or the maximum number of keys. When the root is alone it has one pointer, to its own key. Once more nodes are added it will have 1 pointer to its key and then n pointers to its children, where n is the order of the tree. In this case the B+ tree root will have one pointer to number (key) and up to 3 pointer to node. for a total of 4 pointers.
For more one b+ tree creation look at the insertion section:
B+ Tree Insertion Wikipedia

Related

B+ Trees internal nodes

I was working through a textbook and got stuck on this question:
"Consider a B+ Tree where each leaf block can contain a maximum of 3 records, each internal block can contain a maximum of 3 keys, all record blocks in the tree are fully occupied with 3 records each and the records have key values: 5,10,15,..., and there are 4 record blocks in the file"
Question: "Draw this tree in a single diagram"
So far I've added all the records in the leaf level, there's 4 blocks with 3 keys so 12 values total, so my leaf level has all multiples of 5 from 5 to 60. I'm now stuck on what to add on the level above it (internal block).
You have already done the right thing for the leaf level. There is only one internal block needed, which will have 4 pointers to those leaf blocks, and 3 keys. Those 3 keys are keys that typically are copies from the least keys in the blocks below that block. No key of the first block is repeated in that internal block, only of the other blocks.
One way of illustrating this structure, is like this:
Often the leaf blocks are linked together, in a singly or doubly linked list, although this is not a strict requirement for B+ trees. I have not depicted this above.

Range Minimum Query <O(n), O(1)> approach (from tree to restricted RMQ)

So, I read this TopCoder tutorial on RMQ (Range Minimum Query), and I got a big question.
On the section where he introduced the approach, what I can understand until now is this:
(The whole approach actually uses methodology introduced in Sparse Table (ST) Algorithm, Reduction from LCA to RMQ, and from RMQ to LCA)
Given an array A[N], we need to transform it to a Cartesian Tree, thus making a RMQ problem a LCA (Lowest Common Ancestor) problem. Later, we can get a simplified version of the array A, and make it a restricted RMQ problem.
So it's basically two transforms. So the first RMQ to LCA part is simple. By using a stack, we can make the transform in O(n) time, resulting an array T[N] where T[i] is the element i's parent. And the tree is completed.
But here's what I can't understand. The O(n) approach needs an array where |A[i] - A[i-1]| = 1, and that array is introduced in Reduction from LCA to RMQ section of the tutorial. That involves a Euler Tour of this tree. But how can this be achieved with my end result from the transform? My approach to it is not linear, so it should be considered bad in this approach, what would be the linear approach for this?
UPDATE: The point that confuse me
Here's the array A[]:
n : 0 1 2 3 4 5 6 7 8 9
A[n]: 2 4 3 1 6 7 8 9 1 7
Here's the array T[]:
n : 0 1 2 3 4 5 6 7 8 9
T[n]: 3 2 0 * 8 4 5 6 3 8 // * denotes -1, which is the root of the tree
//Above is from RMQ to LCA, it's from LCA to RMQ part that confuses me, more below.
A picture of the tree:
A Euler Tour needs to know each node's child, just like a DFS (Depth-First Search) while T[n] have only each element's root, not child.
Here is my current understanding of what's confusing you:
In order to reduce RMQ to LCA, you need to convert the array into a tree and then do an Euler tour of that tree.
In order to do an Euler tour, you need to store the tree such that each node points at its children.
The reduction that is listed from RMQ to LCA has each node point to its parent, not its children.
If this is the case, your concerns are totally justified, but there's an easy way to fix this. Specifically, once you have the array of all the parent pointers, you can convert it to a tree where each node points to its children in O(n) time. The idea is the following:
Create an array of n nodes. Each node has a value field, a left child, and a right child.
Initially, set the nth node to have a null left child, null right child, and the value of the nth element from the array.
Iterate across the T array (where T[n] is the parent index of n) and do the following:
If T[n] = *, then the nth entry is the root. You can store this for later use.
Otherwise, if T[n] < n, then you know that node n must be a right child of its parent, which is stored at position T[n]. So set the T[n]th node's right child to be the nth node.
Otherwise, if T[n] > n, then you know that node n must be a left child of its parent, which is stored at position T[n]. So set the T[n]th node's left child to be the nth node.
This runs in time O(n), since each node is processed exactly once.
Once this is done, you've explicitly constructed the tree structure that you need and have a pointer to the root. From there, it should be reasonably straightforward to proceed with the rest of the algorithm.
Hope this helps!

Why in B-tree and B+_tree store from half-full to complete-full in each non-leaf node

I've just learn B-tree and B+-tree in DBMS.
I don't understand why a non-leaf node in tree has between [n/2] and n children, when n is fix for particular tree.
Why is that? and advantage of that?
Thanks !
This is the feature that makes the B+ and B-tree balanced, and due to it, we can easily compute the complexity of ops on the tree and bound it to O(logn) [where n is the number of elements in the data set].
If a node could have more then B sons, we could create a tree with depth 2: a root, and all other nodes will be leaves, from the root. searching for an element will be then O(n), and not the desired O(logn).
If a node could have less then B/2 sons, we could create a tree which is actually a linked list [n nodes, each with 1 son], with height n - and a search op will again be O(n) instead of O(logn)
Small currection: every non-leaf node - except the root, has B/2 to B children. the root alone is allowed to have less then B/2 sons.
The basic assumption of this structure is to have a fixed block size, this is why each internal block has n slots for indexing its children.
When there is a need to add a child to a block that is full (has exactly n children), the block is split into two blocks, which then replace the original block in its parent's index. The number of children in each of the two blocks is obviously n div 2 (assuming n is even). This is what the lower limit comes from.
If the parent is full, the operation repeats, potentially up to the root itself.
The split operation and allowing for n/2-filled blocks allows for most of the insertions/deletions to only cause local changes instead of re-balancing huge parts of the tree.

How many elements can be held in a B-tree of order n?

Is it 2n? Just checking.
Terminology
The Order of a B-Tree is inconstantly defined in the literature.
(see for example the terminology section of Wikipedia's article on B-Trees)
Some authors consider it to be the minimum number of keys a non-leaf node may hold, while others consider it to be the maximum number of children nodes a non-leaf node may hold (which is one more than the maximum number of keys such a node could hold).
Yet many others skirt around the ambiguity by assuming a fixed length key (and fixed sized nodes), which makes the minimum and maximum the same, hence the two definitions of the order produce values that differ by 1 (as said the number of keys is always one less than the number of children.)
I define depth as the number of nodes found in the search path to a leaf record, and inclusive of the root node and the leaf node. In that sense, a very shallow tree with only a root node pointing directly to leaf nodes has depth 2. If that tree were to grow and require an intermediate level of non-leaf nodes, its depth would be 3 etc.
How many elements can be held in a B-Tree of order n?
Assuming fixed length keys, and assuming that "order" n is defined as the maximum number of child nodes, the answer is:
(Average Number of elements that fit in one Leaf-node) * n ^ (depth - 1)
How do I figure?...:
The data (the "elements") is only held in leaf nodes. So the number of element held is the average number of elements that fit in one node, times the number of leaf nodes.
The number of leaf nodes is itself driven by the number of children that fit in a non-leaf node (the order). For example the non-leaf node just above a leaf node, points to n (the order) leaf-nodes. Then, the non-leaf node above this non-leaf node points to n similar nodes etc, hence "to the power of (depth -1)".
Note that the formula above generally holds using the averages (of key held in a non-leaf node, and of elements held in a leaf node) rather than assuming fixed key length and fixed record length: trees will typically have a node size that is commensurate with the key and record sizes, hence holding a number key or records that is big enough that the effective number of keys or record held in any leaf will vary relatively little compared with the average.
Example:
A tree of depth 4 (a root node, two level of non-leaf nodes and one level [obviously] of leaf nodes) and of order 12 (non-leaf nodes can hold up to 11 keys, hence point to 12 nodes below them) and such that leaf nodes can contain 5 element each, would:
- have its root node point to 12 nodes below it
- each node below it points to 12 nodes below them (hence there will be 12 * 12 nodes in the layer "3" (assuming the root is layer 1 etc., this numbering btw is also ambiguously defined...)
- each node in "layer 3" will point to 12 leaf-nodes (hence there will be 12 * 12 * 12 leaf nodes.
- each leaf node has 5 elements (in this example case)
Hence.. such a tree will hold...
Nb Of Elements in said tree = 5 * 12 * 12 * 12
= 5 * (12 ^ 3)
= 5 * (12 ^ depth -1)
= 8640
Recognize the formula on the 3rd line.
What is generally remarkable for B-Tree, and which makes for their popularity is that a relatively shallow tree (one with a limited number of "hops" between the root and the sought record), can hold a relatively high number record. This number is multiplied by the order at each level.
My book says that the order of a B-tree is the maximum number of pointers that can be stored in a node. (p. 348) The number of "keys" is one less than the order. So a B-tree of order n can hold n-1 elements.
The book is "File Structures", second edition, by Michael J. Folk.
If your formula for the number of elements doesn't include an exponentiation somewhere, you've done it wrong.
A binary tree of order 5 can hold 2^0 + 2^1 + 2^2 + 2^3 + 2^4 elements, so 31 .. (which is 2^order - 1).
Edit:
I appear to have gotten order and depth / length mixed up. What on earth is the order of a binary tree? You appear to discuss B-trees as if they don't, by the very nature of their definition, hold a maximum of two child elements per element.
Let Order of b-tree is 'm' means maximum number of nodes that can be inserted at same level in a b-tree=m-1.After that nodes will splits.
for ex: if order is 3 then only 2 maximum node can be inserted on arrival of 3rd element ,nodes will splits by following the property of binary search tree or self balancing tree.

Difference between a LinkedList and a Binary Search Tree

What are the main differences between a Linked List and a BinarySearchTree? Is BST just a way of maintaining a LinkedList? My instructor talked about LinkedList and then BST but did't compare them or didn't say when to prefer one over another. This is probably a dumb question but I'm really confused. I would appreciate if someone can clarify this in a simple manner.
Linked List:
Item(1) -> Item(2) -> Item(3) -> Item(4) -> Item(5) -> Item(6) -> Item(7)
Binary tree:
Node(1)
/
Node(2)
/ \
/ Node(3)
RootNode(4)
\ Node(5)
\ /
Node(6)
\
Node(7)
In a linked list, the items are linked together through a single next pointer.
In a binary tree, each node can have 0, 1 or 2 subnodes, where (in case of a binary search tree) the key of the left node is lesser than the key of the node and the key of the right node is more than the node. As long as the tree is balanced, the searchpath to each item is a lot shorter than that in a linked list.
Searchpaths:
------ ------ ------
key List Tree
------ ------ ------
1 1 3
2 2 2
3 3 3
4 4 1
5 5 3
6 6 2
7 7 3
------ ------ ------
avg 4 2.43
------ ------ ------
By larger structures the average search path becomes significant smaller:
------ ------ ------
items List Tree
------ ------ ------
1 1 1
3 2 1.67
7 4 2.43
15 8 3.29
31 16 4.16
63 32 5.09
------ ------ ------
A Binary Search Tree is a binary tree in which each internal node x stores an element such that the element stored in the left subtree of x are less than or equal to x and elements stored in the right subtree of x are greater than or equal to x.
Now a Linked List consists of a sequence of nodes, each containing arbitrary values and one or two references pointing to the next and/or previous nodes.
In computer science, a binary search tree (BST) is a binary tree data structure which has the following properties:
each node (item in the tree) has a distinct value;
both the left and right subtrees must also be binary search trees;
the left subtree of a node contains only values less than the node's value;
the right subtree of a node contains only values greater than or equal to the node's value.
In computer science, a linked list is one of the fundamental data structures, and can be used to implement other data structures.
So a Binary Search tree is an abstract concept that may be implemented with a linked list or an array. While the linked list is a fundamental data structure.
I would say the MAIN difference is that a binary search tree is sorted. When you insert into a binary search tree, where those elements end up being stored in memory is a function of their value. With a linked list, elements are blindly added to the list regardless of their value.
Right away you can some trade offs:
Linked lists preserve insertion order and inserting is less expensive
Binary search trees are generally quicker to search
A linked list is a sequential number of "nodes" linked to each other, ie:
public class LinkedListNode
{
Object Data;
LinkedListNode NextNode;
}
A Binary Search Tree uses a similar node structure, but instead of linking to the next node, it links to two child nodes:
public class BSTNode
{
Object Data
BSTNode LeftNode;
BSTNode RightNode;
}
By following specific rules when adding new nodes to a BST, you can create a data structure that is very fast to traverse. Other answers here have detailed these rules, I just wanted to show at the code level the difference between node classes.
It is important to note that if you insert sorted data into a BST, you'll end up with a linked list, and you lose the advantage of using a tree.
Because of this, a linkedList is an O(N) traversal data structure, while a BST is a O(N) traversal data structure in the worst case, and a O(log N) in the best case.
They do have similarities, but the main difference is that a Binary Search Tree is designed to support efficient searching for an element, or "key".
A binary search tree, like a doubly-linked list, points to two other elements in the structure. However, when adding elements to the structure, rather than just appending them to the end of the list, the binary tree is reorganized so that elements linked to the "left" node are less than the current node and elements linked to the "right" node are greater than the current node.
In a simple implementation, the new element is compared to the first element of the structure (the root of the tree). If it's less, the "left" branch is taken, otherwise the "right" branch is examined. This continues with each node, until a branch is found to be empty; the new element fills that position.
With this simple approach, if elements are added in order, you end up with a linked list (with the same performance). Different algorithms exist for maintaining some measure of balance in the tree, by rearranging nodes. For example, AVL trees do the most work to keep the tree as balanced as possible, giving the best search times. Red-black trees don't keep the tree as balanced, resulting in slightly slower searches, but do less work on average as keys are inserted or removed.
Linked lists and BSTs don't really have much in common, except that they're both data structures that act as containers. Linked lists basically allow you to insert and remove elements efficiently at any location in the list, while maintaining the ordering of the list. This list is implemented using pointers from one element to the next (and often the previous).
A binary search tree on the other hand is a data structure of a higher abstraction (i.e. it's not specified how this is implemented internally) that allows for efficient searches (i.e. in order to find a specific element you don't have to look at all the elements.
Notice that a linked list can be thought of as a degenerated binary tree, i.e. a tree where all nodes only have one child.
It's actually pretty simple. A linked list is just a bunch of items chained together, in no particular order. You can think of it as a really skinny tree that never branches:
1 -> 2 -> 5 -> 3 -> 9 -> 12 -> |i. (that last is an ascii-art attempt at a terminating null)
A Binary Search Tree is different in 2 ways: the binary part means that each node has 2 children, not one, and the search part means that those children are arranged to speed up searches - only smaller items to the left, and only larger ones to the right:
5
/ \
3 9
/ \ \
1 2 12
9 has no left child, and 1, 2, and 12 are "leaves" - they have no branches.
Make sense?
For most "lookup" kinds of uses, a BST is better. But for just "keeping a list of things to deal with later First-In-First-Out or Last-In-First-Out" kinds of things, a linked list might work well.
The issue with a linked list is searching within it (whether for retrieval or insert).
For a single-linked list, you have to start at the head and search sequentially to find the desired element. To avoid the need to scan the whole list, you need additional references to nodes within the list, in which case, it's no longer a simple linked list.
A binary tree allows for more rapid searching and insertion by being inherently sorted and navigable.
An alternative that I've used successfully in the past is a SkipList. This provides something akin to a linked list but with extra references to allow search performance comparable to a binary tree.
A linked list is just that... a list. It's linear; each node has a reference to the next node (and the previous, if you're talking of a doubly-linked list). A tree branches---each node has a reference to various child nodes. A binary tree is a special case in which each node has only two children. Thus, in a linked list, each node has a previous node and a next node, and in a binary tree, a node has a left child, right child, and parent.
These relationships may be bi-directional or uni-directional, depending on how you need to be able to traverse the structure.
Linked List is straight Linear data with adjacent nodes connected with each other e.g. A->B->C. You can consider it as a straight fence.
BST is a hierarchical structure just like a tree with the main trunk connected to branches and those branches in-turn connected to other branches and so on. The "Binary" word here means each branch is connected to a maximum of two branches.
You use linked list to represent straight data only with each item connected to a maximum of one item; whereas you can use BST to connect an item to two items. You can use BST to represent a data such as family tree, but that'll become n-ary search tree as there can be more than two children to each person.
A binary search tree can be implemented in any fashion, it doesn't need to use a linked list.
A linked list is simply a structure which contains nodes and pointers/references to other nodes inside a node. Given the head node of a list, you may browse to any other node in a linked list. Doubly-linked lists have two pointers/references: the normal reference to the next node, but also a reference to the previous node. If the last node in a doubly-linked list references the first node in the list as the next node, and the first node references the last node as its previous node, it is said to be a circular list.
A binary search tree is a tree that splits up its input into two roughly-equal halves based on a binary search comparison algorithm. Thus, it only needs a very few searches to find an element. For instance, if you had a tree with 1-10 and you needed to search for three, first the element at the top would be checked, probably a 5 or 6. Three would be less than that, so only the first half of the tree would then be checked. If the next value is 3, you have it, otherwise, a comparison is done, etc, until either it is not found or its data is returned. Thus the tree is fast for lookup, but not nessecarily fast for insertion or deletion. These are very rough descriptions.
Linked List from wikipedia, and Binary Search Tree, also from wikipedia.
They are totally different data structures.
A linked list is a sequence of element where each element is linked to the next one, and in the case of a doubly linked list, the previous one.
A binary search tree is something totally different. It has a root node, the root node has up to two child nodes, and each child node can have up to two child notes etc etc. It is a pretty clever data structure, but it would be somewhat tedious to explain it here. Check out the Wikipedia artcle on it.

Resources