B+ Trees internal nodes - data-structures

I was working through a textbook and got stuck on this question:
"Consider a B+ Tree where each leaf block can contain a maximum of 3 records, each internal block can contain a maximum of 3 keys, all record blocks in the tree are fully occupied with 3 records each and the records have key values: 5,10,15,..., and there are 4 record blocks in the file"
Question: "Draw this tree in a single diagram"
So far I've added all the records at the leaf level. There are 4 blocks with 3 keys each, so 12 values in total, which means my leaf level has all multiples of 5 from 5 to 60. I'm now stuck on what to add on the level above it (the internal block).

You have already done the right thing for the leaf level. Only one internal block is needed: it has 4 pointers to those leaf blocks and 3 keys. Those 3 keys are typically copies of the smallest key in each leaf block except the first, so here they are 20, 35 and 50. No key of the first leaf block (5, 10, 15) is repeated in the internal block.
One way of illustrating this structure is like this:
                 [ 20 | 35 | 50 ]
       /          |          |          \
[5 10 15]   [20 25 30]   [35 40 45]   [50 55 60]
Often the leaf blocks are linked together, in a singly or doubly linked list, although this is not a strict requirement for B+ trees. I have not depicted this above.
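As a small check of the rule described in this answer, the following Python sketch builds the leaf blocks from the question and derives the keys of the single internal block. The function names `build_internal_keys` and `find_leaf` are made up for this illustration, and the lookup convention (a key equal to a separator goes to the right) is an assumption that matches copying the smallest key of each non-first leaf.

```python
import bisect

def build_internal_keys(leaf_blocks):
    """Copy the smallest key of every leaf block except the first."""
    return [block[0] for block in leaf_blocks[1:]]

def find_leaf(separators, leaf_blocks, key):
    """Pick the leaf block that would hold `key`.

    bisect_right sends a key equal to a separator to the block on the
    right, which is the block whose smallest key was copied up.
    """
    return leaf_blocks[bisect.bisect_right(separators, key)]

keys = list(range(5, 65, 5))                       # 5, 10, ..., 60
leaves = [keys[i:i + 3] for i in range(0, len(keys), 3)]
separators = build_internal_keys(leaves)

print(separators)                        # [20, 35, 50]
print(find_leaf(separators, leaves, 35)) # [35, 40, 45]
```

With 4 leaf blocks there are exactly 3 separators, which is why a single internal block with a maximum of 3 keys is enough.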

Related

Why can a B-Tree decrease the number of I/O operations

I'm reading an article about why we need the B-Tree. It says that a B-Tree can reduce the number of I/O operations where other trees, such as a Red-Black tree, can't, and that the number of I/O operations equals the height of the B-Tree.
Here is an example.
We are looking for the value 9. With the B-Tree, there are three I/O operations, but with the binary tree, there may be four.
Now I'm confused. Why can the B-Tree ensure that there are at most three I/O operations? In other words, what ensures that node 3 and node 7 are located in the same disk block? I thought that each node in a B-Tree might be stored as an array, so that its keys are sequential, and sequential data is normally located in the same disk block (honestly, I'm not sure). But it seems that a B-Tree node is stored as a list, which means the keys are not sequential. So, as I understand it, accessing 3 and 7 could take two I/O operations. In that case, can't we say that accessing 9 may also need four I/O operations?
On disk, every B-tree node is stored in a single contiguous block, and a node typically contains hundreds or thousands of keys.
For each key there is a pointer to the corresponding node in the next level. On disk, this "pointer" is the address of the contiguous block that contains the target node.
So, for example, if there are 10^9 leaf-level keys and each node holds about 1000 keys, there could be 1,000,000 leaf-level nodes. On the parent level, there are 1,000,000 keys that point to those nodes, distributed among 1000 parent nodes. On the root level, there are 1000 keys in a single node. A lookup therefore reads only three blocks.
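The arithmetic in this answer can be checked with a short Python sketch. It assumes the simplified model used above (every node holds the same number of keys and each key points to one child); `btree_levels` and `balanced_binary_levels` are names invented for this example.

```python
def btree_levels(num_keys, keys_per_node):
    """Levels (block reads per lookup) when every node holds keys_per_node keys."""
    levels = 1
    nodes = -(-num_keys // keys_per_node)      # ceiling division
    while nodes > 1:
        # One key per child node is needed on the level above.
        nodes = -(-nodes // keys_per_node)
        levels += 1
    return levels

def balanced_binary_levels(num_keys):
    """Levels of a perfectly balanced binary tree with one key per node."""
    return num_keys.bit_length()               # floor(log2(n)) + 1

print(btree_levels(10**9, 1000))       # 3
print(balanced_binary_levels(10**9))   # 30
```

If every binary tree node sat in its own disk block, a lookup could touch up to 30 blocks instead of 3, which is the saving the question asks about.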

Cache-aware tree implementation

I have a tree where every node may have 0 to N children.
Use-case is the following query: Given pointers to two nodes: Are these nodes within the same branch of the tree?
Examples
q(2,7) => true
q(5,4) => false
By the book (slow)
The straightforward implementation would be to store a pointer to the parent and a pointer to a list of children at each node. But this would lead to bad performance, because the tree would be fragmented in memory and therefore not cache-aware.
Question
What would be a good way to represent the tree in compact form? The whole tree has about 100,000 nodes. So it should be possible to find a way to make it fit completely in the CPU-cache.
Binary trees, for example, are often represented implicitly as an array and are therefore well suited to being stored completely in the CPU cache (if small enough).
You can pre-allocate a contiguous block of memory where you concatenate the information for all nodes.
Afterwards, each node would only need a way to retrieve the beginning of its information, and the length of that information.
In this case, the information for each node could be represented by the parent, followed by the list of children (let's assume that we use -1 when there is no parent, i.e. for the root).
For example, for the tree posted in the question, the information for node 1 would be: -1 2 3 4, the information for node 2 is: 1 5, and so on.
The contiguous array would be obtained by concatenating these arrays, resulting in something like:
-1 2 3 4 1 5 1 9 10 1 11 12 13 14 2 3 5 5 5 3 3 4 4 4 15 4
Each node would use some metadata to allow retrieving its associated information. As mentioned, this metadata would consist of a startIndex and a length. E.g. for node 3, we would have startIndex = 6, length = 3, which lets us retrieve the 1 9 10 subarray, indicating that the parent is node 1 and the children are nodes 9 and 10.
In addition, the metadata can also be stored in the contiguous memory block, at the beginning. The metadata has a fixed length for each node (two values), so we can easily compute the position of the metadata for a given node from its index.
In this way, all the information about the tree will be stored in a contiguous, cache-friendly memory block.
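A minimal Python sketch of this layout follows. It is not the answerer's code: the tree used here is a small hypothetical one (1 has children 2 and 3, 2 has child 4, 4 has child 5), the helper names are invented, and the "same branch" query is interpreted as "one node is an ancestor of the other", answered by walking parent entries.

```python
def pack_tree(parent, children):
    """Pack nodes 1..n into one flat list.

    Layout: [start_1, len_1, ..., start_n, len_n, record_1, ..., record_n]
    where record_i = [parent_i, child, child, ...] and the root's parent is -1.
    """
    n = len(parent)
    header, data = [], []
    for node in range(1, n + 1):
        record = [parent[node]] + children.get(node, [])
        header += [2 * n + len(data), len(record)]   # absolute start, length
        data += record
    return header + data

def parent_of(flat, node):
    return flat[flat[2 * (node - 1)]]

def children_of(flat, node):
    start, length = flat[2 * (node - 1)], flat[2 * (node - 1) + 1]
    return flat[start + 1:start + length]

def is_ancestor(flat, a, b):
    """True if a equals b or lies on the path from b up to the root."""
    while b != -1:
        if b == a:
            return True
        b = parent_of(flat, b)
    return False

def same_branch(flat, a, b):
    return is_ancestor(flat, a, b) or is_ancestor(flat, b, a)

# Hypothetical example tree, not the one from the question.
parent = {1: -1, 2: 1, 3: 1, 4: 2, 5: 4}
children = {1: [2, 3], 2: [4], 4: [5]}
flat = pack_tree(parent, children)

print(same_branch(flat, 2, 5))   # True
print(same_branch(flat, 3, 5))   # False
```

A plain Python list is used only to keep the sketch short; in a language with real arrays, the same layout would be a single array of 32-bit integers.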

General B+ Tree split logic

I just want to know whether you would split a leaf node after the insert or before the insert. Let's say the capacity of the leaf is 4 elements and it already holds 3 elements. Would you add the 4th element and immediately split after the insert, so that we now have two nodes holding 2 elements each? Or would you just add the 4th element, so that the leaf is full, and only split when the 5th element is added (which would cause an overflow), resulting in 2 leaf nodes, one holding 2 and one holding 3 elements?
EDIT: I have seen both approaches out there on the web, so I would like to know when to choose solution 1 or 2, or whether one of them is incorrect for some reason.
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
This visualization is very useful to understand B+ tree logic.
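To make the two options from the question concrete, here is a small Python sketch of both policies applied to a single leaf. It only illustrates the difference and does not settle which one a given textbook or implementation uses; the function names are invented, and parent updates are left out.

```python
def _split(keys):
    """Split a sorted key list into a left half and a right half."""
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]

def insert_split_when_full(leaf, key, capacity):
    """Option 1: split as soon as the insert fills the leaf."""
    keys = sorted(leaf + [key])
    return _split(keys) if len(keys) >= capacity else [keys]

def insert_split_on_overflow(leaf, key, capacity):
    """Option 2: split only when the insert exceeds the capacity."""
    keys = sorted(leaf + [key])
    return _split(keys) if len(keys) > capacity else [keys]

print(insert_split_when_full([10, 20, 30], 40, 4))        # [[10, 20], [30, 40]]
print(insert_split_on_overflow([10, 20, 30], 40, 4))      # [[10, 20, 30, 40]]
print(insert_split_on_overflow([10, 20, 30, 40], 50, 4))  # [[10, 20], [30, 40, 50]]
```

Under option 1 a leaf never stays at its full capacity, while under option 2 it can hold all 4 elements before it splits.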

B+ Tree creation

I am trying to understand B+ trees. I have done some reading about it.
There is one thing I am confused about. For the creation of the tree, some articles give no. of keys = n and no. of pointers = n + 1, and some increase both by 1.
For example I have to make a B+ tree of order 3 with
6,2,9,16,12,17,21,18
Here the root shall have 3 numbers and 4 pointers OR 4 numbers and 5 pointers.
The order measures the branching factor. Texts differ on whether it counts the maximum number of child pointers or the maximum number of keys, which is why some articles show n keys with n + 1 pointers and others appear to add 1 to both. When the root is alone, it has one pointer, to its own key. Once more nodes are added, it will have 1 pointer to its key and then n pointers to its children, where n is the order of the tree. In this case the B+ tree root will have one pointer to its number (key) and up to 3 pointers to child nodes, for a total of 4 pointers.
For more on B+ tree creation, look at the insertion section:
B+ Tree Insertion Wikipedia

Difference between a LinkedList and a Binary Search Tree

What are the main differences between a Linked List and a BinarySearchTree? Is a BST just a way of maintaining a LinkedList? My instructor talked about LinkedList and then BST but didn't compare them or say when to prefer one over the other. This is probably a dumb question, but I'm really confused. I would appreciate it if someone could clarify this in a simple manner.
Linked List:
Item(1) -> Item(2) -> Item(3) -> Item(4) -> Item(5) -> Item(6) -> Item(7)
Binary tree:
            Node(1)
           /
      Node(2)
     /     \
    /       Node(3)
RootNode(4)
    \       Node(5)
     \     /
      Node(6)
           \
            Node(7)
In a linked list, the items are linked together through a single next pointer.
In a binary tree, each node can have 0, 1 or 2 subnodes, where (in the case of a binary search tree) the key of the left node is less than the key of the node and the key of the right node is greater than the key of the node. As long as the tree is balanced, the search path to each item is a lot shorter than in a linked list.
Search paths:
------ ------ ------
key    List   Tree
------ ------ ------
1      1      3
2      2      2
3      3      3
4      4      1
5      5      3
6      6      2
7      7      3
------ ------ ------
avg    4      2.43
------ ------ ------
With larger structures, the average search path of the tree becomes significantly shorter than that of the list:
------ ------ ------
items  List   Tree
------ ------ ------
1      1      1
3      2      1.67
7      4      2.43
15     8      3.27
31     16     4.16
63     32     5.10
------ ------ ------
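The averages in these tables can be recomputed with a few lines of Python. The sketch assumes a list searched from its head and a perfectly balanced tree in which every level is full; the function names are invented for this example.

```python
def avg_list_path(n):
    """Average number of nodes visited to reach one of n items from the head."""
    return sum(range(1, n + 1)) / n

def avg_tree_path(levels):
    """Average number of nodes visited in a tree with `levels` full levels.

    Level d (counting the root as level 1) holds 2**(d - 1) nodes,
    and each of them is reached after visiting d nodes.
    """
    n = 2 ** levels - 1
    total = sum(depth * 2 ** (depth - 1) for depth in range(1, levels + 1))
    return total / n

for levels in range(1, 7):
    n = 2 ** levels - 1
    print(n, avg_list_path(n), round(avg_tree_path(levels), 2))
```

The list average grows linearly with the number of items, while the tree average grows by roughly one step each time the number of items doubles.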
A Binary Search Tree is a binary tree in which each internal node x stores an element such that the elements stored in the left subtree of x are less than or equal to x and the elements stored in the right subtree of x are greater than or equal to x.
Now a Linked List consists of a sequence of nodes, each containing arbitrary values and one or two references pointing to the next and/or previous nodes.
In computer science, a binary search tree (BST) is a binary tree data structure which has the following properties:
each node (item in the tree) has a distinct value;
both the left and right subtrees must also be binary search trees;
the left subtree of a node contains only values less than the node's value;
the right subtree of a node contains only values greater than the node's value.
In computer science, a linked list is one of the fundamental data structures, and can be used to implement other data structures.
So a binary search tree is an abstract concept that may be implemented with linked nodes or an array, while the linked list is a fundamental data structure.
I would say the MAIN difference is that a binary search tree is sorted. When you insert into a binary search tree, where those elements end up being stored in memory is a function of their value. With a linked list, elements are blindly added to the list regardless of their value.
Right away you can see some trade-offs:
Linked lists preserve insertion order and inserting is less expensive
Binary search trees are generally quicker to search
A linked list is a sequence of "nodes" linked to each other, i.e.:
public class LinkedListNode
{
    Object Data;              // the stored item
    LinkedListNode NextNode;  // link to the next node, or null at the end of the list
}
A Binary Search Tree uses a similar node structure, but instead of linking to the next node, it links to two child nodes:
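public class BSTNode
{
    Object Data;        // the stored item (the key used for comparisons)
    BSTNode LeftNode;   // subtree with smaller keys, or null
    BSTNode RightNode;  // subtree with larger keys, or null
}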
By following specific rules when adding new nodes to a BST, you can create a data structure that is very fast to search. Other answers here have detailed these rules; I just wanted to show the difference between the node classes at the code level.
It is important to note that if you insert sorted data into a BST, you'll end up with a linked list, and you lose the advantage of using a tree.
Because of this, searching a linked list is O(N), while searching a BST is O(N) in the worst case and O(log N) when the tree is balanced.
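The effect of inserting sorted data can be seen with a small Python sketch of an unbalanced BST. This is a plain textbook-style insert with no rebalancing; the names `Node`, `insert`, `build` and `height` are chosen for this example only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key without any rebalancing and return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = new
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = new
                return root
            cur = cur.right

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

def height(root):
    """Number of levels, counted level by level to avoid deep recursion."""
    levels, frontier = 0, [root] if root else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c]
    return levels

print(height(build([1, 2, 3, 4, 5, 6, 7])))   # 7: every node has only a right child
print(height(build([4, 2, 6, 1, 3, 5, 7])))   # 3: same keys, balanced shape
```

The same seven keys give a chain of 7 levels or a tree of 3 levels depending only on insertion order, which is why self-balancing variants exist.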
They do have similarities, but the main difference is that a Binary Search Tree is designed to support efficient searching for an element, or "key".
A node in a binary search tree, like a node in a doubly-linked list, points to two other elements in the structure. However, when adding elements to the structure, rather than just appending them to the end of the list, the binary tree is organized so that elements linked to the "left" node are less than the current node and elements linked to the "right" node are greater than the current node.
In a simple implementation, the new element is compared to the first element of the structure (the root of the tree). If it's less, the "left" branch is taken, otherwise the "right" branch is examined. This continues with each node, until a branch is found to be empty; the new element fills that position.
With this simple approach, if elements are added in order, you end up with a linked list (with the same performance). Different algorithms exist for maintaining some measure of balance in the tree, by rearranging nodes. For example, AVL trees do the most work to keep the tree as balanced as possible, giving the best search times. Red-black trees don't keep the tree as balanced, resulting in slightly slower searches, but do less work on average as keys are inserted or removed.
Linked lists and BSTs don't really have much in common, except that they're both data structures that act as containers. Linked lists basically allow you to insert and remove elements efficiently at any location in the list, while maintaining the ordering of the list. This list is implemented using pointers from one element to the next (and often the previous).
A binary search tree, on the other hand, is a data structure of a higher abstraction (i.e. it's not specified how it is implemented internally) that allows for efficient searches (i.e. in order to find a specific element you don't have to look at all the elements).
Notice that a linked list can be thought of as a degenerate binary tree, i.e. a tree where every node has at most one child.
It's actually pretty simple. A linked list is just a bunch of items chained together, in no particular order. You can think of it as a really skinny tree that never branches:
1 -> 2 -> 5 -> 3 -> 9 -> 12 -> |i. (that last is an ascii-art attempt at a terminating null)
A Binary Search Tree is different in 2 ways: the binary part means that each node can have up to 2 children, not one, and the search part means that those children are arranged to speed up searches, with only smaller items to the left and only larger ones to the right:
      5
     / \
    2   9
   / \   \
  1   3   12
9 has no left child, and 1, 3, and 12 are "leaves" - they have no branches.
Make sense?
For most "lookup" kinds of uses, a BST is better. But for just "keeping a list of things to deal with later First-In-First-Out or Last-In-First-Out" kinds of things, a linked list might work well.
The issue with a linked list is searching within it (whether for retrieval or insert).
For a single-linked list, you have to start at the head and search sequentially to find the desired element. To avoid the need to scan the whole list, you need additional references to nodes within the list, in which case, it's no longer a simple linked list.
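The sequential scan described here looks roughly like the following Python sketch. `ListNode`, `from_values` and `find` are illustrative names, and the values are arbitrary.

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def from_values(values):
    """Build a singly linked list and return its head."""
    head = None
    for value in reversed(list(values)):
        head = ListNode(value, head)
    return head

def find(head, target):
    """Return how many nodes were visited to reach target, or None if absent."""
    steps = 0
    node = head
    while node is not None:
        steps += 1
        if node.value == target:
            return steps
        node = node.next
    return None

head = from_values([1, 2, 5, 3, 9, 12])
print(find(head, 9))   # 5
print(find(head, 7))   # None
```

A miss always costs a full pass over the list, since nothing about the order of the items lets the scan stop early.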
A binary tree allows for more rapid searching and insertion by being inherently sorted and navigable.
An alternative that I've used successfully in the past is a SkipList. This provides something akin to a linked list but with extra references to allow search performance comparable to a binary tree.
A linked list is just that... a list. It's linear; each node has a reference to the next node (and the previous, if you're talking about a doubly-linked list). A tree branches: each node has a reference to various child nodes. A binary tree is a special case in which each node has at most two children. Thus, in a doubly-linked list, each node has a previous node and a next node, and in a binary tree, a node has a left child, a right child, and a parent.
These relationships may be bi-directional or uni-directional, depending on how you need to be able to traverse the structure.
Linked List is straight Linear data with adjacent nodes connected with each other e.g. A->B->C. You can consider it as a straight fence.
BST is a hierarchical structure just like a tree with the main trunk connected to branches and those branches in-turn connected to other branches and so on. The "Binary" word here means each branch is connected to a maximum of two branches.
You use a linked list to represent linear data only, with each item connected to at most one following item, whereas in a BST an item can be connected to two items. You could use a tree to represent data such as a family tree, but that would be an n-ary tree rather than a binary one, as a person can have more than two children.
A binary search tree can be implemented in any fashion; it doesn't need to use a linked list.
A linked list is simply a structure which contains nodes and pointers/references to other nodes inside a node. Given the head node of a list, you may browse to any other node in a linked list. Doubly-linked lists have two pointers/references: the normal reference to the next node, but also a reference to the previous node. If the last node in a doubly-linked list references the first node in the list as the next node, and the first node references the last node as its previous node, it is said to be a circular list.
A binary search tree, when balanced, is a tree that splits up its input into two roughly equal halves based on a binary search comparison. Thus, it only needs very few comparisons to find an element. For instance, if you had a tree with 1-10 and you needed to search for three, first the element at the top would be checked, probably a 5 or 6. Three would be less than that, so only the first half of the tree would then be checked. If the next value is 3, you have it; otherwise, another comparison is done, and so on, until either the element is not found or its data is returned. Thus the tree is fast for lookup, but not necessarily fast for insertion or deletion. These are very rough descriptions.
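The lookup of 3 in a tree holding 1-10 can be traced with a short Python sketch. It assumes the tree is built balanced by always picking the middle element as the root, which is one possible shape among several; nodes are plain `(key, left, right)` tuples, and the function names are invented.

```python
def build_balanced(sorted_keys):
    """Build a balanced BST as nested (key, left, right) tuples."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return (sorted_keys[mid],
            build_balanced(sorted_keys[:mid]),
            build_balanced(sorted_keys[mid + 1:]))

def search_path(tree, target):
    """Return the keys compared against target and whether it was found."""
    path = []
    while tree is not None:
        key, left, right = tree
        path.append(key)
        if target == key:
            return path, True
        tree = left if target < key else right
    return path, False

tree = build_balanced(list(range(1, 11)))
print(search_path(tree, 3))    # ([6, 3], True)
print(search_path(tree, 11))   # ([6, 9, 10], False)
```

With this particular shape the root is 6 and the value 3 is found on the second comparison; a different balanced shape would give a different but similarly short path.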
Linked List from wikipedia, and Binary Search Tree, also from wikipedia.
They are totally different data structures.
A linked list is a sequence of elements where each element is linked to the next one and, in the case of a doubly linked list, to the previous one.
A binary search tree is something totally different. It has a root node, the root node has up to two child nodes, each child node can have up to two child nodes, and so on. It is a pretty clever data structure, but it would be somewhat tedious to explain it here. Check out the Wikipedia article on it.
