What's a B*Tree? - data-structures

What's a B*Tree? Did they just mean binary search tree?

Nope. Note that the * indicates the nodes are at least 2/3 full.

No. A node in a B*Tree can have many keys (which point to many children). They operate by comparing keys in order to select a child node, much like a binary tree. But, the intent is that each node is stored on disk, and can be read into memory at once. Thus, the number of disk accesses required would match the depth of the tree.

Related

What are the maximum number of duplicates for a given key in a B+-tree

Given a B+-tree with a branching factor of g and s being the number of keys contained, what would be the maximum number of duplicates allowed for a single key lets call this si? And how do we calculate that number?
My first idea was to say that each level can have have one instance of si, so my idea would be to maximise the depth, which would be our answer, however I'm not sure about this.
I have searched online but it seems no one has asked this question before, first time asking a question here so any feedback is welcome.
Many thanks.
A proper B-tree cannot contain duplicate 'keys' because the values in question would not uniquely identify their associated data and hence they would not be keys.
The reason is that B-trees do not distinguish between nodes for routing searches (internal nodes) and nodes for storing data (leaf nodes) as some other data structures do. All nodes in a classic B-tree are data-bearing, and the internal nodes just happen to also route search traffic.
B+ trees, on the other hand, store all data in leaf nodes and use the nodes above the leaf level (a.k.a. 'index layer') only for routing. The values in internal nodes - a.k.a. 'separator keys' - only have to guide searches but they do not have to uniquely identify any data records. They do not even have to correspond to any actually existing key values (i.e. ones that have associated data). That makes it often possible to shorten the separator keys drastically, as long as they keep separating the same things. For example, "F" is just as effective at separting "Arthur Dent" from "Zaphod Beeblebrox" as "Ford Prefect" is.
One consequence is that one and the same value could potentially occur at each and every level of the B+ tree without any ill effect, since only the one and only occurrence at the leaf level actually works as a data key; the ones in internal nodes serve only to guide searches.
Another consequence is that internal nodes in B+ trees can usually hold orders of magnitude more keys than internal nodes in B-trees (which are diluted by record data and do not allow shortening of key values). This increases I/O and cache efficiency, and it usually makes B+ trees more shallow than B-trees containing the same data.
As for calculating the height: outside of homework assignments you actually need two node capacity values to describe a B+ tree, one for the maximum number of separator keys in an internal node and one for the maximum number of keys (data records) in a leaf node. Basically, you divide the number of records by the leaf node capacity to determine the number of 'outputs' that the index layer needs to have, and then feed that into the usual B-tree height formula in order to determine the height of the index layer.

Implementing the Rope data structure using binary search trees (splay trees)

In a standard implementation of the Rope data structure using splay trees, the nodes would be ordered according to a rank statistic measuring the position of each one from the start of the string, so the keys normally found in binary search tree would be irrelevant, would they not?
I ask because the keys shown in the graphic below (thanks Wikipedia!) are letters, which would presumably become non-unique once the number of nodes exceeded the length of the chosen alphabet. Wouldn't it be better to use integers or avoid using keys altogether?
Separately, can anyone point me to a good implementation of the logic to recompute rank statistics after each operation?
Presumably, if the index for a split falls within the substring attached to a particular node, say, between "Hel" and "llo_" on the node E above, you would remove the substring from E, split it and reattach it as two children of E. Correct?
Finally, after a certain number of such operations, the tree could, I suppose, end up with as many leaves as letters. What would be the best way to keep track of that and prune the tree (by combining substrings) as necessary?
Thanks!
For what it's worth, you can implement a Rope using Splay Trees by attaching a substring to each node of the binary search tree (not just to the leaf nodes as shown above).
The rank of each node is its size plus the size of its left subtree. But when recomputing ranks during splay operations, you need to remember to walk down the node.left.right branch, too.
If each node records a reference to the substring it represents (cf. the actual substring itself), everything runs faster. That way when a split operation falls within an existing node, you just need to modify the node's attributes to reflect the right part of the substring you want to split, then add another node to represent the left part and merge it with the left subtree.
Done as above, each node records (in addition its left, right and parent attributes etc.) its rank, size (in characters) and the location of the first character it represents in the string you're trying to modify. That way, you never actually modify the initial string: you just do your operations on bits of the tree and reproduce the final string when you're ready by walking it in order.

Why B-Tree for file systems?

I know this is a common question and I saw a few threads in Stack Overflow but still couldn't get it.
Here is an accepted answer from Stack overflow:
" Disk seeks are expensive. B-Tree structure is designed specifically to
avoid disk seeks as much as possible. Therefore B-Tree packs much more
keys/pointers into a single node than a binary tree. This property
makes the tree very flat. Usually most B-Trees are only 3 or 4 levels
deep and the root node can be easily cached. This requires only 2-3
seeks to find anything in the tree. Leaves are also "packed" this way,
so iterating a tree (e.g. full scan or range scan) is very efficient,
because you read hundreds/thousands data-rows per single block (seek).
In binary tree of the same capacity, you'd have several tens of levels
and sequential visiting every single value would require at least one
seek. "
I understand that B-Tree has more nodes (Order) than a BST. So it's definitely flat and shallow than a BST.
But these nodes are again stored as linked lists right?
I don't understand when they say that the keys are read as a block thereby minimising the no of I/Os.
Isn't the same argument hold good for BSTs too? Except that the links will be downwards?
Please someone explain it to me?
I understand that B-Tree has more nodes (Order) than a BST. So it's definitely flat and shallow than a BST. I don't understand when they say that the keys are read as a block thereby minimising the no of I/Os.
Isn't the same argument hold good for BSTs too? Except that the links will be downwards?
Basically, the idea behind using a B+tree in file systems is to reduce the number of disk reads. Imagine that all the blocks in a drive are stored as a sequentially allocated array. In order to search for a specific block you would have to do a linear scan and it would take O(n) every time to find a block. Right?
Now, imagine that you got smart and decided to use a BST, great! You would store all your blocks in a BST an that would take roughly O(log(n)) to find a block. Remember that every branch is a disk access, which is highly expensive!
But, we can do better! The problem now is that a BST is really "tall". Because every node only has a fanout (number of children) factor of 2, if we had to store N objects, our tree would be in the order of log(N) tall. So we would have to perform at most log(N) access to find our leaves.
The idea behind the B+tree structure is to increase the fanout factor (number of children), reducing the height of tree and, thus, reducing the number of disk access that we have to make in order to find a leave. Remember that every branch is a disk access. For instance, if you pack X keys in a node of a B+tree every node will point to at most X+1 children.
Also, remember that a B+tree is structured in a way that only the leaves store the actual data. That way, you can pack more keys in the internal nodes in order to fill up one disk block, that, for instance, stores one node of a B+tree. The more keys you pack in a node the more children it will point to and the shorter your tree will be, thus reducing the number of disk access in order to find one leave.
But these nodes are again stored as linked lists right?
Also, in a B+tree structure, sometimes the leaves are stored in a linked list fashion. Remember that only the leaves store the actual data. That way, with the linked list idea, when you have to perform a sequential access after finding one block you would do it faster than having to traverse the tree again in order to find the next block, right? The problem is that you still have to find the first block! And for that, the B+tree is way better than the linked list.
Imagine that if all the accesses were sequential and started in the first block of the disk, an array would be better than the linked list, because in a linked list you still have to deal with the pointers.
But, the majority of disk accesses, according to Tanenbaum, are not sequential and are accesses to files of small sizes (like 4KB or less). Imagine the time it would take if you had to traverse a linked list every time to access one block of 4KB...
This article explains it way better than me and uses pictures as well:
https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees
A B-tree node is essentially an array, of pairs {key, link}, of a fixed size which is read in one chunk, typically some number of disk blocks. The links are all downwards. At the bottom layer the links point to the associated records (assuming a B+-tree, as in any practical implementation).
I don't know where you got the linked list idea from.
Each node in a B-tree implemented in disk storage consists of a disk block (normally a handful of kilobytes) full of keys and "pointers" that are accessed as an array and not - as you said - a linked list. The block size is normally file-system dependent and chosen to use the file system's read and write operations efficiently. The pointers are not normal memory pointers, but rather disk addresses, again chosen to be easily used by the supporting file system.
The main reason for B-tree is how it behaves on changes. If you have permanent structure, BST is OK, but in that case Hash function is even better. In case of file systems, you want a structure which changes as a whole as little as possible on inserts or deletes, and where you can perform find operation with as little reads as possible - these properties have B-trees.

Is there only one correct answer in heapsort?

If starting with an empty heap representing a priority queue, where numbers have to be inserted in order and then represented as a binary tree, is there only one strict answer to that? I have tried different Java heap generators etc. and they are all giving me different answers.
If you mean the sorted representation, then it's obviously unique.
If you mean the binary tree representation, then yes it's unique too - a heap is a complete tree - binary tree in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible.
After all heap operations, may they be insert, delete-max, build-heap, siftup, siftdown, the heap remains in a predictable state, we can know how it's binary tree representation will look like.
Can you give more details about how you got different answers?

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B trees. Even when I looked up CLRS (Intro to Algorithms), it seemed to make to use of arrays for keys and children. My question is- is there any way to reduce this wastage in space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to get a decent psedocode on btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
B-Tree node has an important characteristic, all keys in the node is sorted. When finding a specific key, binary search is used to find the right position. Using binary search keeps the complexity of search algorithm in B-Tree O(logn).
If you replace the preallocated array with some kind of linked list, you lost the ordering. Unless you use some complex data structures, like skip list, to keep the search algorithm with O(logn). But it's totally unnecessary, skip list itself is better.

Resources