What is the maximum number of duplicates for a given key in a B+-tree?

Given a B+-tree with a branching factor of g and s keys in total, what is the maximum number of duplicates allowed for a single key (let's call it s_i), and how do we calculate that number?
My first idea was that each level can hold one instance of s_i, so maximising the depth would give the answer, but I'm not sure about this.
I have searched online but it seems no one has asked this question before. This is my first time asking a question here, so any feedback is welcome.
Many thanks.

A proper B-tree cannot contain duplicate 'keys' because the values in question would not uniquely identify their associated data and hence they would not be keys.
The reason is that B-trees do not distinguish between nodes for routing searches (internal nodes) and nodes for storing data (leaf nodes) as some other data structures do. All nodes in a classic B-tree are data-bearing, and the internal nodes just happen to also route search traffic.
B+ trees, on the other hand, store all data in leaf nodes and use the nodes above the leaf level (a.k.a. the 'index layer') only for routing. The values in internal nodes - a.k.a. 'separator keys' - only have to guide searches; they do not have to uniquely identify any data records. They do not even have to correspond to any actually existing key values (i.e. ones that have associated data). That often makes it possible to shorten the separator keys drastically, as long as they keep separating the same things. For example, "F" is just as effective at separating "Arthur Dent" from "Zaphod Beeblebrox" as "Ford Prefect" is.
One consequence is that one and the same value could potentially occur at each and every level of the B+ tree without any ill effect, since only the one and only occurrence at the leaf level actually works as a data key; the ones in internal nodes serve only to guide searches.
Another consequence is that internal nodes in B+ trees can usually hold orders of magnitude more keys than internal nodes in B-trees (which are diluted by record data and do not allow shortening of key values). This increases I/O and cache efficiency, and it usually makes B+ trees more shallow than B-trees containing the same data.
As for calculating the height: outside of homework assignments you actually need two node capacity values to describe a B+ tree, one for the maximum number of separator keys in an internal node and one for the maximum number of keys (data records) in a leaf node. Basically, you divide the number of records by the leaf node capacity to determine the number of 'outputs' that the index layer needs to have, and then feed that into the usual B-tree height formula in order to determine the height of the index layer.
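As a rough illustration of that calculation, here is a minimal Python sketch. The names leaf_capacity and fanout are illustrative, and minimum fill factors are ignored, so treat it as an estimate rather than an exact formula for any particular implementation:

    import math

    def bplus_tree_height(num_records, leaf_capacity, fanout):
        # Number of leaves the index layer must be able to reach ('outputs').
        num_leaves = math.ceil(num_records / leaf_capacity)
        if num_leaves <= 1:
            return 0  # the root is itself a leaf
        # Each internal level multiplies the reachable leaves by at most `fanout`.
        return math.ceil(math.log(num_leaves, fanout))

    # Example: 1,000,000 records, 100 keys per leaf, fan-out 200
    # -> 10,000 leaves -> an index layer of height 2.
    bplus_tree_height(1_000_000, 100, 200)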

Related

Most performant way to find all the leaf nodes in a tree data structure

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!
Memory layout is essential for optimal retrieval, so the child lists should be contiguous arrays rather than linked lists, and the nodes should be placed one after another in retrieval order.
The more static your tree is, the better the layout you can achieve.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use offsets relative to themselves to find their children
Con
inserting / deleting can be cumbersome
insert / delete is O(N)
an insert might lead to a resize of the array, causing a costly copy
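A minimal Python sketch of this all-in-one layout, assuming each source node is a dict with "value" and "children" keys (the names are illustrative); children are located via offsets relative to the parent's own slot:

    def flatten(root, out=None):
        # Append root and its subtree to `out` in depth-first (retrieval) order.
        if out is None:
            out = []
        index = len(out)
        slot = {"value": root["value"], "child_offsets": []}
        out.append(slot)
        for child in root.get("children", []):
            # len(out) is where the child is about to be placed.
            slot["child_offsets"].append(len(out) - index)
            flatten(child, out)
        return out

    def leaves(flat):
        # A slot with no child offsets is a leaf.
        return [s["value"] for s in flat if not s["child_offsets"]]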
Two array layout
One array for internal nodes
One array for leaves
Internal nodes point to the leaves
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly interested in the leaves).
no unneeded page lookups
indirect lookups can be made
Con
if all leaves are ordered, insert / delete can be cumbersome
if leaves are unordered, insertion is easy: just add at the end.
deleting unordered leaves is also a problem if no tombstones are allowed, as the last leaf would have to be moved into the gap and the internal nodes would need fixing up (via a further indirection this can also be solved, see slot-map)
resizing either array might lead to a large copy, though a smaller one than in the all-in-one layout, since the two arrays can be resized independently.
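A minimal sketch of the two-array layout, under the assumption that each internal node records the slice of the leaf array its subtree covers (node shapes and field names are made up for illustration):

    def build_two_arrays(node, internal=None, leaf_values=None):
        if internal is None:
            internal, leaf_values = [], []
        children = node.get("children", [])
        if not children:
            leaf_values.append(node["value"])  # leaf payloads go in their own array
            return internal, leaf_values
        entry = {"value": node["value"], "leaf_start": len(leaf_values), "leaf_end": None}
        internal.append(entry)
        for child in children:
            build_two_arrays(child, internal, leaf_values)
        entry["leaf_end"] = len(leaf_values)  # exclusive end of this subtree's leaves
        return internal, leaf_values

    # All leaves of the whole tree are simply leaf_values, streamed contiguously.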
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while this removes much of the extra work of linked-list children, the individual lists are dispersed among all the other data, making lookups take extra time.
an insert might cause a resize and copy of an array.
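A Python analogue of the vector-of-vectors idea (each node keeps its children in its own contiguous list), with an iterative depth-first traversal collecting the leaves; the node shape is an assumption:

    def collect_leaves(root):
        # Iterative DFS: visits every node once, so it is O(n) overall.
        result, stack = [], [root]
        while stack:
            node = stack.pop()
            children = node.get("children", [])
            if not children:
                result.append(node["value"])
            else:
                stack.extend(reversed(children))  # preserve left-to-right order
        return result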
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things, plus the branch nodes along the way. The constant overhead is the branch nodes.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.

non-binary tree search and insertion

I searched a bit but haven't found the answer to this question.
I built a non-binary tree, so each node can have any number of children (called an n-ary tree, I think).
To help with searching, I gave every node a number when I built the tree, so that every node's children have bigger numbers, and all the nodes to the right of it have bigger numbers as well.
something like this:
This way I get log n time for search.
The problem comes when I want to insert nodes. This model would not work if I want to insert nodes anywhere but the end.
I thought of a few ways to do it,
insert the new node at the desired location, then update the numbers of all the nodes "behind" it.
instead of using a single number to represent each node, use an array of numbers. The numbers in the array will represent its location on that specific level. For example, node 1 will be {0}. Node 9 will be {0,2}. Node 7 will be {0, 0, 1, 2}. Now when inserting, I just need to change the numbers on that level.
forget all the numbering and just compare every node until the correct one is found. Insertion doesn't need to care about numbers either.
My question is, which way would be better? I'm not sure if using an array of integers to represent each node is very fast.. maybe it is still faster than the first way? Are there other ways of going about this?
Thank you in advance.
I gather that the problem you have is to assign a unique identifier to each node in such a way that you can find the node given its unique id in sublinear time.
This is not usually a problem for transient (in-memory) data structures, because typical tree implementations never move a node in memory (until it is deleted). That lets you use the address of the node as a unique identifier, which provides O(1) access. Languages other than C dress this up in an object like a tree iterator or node reference, but under the hood the principle is the same.
However, there are times when you really need to be able to attach a fixed-for-all-time identifier to a tree node, in a way which will be resilient against, for example, persisting the tree to permanent storage and then deserializing it in a different executable image.
One well-known hack is to use floating-point ids. When a new node is inserted, its id is assigned to be the average of its immediate neighbours. For the purpose of this computation, we pretend that there is a node on the left of the tree with id 0.0 and a node to the right with id 1.0, so every node has two neighbours, even if it is the new left- or right-most node. In particular, the root node is given the id 0.5, which is the average of the 0.0 and 1.0 imaginary boundary nodes.
Unfortunately, floating point precision is not infinite, and this hack works best if insertions are always at random places in the tree. If you insert all the nodes at the end, you will rapidly exhaust the floating point precision. You could periodically renumber all the nodes, but that defeats the purpose of having a permanent unchanging unique node id. (For some problem domains, it's acceptable, though.)
Of course, you don't really have to use floating point. A double on standard architecture has 53 bits of precision, which is plenty if your insertions are stochastic and very little if you always insert at the same place; you can use all 64 bits of an unsigned 64-bit integer by (conceptually) locating a fixed binary point prior to the high-order bit. The average computation works the same, except that you need to special case the computation with the 1.0 value.
This is essentially the same as your idea of labelling the nodes with a vector of indices. That scheme has the advantage of never running out of precision, and the disadvantage that the vectors can get quite long. You could also use a hybrid solution where you start a new level only when you run out of precision with the current level.
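A minimal sketch of the floating-point variant described above; renumbering and precision exhaustion are not handled, and the function name is made up:

    def new_id(left_id=None, right_id=None):
        # Average of the immediate neighbours' ids, with imaginary boundary
        # nodes at 0.0 (far left) and 1.0 (far right).
        lo = 0.0 if left_id is None else left_id
        hi = 1.0 if right_id is None else right_id
        return (lo + hi) / 2.0

    root_id  = new_id()                # 0.5
    left_id  = new_id(None, root_id)   # 0.25, new left-most node
    right_id = new_id(root_id, None)   # 0.75, new right-most node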

How to calculate that a B+ tree is O(log(n)) for lookups

I'm studying B+ trees for indexing and I'm trying to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+ tree form an index on the leaves and the leaves contain pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+ tree is so much better than a binary tree, why don't we use B+ trees instead of binary trees everywhere?
I read the wikipedia article on B+ trees and I understand the structure but not how an actual lookup is performed. Could you guide me perhaps with some link to reading material?
What are some other uses of B+ trees besides database indexing?
I'm studying B+ trees for indexing and I'm trying to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+ tree form an index on the leaves and the leaves contain pointers to where the data is stored on disk. Correct?
No, the index is formed by the inner nodes (non-leaves). Depending on the implementation the leaves may contain either key/value pairs or key/pointer to value pairs. For example, a database index uses the latter, unless it is an IOT (Index Organized Table) in which case the values are inlined in the leaves. This depends mainly on whether the value is insanely large wrt the key.
Then how are lookups made?
In the general case where the root node is not a leaf (it does happen, at first), the root node contains a sorted array of N keys and N+1 pointers. You binary search for the two keys S0 and S1 such that S0 <= K < S1 (where K is what you are looking for) and this gives you the pointer to the next node.
You repeat the process until you (finally) hit a leaf node, which contains a sorted list of key/value pairs, and make a last binary search pass over those.
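A rough sketch of that descent, assuming internal nodes are dicts with sorted "keys" and "children" lists (len(children) == len(keys) + 1) and leaves hold "keys" and "values"; these shapes are assumptions for illustration, not any specific library's API:

    import bisect

    def bplus_lookup(node, key):
        # Descend through internal nodes: binary search picks the child pointer.
        while node.get("children"):
            i = bisect.bisect_right(node["keys"], key)  # first separator > key
            node = node["children"][i]
        # Final binary search in the leaf's sorted key list.
        i = bisect.bisect_left(node["keys"], key)
        if i < len(node["keys"]) and node["keys"][i] == key:
            return node["values"][i]
        return None  # not found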
If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
Binary trees are simpler to implement. One tough cookie with B+ trees is sizing the number of keys/pointers in inner nodes and the number of key/value pairs in leaf nodes. Another tough cookie is deciding on the low and high watermarks that lead to merging two nodes or splitting one.
Binary trees also offer memory stability: an inserted element is not moved, at all, in memory. On the other hand, inserting an element into a B+ tree or removing one is likely to lead to elements being shuffled around.
B+Trees are tailored for small keys/large values cases. They also require that keys can be duplicated (hopefully cheaply).
Could you guide me perhaps with some link to reading material?
I hope the rough algorithm I explained helped out, otherwise feel free to ask in the comments.
What are some other uses of B+ trees besides database indexing?
In the same vein: file-system indexing also benefits.
The idea is always the same: a B+ tree is really great with small keys/large values and caching. The idea is to have all the keys (the inner nodes) in your fast memory (CPU cache >> RAM >> disk), and the B+ tree achieves that for large collections by pushing the values down to the leaves, so the inner nodes stay small. With all inner nodes in fast memory, you only have one slow memory access per search (to fetch the value).
B+ trees are better than binary trees; all the DBMSs use them.
A lookup in a B+ tree is O(log_F N), where F, the fan-out, is the base of the logarithm. The lookup is performed exactly like in a binary tree, but with a bigger fan-out and a lower height; that's why it is way better.
B+ trees are usually known for having the data in the leaves (though probably not if the index is unclustered), which means you don't have to make another jump to disk to get the data; you just take it from the leaf.
B+ trees are used almost everywhere: operating systems use them, data warehouses (not so much here, but still), lots of applications.
B+ trees are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with low cardinality.
If you can get this book http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638 it's one of the best. It's basically the bible for any database guy.

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked it up in CLRS (Intro to Algorithms), it seemed to make use of arrays for keys and children. My question is: is there any way to reduce this waste of space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
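As a rough worked example (assuming 64-bit pointers and ignoring allocator overhead): an array slot holding a pointer costs 8 bytes whether it is used or not, while each singly-linked list link holding the same pointer costs 16 bytes. A 16-slot array therefore costs 128 bytes regardless of fill, and the linked list only wins while fewer than 8 of those slots would be occupied; any per-object overhead shifts the break-even point further in the array's favour.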
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are kept sorted. When looking for a specific key, binary search is used to find the right position. Using binary search keeps the complexity of the search algorithm in a B-tree at O(log n).
If you replace the preallocated array with some kind of linked list, you lose the cheap random access that binary search needs, unless you use a more complex data structure, like a skip list, to keep the search algorithm at O(log n). But that is totally unnecessary; a skip list by itself would already do the job.

What are the advantages of storing all elements in the leaf nodes?

I'm reading Advanced Data Structures by Peter Brass.
In the beginning of the chapter on search trees, he states that there are two models of search trees: one where nodes contain the actual objects (the values, if the tree is used as a dictionary), and another where all objects are stored in leaves and internal nodes are only for comparisons.
What are the advantages of the second model over the first one?
One of the big advantages of a binary tree where data is only in the leaf nodes is that you can partition based on elements that are not in your dataset.
For example, if I have a possible dataset of 0-1 million, but the vast majority of items are either at the high end or the low end and not in the middle, I may still want my first comparison to be against 500,000, even though that number is not in my data set. If every node had to hold data, I could not do this. While not normally needed in theory, I've run into many cases where partitioning based on a value outside my data simplified the implementation.
B+ trees are an example of a case where all keys/values are stored in leaf nodes. The primary advantage here is that since all items are in the leaf nodes, the leaf nodes can be linked together to form a linked list, which allows rapid in-order traversal. If you access a particular element, you can always find the next element in the sequence without visiting any parents, because the leaf nodes are linked together. Filesystems and database storage systems can take advantage of this structure for range searches and similar operations.
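A hedged sketch of the range scan that the linked leaves make possible; the node and leaf shapes (sorted "keys", "children", "values", "next_leaf") are assumptions for illustration:

    import bisect

    def range_scan(root, low, high):
        # Descend to the leaf that should contain `low`.
        node = root
        while node.get("children"):
            i = bisect.bisect_right(node["keys"], low)
            node = node["children"][i]
        # Walk the leaf chain, yielding keys in [low, high].
        while node is not None:
            for k, v in zip(node["keys"], node["values"]):
                if k > high:
                    return
                if k >= low:
                    yield k, v
            node = node["next_leaf"]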
Let's say you are building a tree over some objects based on some complex criterion, for example one calculated from multiple properties. Sometimes you can't change the object to store the calculated value, and calculating the criterion is expensive. So you calculate it only once, and store the objects in leaves based on the result. Then, when your tree is complete, you can find the required object much faster because you don't have to calculate the criterion for each tree node on your path.
Well, storing information objects in the nodes (we are talking in this case about a trie) is useful for fast retrieval of information, faster than storing things in an array/hash table, where the worst case of access is O(n); in the trie it is O(m), where m is the length of the key.
look here:
https://en.wikipedia.org/wiki/Trie
In a search tree these operations can be much more complicated (look at AVL trees, O(log n)), so they can be slower and more complex to implement.
Which data structure to choose?
Well, this depends on what you want to do.
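For illustration, here is a minimal trie sketch of the O(m) lookup mentioned above; the "$value" end-of-key marker is just a made-up convention:

    def trie_insert(root, key, value):
        node = root
        for ch in key:
            node = node.setdefault(ch, {})  # one nested dict per character
        node["$value"] = value              # marks end of key

    def trie_get(root, key):
        node = root
        for ch in key:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$value")

    trie = {}
    trie_insert(trie, "ford", 42)
    trie_get(trie, "ford")  # 42, reached after len("ford") steps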
