How are values larger than a page size stored in B-trees?

Values that fit into a B-tree node (represented by a page, typically 4 KB) are put directly into the nodes, which are then flushed to disk. That means a node can hold roughly 1000 4-byte values. But how are large values, whose size exceeds the size of a node, written to disk? How are large values represented in nodes in memory? Obviously a node cannot hold a 15 KB value.

Tree nodes and disk blocks are two different layers. A node is a logical concept, while a block is physical. A node can take more space than the size of a block, and that is a common situation. It is up to the I/O code how to split the node into multiple blocks. Related blocks (empty ones, or the ones that belong to one node) may be linked to each other via some 'nextBlockId' property.
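As a minimal sketch of that idea (assuming a 4 KB block size; Block, nextBlockId, and splitNode are illustrative names, not taken from any particular engine), splitting one oversized serialized node across a linked chain of blocks might look like this:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;
constexpr std::uint32_t kNoBlock = 0;  // sentinel marking the end of a chain

struct Block {
    std::uint32_t nextBlockId;  // next block of the same logical node, or kNoBlock
    std::uint8_t payload[kBlockSize - sizeof(std::uint32_t)];
};

// Split a serialized node into a chain of blocks occupying consecutive ids.
std::vector<Block> splitNode(const std::vector<std::uint8_t>& node,
                             std::uint32_t firstBlockId) {
    const std::size_t cap = sizeof(Block::payload);
    const std::size_t count = (node.size() + cap - 1) / cap;
    std::vector<Block> blocks(count);
    for (std::size_t i = 0; i < count; ++i) {
        blocks[i].nextBlockId =
            (i + 1 < count) ? firstBlockId + static_cast<std::uint32_t>(i) + 1
                            : kNoBlock;
        const std::size_t off = i * cap;
        const std::size_t n = std::min(cap, node.size() - off);
        std::memcpy(blocks[i].payload, node.data() + off, n);
    }
    return blocks;
}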

Related

Is the benefit of a B-Tree not lost when it is saved in a file?

I was reading about B-Trees, and it was interesting to learn that they are specifically built for storage in secondary memory. But I am a little puzzled by a few points:
If we save the B-Tree in secondary memory (via serialization in Java), is the advantage of the B-Tree not lost? Because once a node is serialized, we no longer have references to its child nodes (as we do in primary memory). That means we would have to read all the nodes one by one (as no reference to a child is available). And if we have to read all the nodes, then what is the advantage of the tree? I mean, in that case we are not using binary search on the tree. Any thoughts?
When a B-Tree is used on disk, it is not read from a file, deserialized, modified, serialized, and written back as a whole.
A B-Tree on disk is a disk-based data structure consisting of blocks of data, and those blocks are read and written one block at a time. Typically:
Each node in the B-Tree is a block of data (bytes). Blocks have fixed sizes.
Blocks are addressed by their position in the file, if a file is used, or by their sector address if B-Tree blocks are mapped directly to disk sectors.
A "pointer to a child node" is just a number that is the node's block address.
Blocks are large, typically large enough to hold 1000 children or more. That's because reading a block is expensive, but the cost doesn't depend much on the block size. By keeping blocks big enough that the whole tree has only 3 or 4 levels, we minimize the number of reads or writes required to access any specific item (with 1000 children per node, a 3-level tree already indexes 1000^3 = one billion items).
Caching is usually used so that most accesses only need to touch the lowest level of the tree on disk.
So to find an item in a B-Tree, you would read the root block (it will probably come out of cache), look through it to find the appropriate child block and read that (again probably out of cache), maybe do that again, finally read the appropriate leaf block and extract the data.
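A minimal sketch of that lookup loop, assuming 4 KB blocks stored back-to-back in a single file and a deliberately simplified node layout (NodeBlock, readBlock, and search are illustrative names, not any real engine's format):

#include <cstdint>
#include <cstdio>

constexpr std::size_t kBlockSize = 4096;
constexpr int kMaxKeys = 150;  // sized so one node fits in one 4 KB block

struct NodeBlock {
    std::uint16_t numKeys;
    std::uint8_t  isLeaf;
    std::int64_t  keys[kMaxKeys];
    std::uint64_t children[kMaxKeys + 1];  // child block numbers, not pointers
    std::uint64_t values[kMaxKeys];        // payload, used only in leaves
};

// A "pointer to a child node" is just its block number: reading a node is one
// seek plus one fixed-size read.
bool readBlock(std::FILE* f, std::uint64_t blockNo, NodeBlock* out) {
    if (std::fseek(f, static_cast<long>(blockNo * kBlockSize), SEEK_SET) != 0)
        return false;
    return std::fread(out, sizeof(NodeBlock), 1, f) == 1;
}

bool search(std::FILE* f, std::uint64_t rootBlock, std::int64_t key,
            std::uint64_t* valueOut) {
    NodeBlock node;
    std::uint64_t blockNo = rootBlock;
    for (;;) {  // typically only 3 or 4 iterations, most served from cache
        if (!readBlock(f, blockNo, &node)) return false;
        if (node.isLeaf) {
            for (int i = 0; i < node.numKeys; ++i)
                if (node.keys[i] == key) { *valueOut = node.values[i]; return true; }
            return false;
        }
        int i = 0;
        while (i < node.numKeys && key >= node.keys[i]) ++i;  // pick the child range
        blockNo = node.children[i];
    }
}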

What is the difference between node.disk_used and index.store_size?

What is the difference between node.disk_used and the index data/store size?
How can the index's total store size be bigger than the disk used?
In Elasticsearch, store_size is the size of the index's data taken across primary and replica shards, while disk_used is the used disk space. Thus node.disk_used represents the used disk space on that one node, while store_size is computed from the collection of documents. Within a node, you can declare multiple indexes. As for the second part of your question: since store_size counts the primary and replica copies, which may be spread across several nodes, it can exceed the disk used on any single node. In relation to that, this is an interesting overview of the problem you are having.

Can B+ tree search perform better than binary search tree search when all key-data pairs of the leaf nodes are in memory?

Assume that we are implementing a B+ tree in memory; keys are in the internal nodes, and key-data pairs are in the leaf nodes.
If the B+ tree has a fan-out f, it will have a height of log_f N, where N is the number of keys, whereas the corresponding BST will have a height of log_2 N.
If we are not doing any disk reads and writes, can B+ tree search performance be better than binary search tree performance? How?
After all, at each internal node of the B+ tree we have to make a decision among f choices, instead of just one for a BST.
At least when compared to cache, main memory has many of the same characteristics as a disk drive: it has fairly high bandwidth, but much higher latency than cache. It has a fairly large minimum read size, and gives substantially higher bandwidth when reads are predictable (e.g., when you read a number of cache lines at contiguous addresses). As such, it benefits from the same general kinds of optimizations (though the details often vary a bit).
B-trees (and variants like B* and B+ trees) were explicitly designed to work well with the access patterns supported well by disk drives. Since you have to read a fairly substantial amount of data anyway, you might as well pack the data to maximize the amount you accomplish from the memory you have to read. In both cases, you also frequently get a substantial bandwidth gain by reading some multiple of the minimum read in a predictable pattern (especially, a number of successive reads at successive addresses). As such, it often makes sense to increase the size of a single page to something even larger than the minimum you can read at once.
Likewise, in both cases we can plan on descending through a number of layers of nodes in the tree before we find the data we really care about. Much like when reading from disk, we benefit from maximizing the density of keys in the data we read, until we've actually found the data we care about. With a typical binary tree:
template <class T, class U>
struct node {
    T key;         // key stored inline
    U data;        // payload stored inline, fetched whether we need it or not
    node *left;
    node *right;
};
...we end up reading a number of data items for which we have no real use. It's only when we've found the right key that we need/want to get the associated data. In fairness, we can do that with a binary tree as well, with only a fairly minor modification to the node structure:
template <class T, class U>
struct node {
    T key;         // key still stored inline
    U *data;       // only a pointer: the payload is not fetched until needed
    node *left;
    node *right;
};
Now the node contains only a pointer to the data rather than the data itself. This won't accomplish anything if data is small, but can accomplish a great deal if it's large.
Summary: from the viewpoint of the CPU, reads from main memory have the same basic characteristics as reads from disk; a disk just shows a more extreme version of those same characteristics. As such, most of the design considerations that led to the design of B-trees (and variants) now apply similarly to data stored in main memory.
B-trees work well and often provide substantial benefits when used for in-memory storage.
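To make that concrete, here is a rough sketch of the kind of node an in-memory B-tree variant might use instead (the order of 16 and the field names are illustrative assumptions, not a reference layout):

template <class T, class U>
struct btree_node {
    static const int order = 16;  // chosen so the keys span only a few cache lines
    int count;                    // number of keys currently in use
    T key[order - 1];             // keys packed contiguously: dense and prefetch-friendly
    U *data[order - 1];           // pointers to payloads, as in the modified node above
    btree_node *child[order];     // child pointers
};

Searching within such a node scans one contiguous run of keys, so it costs only a cache-line fetch or two, whereas making the same number of comparisons in a binary tree means chasing that many pointers to scattered cache lines.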

How to decide order of a B-tree

B-trees are said to be particularly useful in the case of huge amounts of data that cannot fit in main memory.
My question then is: how do we decide the order of a B-tree, i.e., how many keys to store in a node, or how many children a node should have?
I have come across examples everywhere that use 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
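As a back-of-the-envelope sketch of the disk case, assuming a 4096-byte page, 8-byte keys, 8-byte child pointers, and a 16-byte node header (all of these numbers are illustrative, not specs for any particular system):

#include <cstdio>

int main() {
    const int pageSize = 4096, header = 16, keySize = 8, ptrSize = 8;
    // An order-m node holds (m - 1) keys and m child pointers, so we need:
    //   header + (m - 1) * keySize + m * ptrSize <= pageSize
    // Solving for the largest integer m:
    int m = (pageSize - header + keySize) / (keySize + ptrSize);
    std::printf("order m = %d, i.e. %d keys per node\n", m, m - 1);
    return 0;
}

With those numbers the order comes out around 255, which is why real on-disk B-trees hold hundreds of keys per node rather than the 4 or 5 used in textbook diagrams.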
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!

KD-Tree on secondary memory?

I know some of the range-searching data structures, for example the kd-tree, the range tree, and the quadtree.
But all the implementations are in memory. How can I implement them in secondary memory with high I/O efficiency?
Here are the conditions:
1) A static set of points in two dimensions.
2) Queries only, no insert or delete.
3) Adapted for secondary memory.
Thanks.
If you can fit the tree into memory during construction:
1. Build the kd-tree.
2. Bottom-up, collect as many points as possible that fit into one block of your hardware's block size.
3. Write the data to this block.
4. Repeat steps 2-3 recursively until you've written all the data to disk.
When querying, load a page from disk and process that part of the tree until you reach a reference to another page; then load that page and continue there (as sketched below).
Alternatively, you can do the same top-down, but then you will likely need more disk space. In the above approach, only the root page can be near-empty.
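A sketch of that query loop, under illustrative assumptions: nodes within one page are addressed by a small local index, and a child reference either stays inside the page or, with the top bit set, names another page to load. The layout and all names (DiskKdNode, Page, descend) are hypothetical:

#include <cstdint>

constexpr int kNodesPerPage = 170;                // as many nodes as fit in one block
constexpr std::uint32_t kExternal = 0x80000000u;  // top bit: reference to another page

struct DiskKdNode {
    float split;                // splitting coordinate value
    std::uint8_t axis;          // 0 = x, 1 = y
    std::uint8_t isLeaf;
    std::uint32_t left, right;  // in-page node index, or kExternal | pageId
};

struct Page {
    DiskKdNode nodes[kNodesPerPage];  // nodes[0] is the root of this page's subtree
};

// LoadFn is assumed to fetch one page from disk (or a cache): Page* (std::uint32_t).
template <class LoadFn>
const DiskKdNode* descend(LoadFn loadPage, std::uint32_t rootPage, float x, float y) {
    Page* page = loadPage(rootPage);
    std::uint32_t idx = 0;
    for (;;) {
        const DiskKdNode& n = page->nodes[idx];
        if (n.isLeaf) return &n;  // caller then scans the leaf's points
        const float coord = (n.axis == 0) ? x : y;
        const std::uint32_t next = (coord < n.split) ? n.left : n.right;
        if (next & kExternal) {                  // reference leaves this page:
            page = loadPage(next & ~kExternal);  // one more disk read
            idx = 0;                             // continue at that page's root
        } else {
            idx = next;                          // stay within the loaded page
        }
    }
}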
