B+ tree as an index structure in in-memory databases: how to avoid rebalancing?

I have just read the paper "An efficient B+-tree design for main-memory database systems with strong access locality" by Pei-Lun Suei and his colleagues, which proposes implementing B+ trees as an index structure with auxiliary trees to avoid the overhead of repeated rebalancing. Now I am looking for other approaches that implement B+ trees in in-memory databases in a different way. What other approaches exist? More specifically:
How is it possible to avoid the overhead of rebalancing B+ trees in in-memory databases when using B+ trees as an index structure?

Related

A tree that is both memory efficient and disk-space efficient?

I recently started reading about Data Structures in detail. I came across trees. AVL trees are designed taking fast memory access into consideration and B trees are designed taking efficient disk storage into consideration. Suppose I want to design a tree which is both memory efficient and disk storage efficient, what tree should I use? Is there any way I can combine AVL tree and B Tree? Is there any other tree that can do both? Is this fundamentally possible in a real-world scenario?
I want to design a tree which is both memory efficient and disk storage efficient (...) Is there any way I can combine AVL tree and B Tree?
The short answer is no, there isn't, unless you make a breakthrough discovery in the field of data structures. Both of them were designed with specific optimization requirements in mind; you can't have the best of both worlds.
There's a concept in computing called the space–time tradeoff, which can be extended to other types of tradeoffs, like the one you're interested in. You can think of it like this: to improve one property of an already optimized algorithm you will have to worsen another (unless you discover some new approach no one has thought of before).
I suggest you take a look at the available implementations of optimized Binary Trees and start with the one that best fits your needs.

How are large tree data structures traversed?

I was studying tree algorithms, and almost all of them use recursion for traversal. Of course, traversal can also be done without recursion (using an explicit stack data structure and a while loop). Out of curiosity, I wanted to know how these tree data structures are traversed when the tree has millions or billions of nodes. Questions like this are also asked in interviews.
Some of the approaches I can just think of are
Store tree in multiple files as different subtrees and traverse through files
Distribute tree across different machines
Store tree in database in table structure and design query for traversal
Are there any better approaches? If anyone could share a link to study material for this kind of problem, that would be helpful.
If the tree fits in memory, you can just walk it. I build tools that build ASTs with millions of nodes (both from lots of trees, and sometimes from incredibly deep trees); we store our trees in memory. A recursive walk works just fine. And it only takes tens of nanoseconds per node (cache line miss time) to do it, if done right.
Fixed sized stacks usually screw this up because such stacks prevent arbitrarily deep recursion. See How does a stackless language work? The languages in which I code tree operations don't have fixed size stacks.
You can store the tree distributed across machines or (worse!) in database. You can still walk over that tree, but the algorithms are clumsier, and the extra delays in communication (to remote machines, to database tables) make this into such a slow operation, that hardly anybody does it.
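As a minimal sketch of the non-recursive walk mentioned in the question (an explicit stack plus a while loop): the Node class and the iter_preorder function below are hypothetical names made up for illustration, and the tree is assumed to fit in memory.

```python
class Node:
    """Hypothetical tree node; not taken from any particular library."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []


def iter_preorder(root):
    """Yield node values in pre-order using an explicit stack, no recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node.value
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))


# Build a degenerate, very deep tree (a chain). Walking it recursively
# would exceed Python's default recursion limit; the explicit stack does not.
deep = Node(0)
cur = deep
for i in range(1, 10_000):
    child = Node(i)
    cur.children.append(child)
    cur = child

count = sum(1 for _ in iter_preorder(deep))
```

The explicit stack lives on the heap, so its depth is limited by available memory rather than by the call-stack size.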

B-tree and BST use case

I have read that B-tree was primarily intended for secondary storage look-ups owing to minimized disk seeks.
But considering the locality of reference it provides, and the consequently reduced chance of cache misses, wouldn't it be a preferred candidate for primary (in-memory) look-ups too?
Why would I ever use a BST over this?
If by B-tree you mean a balanced search tree as defined here, locality of memory is not its only purpose; it also offers favourable complexity for data access. Overall, a B-tree might be preferable for in-memory data structures too, but it might be harder to implement.
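A rough illustration of why a larger branching factor means fewer nodes touched per lookup. This is only a back-of-the-envelope sketch: it assumes a perfectly full tree, and the fanout of 64 and the function name levels are arbitrary choices for the example, not values from the question.

```python
def levels(n_keys, fanout):
    """Smallest number of levels h such that fanout**h >= n_keys."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        h += 1
    return h


n = 1_000_000
binary_levels = levels(n, 2)   # balanced binary tree: about 20 levels
btree_levels = levels(n, 64)   # hypothetical B-tree with fanout 64: 4 levels
```

Each level is a potential cache miss, so fewer levels can pay off in memory as well as on disk, although every B-tree node also needs a search within the node.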

Practical use of m-way tree

I have started studying data structures again. I found very few practical uses of the m-way tree. One of them was file systems on disk. Can someone give me more examples of practical uses of m-way trees?
M-way trees come up in a lot of arenas. Here's a small sampling:
B-trees. These are search trees like a binary search tree, but with a huge branching factor. They're designed in such a way that each node fits just inside the amount of memory that can be read from a hard disk in one pass. They have the same asymptotic guarantees as balanced BSTs, but are designed to minimize the number of nodes searched to find a particular element. Consequently, many giant database systems use B-trees or other related structures to store large tables on disks. That way, the number of expensive disk reads is minimized and the overall efficiency is much greater.
Octrees. Octrees and their two-dimensional cousins, quadtrees, are data structures for storing points in three-dimensional space. They're used extensively in video games for fast collision detection and real-time rendering computations, and we would be much the worse off if not for them.
Link/cut trees. These specialized trees are used in network flow problems to efficiently compute matchings or find maximum flows much faster than conventional approaches, which has huge applicability in operations research.
Disjoint-set forests. These multiway trees are used in minimum-spanning tree algorithms to compute connectivity blindingly fast, optimizing the runtime to around the theoretical limit.
Tries. These trees are used to encode string data and allow for extremely fast lookup, storage, and maintenance of sets of strings. They're also used in some regular expression matchers.
Van Emde Boas trees. A lightning-fast implementation of priority queues of integers that is backed by a forest of trees with enormous branching factor.
Suffix trees. These jewels of the text processing world allow for fast string searches. They also typically have a branching factor much greater than two.
PQ-trees. These trees for encoding permutations allow for linear-time planarity testing, which has applications in circuit layout and graph drawing.
Phew! That's a lot of trees. Hope this helps!
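To make one item from the list above concrete, here is a minimal sketch of a disjoint-set forest with union by size and path halving. The class name DisjointSet and its methods are illustrative choices, not a reference implementation.

```python
class DisjointSet:
    """Disjoint-set forest over the integers 0..n-1 (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.size = [1] * n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


ds = DisjointSet(5)
ds.union(0, 1)
ds.union(3, 4)
same = ds.find(0) == ds.find(1)       # 0 and 1 are connected
different = ds.find(1) != ds.find(3)  # 1 and 3 are not
```

In Kruskal's minimum-spanning-tree algorithm, an edge is kept exactly when union returns True for its two endpoints.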
By m-way, do you mean a generalized tree? If so, pretty much any 'single parent' hierarchy.

How do I determine which kind of tree data structure to choose?

Ok, so this is something that's always bothered me. The tree data structures I know of are:
Unbalanced binary trees
AVL trees
Red-black trees
2-3 trees
B-trees
B*-trees
Tries
Heaps
How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?
Let’s pick them off one by one, shall we?
Unbalanced binary trees
For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.
Apart from that, unbalanced binary trees of course have other uses, but not as search trees.
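A small sketch of why the performance is unpredictable: the same keys inserted in sorted order turn an unbalanced binary search tree into a linked list. BSTNode, insert and height are hypothetical helper names for this example only.

```python
import random


class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key without any rebalancing; returns the (possibly new) root."""
    if root is None:
        return BSTNode(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = BSTNode(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = BSTNode(key)
                return root
            node = node.right


def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative)."""
    if root is None:
        return 0
    best = 0
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


keys = list(range(500))
sorted_height = height(build(keys))     # degenerates into a chain of 500 nodes

shuffled = keys[:]
random.Random(0).shuffle(shuffled)
shuffled_height = height(build(shuffled))  # far shallower for a random order
```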
AVL trees
They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.
Red-black trees
These are used inside most of C++’s std::map implementations and probably in a few other standard libraries as well. However, there’s good evidence that they are actually worse than B(+) trees in every scenario due to caching behaviour of modern CPUs. Historically, when caching wasn’t as important (or as good), they surpassed B trees when used in main memory.
2-3 trees
B-trees
B*-trees
These require the most careful consideration of all the trees, since the different constants used are basically “magical” constants which relate in weird and sometimes unpredictable ways to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.
I know of no good, general rule to distinguish between them.
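As a rough sketch of how such a constant can fall out of the hardware: assuming a node must fit in one cache line or one page and holds k keys plus k + 1 child pointers, the largest k follows from simple arithmetic. The function name and the sizes (8-byte keys and pointers, 64-byte cache line, 4 KiB page) are assumptions for the example, not recommendations.

```python
def max_keys_per_node(node_bytes, key_bytes, pointer_bytes, header_bytes=0):
    """Largest k with header + k*key + (k+1)*pointer <= node_bytes."""
    usable = node_bytes - header_bytes - pointer_bytes
    return max(usable // (key_bytes + pointer_bytes), 0)


cache_line_keys = max_keys_per_node(64, 8, 8)   # node sized to a 64-byte cache line
page_keys = max_keys_per_node(4096, 8, 8)       # node sized to a 4 KiB page
```

With these assumed sizes a cache-line node holds 3 keys and a page-sized node holds 255; real implementations also have to account for headers, alignment and variable-length keys, so the best value is usually found by measuring.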
Tries
Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).
Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.
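To make the prefix-tree idea concrete, here is a minimal sketch of a trie that stores a set of strings and answers exact-match and prefix queries. The class and method names are illustrative only.

```python
class Trie:
    """Uncompressed prefix tree over characters (illustrative sketch)."""

    def __init__(self):
        self.children = {}    # maps a character to the child Trie node
        self.is_word = False  # True if a stored string ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Trie()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from this node; return the node reached or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


t = Trie()
for w in ("tree", "trie", "heap"):
    t.insert(w)
```

Lookup cost depends on the length of the query string, not on how many strings are stored.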
Heaps
As you’ve already said, they are not search trees at all.
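For the priority-queue use mentioned in the question, a short sketch with Python's standard heapq module, which maintains a binary min-heap inside a plain list. The task names and priorities are made up for the example.

```python
import heapq

tasks = []
# Each entry is (priority, name); the smallest priority is popped first.
heapq.heappush(tasks, (2, "write report"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))

order = [heapq.heappop(tasks)[1] for _ in range(3)]
```

The heap only guarantees that the minimum is at the front; it does not support efficient search for an arbitrary key, which is why it is not a search tree.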
The same as any other data structure, you have to know the characteristics (complexity of search, insert, and delete operations) of each type of tree, and the requirements of the job you're selecting a tool for. The tree that has the best performance for the type of operations you'll do most often is usually the best tool for the job.
You can usually find the general characteristics for any kind of data structure on Wikipedia. Introduction to Algorithms also has at least a section (in some cases a whole chapter) on most of the data structures you've listed, so it's another good reference.
Similar question: When to choose RB tree, B-Tree or AVL tree?
Offhand, I'd say, write the simplest code that could possibly work (availing yourself of library-provided data structures if possible). Then measure its performance problems, if any.
If your performance needs are really extreme, read Konrad Rudolph's awesome answer. :)
Each of these has different complexity for insertion, deletion and retrieval. Most of them have O(log n) access times.
Each tree has specific characteristics which make it useful in a certain way. You should compare their characteristics with the needs you have.

Resources