When to choose RB tree, B-Tree or AVL tree? - data-structures

As a programmer when should I consider using a RB tree, B- tree or an AVL tree?
What are the key points that needs to be considered before deciding on the choice?
Can someone please explain with a scenario for each tree structure why it is chosen over others with reference to the key points?

Take this with a pinch of salt:
B-tree when you're managing more than thousands of items and you're paging them from a disk or some slow storage medium.
RB tree when you're doing fairly frequent inserts, deletes and retrievals on the tree.
AVL tree when your inserts and deletes are infrequent relative to your retrievals.

I think B+ trees are a good general-purpose ordered container data structure, even in main memory. Even when virtual memory isn't an issue, cache-friendliness often is, and B+ trees are particularly good for sequential access - the same asymptotic performance as a linked list, but with cache-friendliness close to a simple array. All this and O(log n) search, insert and delete.
B+ trees do have problems, though - such as the items moving around within nodes when you do inserts/deletes, invalidating pointers to those items. I have a container library that does "cursor maintenance" - cursors attach themselves to the leaf node they currently reference in a linked list, so they can be fixed or invalidated automatically. Since there's rarely more than one or two cursors, it works well - but it's an extra bit of work all the same.
Another thing is that the B+ tree is essentially just that. I guess you can strip off or recreate the non-leaf nodes depending on whether you need them or not, but with binary tree nodes you get a lot more flexibility. A binary tree can be converted to a linked list and back without copying nodes - you just change the pointers then remember that you're treating it as a different data structure now. Among other things, this means you get fairly easy O(n) merging of trees - convert both trees to lists, merge them, then convert back to a tree.
Yet another thing is memory allocation and freeing. In a binary tree, this can be separated out from the algorithms - the user can create a node then call the insert algorithm, and deletes can extract nodes (detach them from the tree, but dont free the memory). In a B-tree or B+-tree, that obviously doesn't work - the data will live in a multi-item node. Writing insert methods that "plan" the operation without modifying nodes until they know how many new nodes are needed and that they can be allocated is a challenge.
Red black vs. AVL? I'm not sure it makes any big difference. My own library has a policy-based "tool" class to manipulate nodes, with methods for double-linked lists, simple binary trees, splay trees, red-black trees and treaps, including various conversions. Some of those methods were only implemented because I was bored at one time or another. I'm not sure I've even tested the treap methods. The reason I chose red-black trees rather than AVL is because I personally understand the algorithms better - which doesn't mean they're simpler, it's just a fluke of history that I'm more familiar with them.
One last thing - I only originally developed my B+ tree containers as an experiment. It's one of those experiments that never ended really, but it's not something I'd encourage others to repeat. If all you need is an ordered container, the best answer is to use the one that your existing library provides - e.g. std::map etc in C++. My library evolved over years, it took quite a while to get it stable, and I just relatively recently discovered it's technically non-portable (dependent on a bit of undefined behaviour WRT offsetof).

In memory B-Tree has the advantage when the number of items is more than 32000... Look at speedtest.pdf from stx-btree.

When choosing data structures you are trading off factors such as
speed of retrieval v speed of update
how well the structure copes with worst case operations, for example insertion of records that arrive in a sorted order
space wasted
I would start by reading the Wikipedia articles referenced by Robert Harvey.
Pragmatically, when working in languages such as Java the average programmer tends to use the collection classes provided. If in a performance tuning activity one discovers that the collection performance is problematic then one can seek alternative implementations. It's rarely the first thing a business-led development has to consider. It's extremely rare that one needs to implement such data structures by hand, there are usually libraries that can be used.

Related

Merkle tree for finding data inconsistencies - optimizing number of queries

I understand the idea behind using Merkle tree to identify inconsistencies in data, as suggested by articles like
Key Concepts: Using Merkle trees to detect inconsistencies in data
Merkle Tree | Brilliant Math & Science Wiki
Essentially, we use a recursive algorithm to traverse down from root we want to verify, and follow the nodes where stored hash values are different from server (with trusted hash values), all the way to the inconsistent leaf/datablock.
If there's only one such block (leaf) that's corrupted, this means we following a single path down to leaf, which is log(n) queries.
However, in the case of multiple inconsistent data blocks/leaves, we need up to O(n) queries. In the extreme case, all data blocks are corrupted, and our algorithm will need to send every single node to server (authenticator). In the real world this becomes costly due to the network.
So my question is, is there any known improvement to the basic traverse-from-root algorithm? A possible improvement I could think of is to query the level of nodes in the middle. For example, in the tree below, we send the server the two nodes in the second level ('64' and '192'), and for any node that returns inconsistency, we recursively go to the middle level of that sub-tree - something like a binary search based on height.
This increases our best case time from O(1) to O(sqrt(n)), and probably reduces our worst case time to some extent (I have not calculated how much).
I wonder if there's any better approach than this? I've tried to search for relevant articles on Google Scholar, but looks like most of the algorithm-focused papers are concerned with the merkle-tree traversal problem, which is different from the problem above.
Thanks in advance!

Splay tree real life applications

Where would you use splay-tree in production. I mean a REAL LIFE example.
I was thinking about implementing autocomplete using tries and splay trees. For a large dataset it's not a good idea to traverse through trie from node x to the leaves to return results, so the idea was of having a splay tree inside a node in trie, so when user entered 'sta' it will go to s-t-a, 'a' - node and then return the top 5 elements in the splay tree (by BFS/level traversing, which doesn't necessarily mutates/modifies the tree)
Of course after the autocomplete variant was picked, we should traverse up the trie and update all splay trees inside those nodes.
Since splay trees are sensitive in concurrent environments I was questioning its' usage in production
Your ideas?
Splay trees are not a good match for data which rarely or never changes, particularly in a threaded environment. The extra mutations during read operations defeat memory caches and can create unnecessary lock contention. In any case, for read-only data structures, you can do a one-time computation of an optimal tree. Even if that computation is slow, it will have no impact on the long-term execution time.
I'm not entirely persuaded by the claim that large tries are slow, and certainly not in the case of autocompleters. On even not-so-modern hardware, the cost of a trie traversal is trivial compared to the time it takes for the user to type a character, or even the time it takes for the underlying keyboard driver and input processor to deliver the keypress to your application.
If you really need to optimise a trie, there is good reason to believe that a hybrid data structure with a trie at the root combined with a linear (or binary) search once the alternatives can fit in a cache line. This maximizes the benefit of the trie's large fan-out while avoiding the poor caching behaviour and excessive storage overhead at the end of the lines.
Splay trees are most useful (if they are useful at all) on data structures which are modified frequently. The ckassic example is a "rope" data structure (a tree of string segments), which is one way to attempt to optimise a text editor by avoiding large string copies. Compared with a deterministic tree-balancing algorithm such as RB-trees, the splay tree algorithm has the benefit of simplicity, as well as only touching nodes which form part of the tree traversal.
However, the ready availability of self-balancing tree libraries (part of the standard libraries of many modern programming languages) combined with often-disappointing empirical results make the splay algorithm a niche product at best, although it is certainly a fascinating idea.
I found a quite interesting usage of splay trees in Network load optimisations, it's called SplayNet. A Autonomous System (I think under Facebook) has implemented this maybe around 2015 and they have somehow managed with this to lower their internal communication load by around 40%(?).
So there is a good usage for Splaytrees!
Few weeks ago i was also reading about Splaytrees being usefully depending on the spread in the sequence of Search. If there is none you could also use f.a. binary trees or some static trees. But in the moment there is one, Splaytrees perform (if you use unlimited time) better.
In my thesis I use splay trees as pre processed data collection for the actual searching. So the splay tree only stores the results of the most common search requests. In the next step the search starts from the splay tree given node ... I think this is useful for big datasets, specially if it's stored on different computers/storages, so your program has a better guess where to start.
To say it the easy way - my splaytrees stores the FAQ of the given datastructure/dataset :)

How to calculate that a B+ tree is O(log(n)) for lookups

I'm studying B+trees for indexing and I try to understand more than just memorizing the structure. As far as I understand the inner nodes of a B+tree forms an index on the leaves and the leaves contains pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
I read the wikipedia article on B+ trees and I understand the structure but not how an actual lookup is performed. Could you guide me perhaps with some link to reading material?
What are some other uses of B+ trees besides database indexing?
I'm studying B+trees for indexing and I try to understand more than just memorizing the structure. As far as I understand the inner nodes of a B+tree forms an index on the leaves and the leaves contains pointers to where the data is stored on disk. Correct?
No, the index is formed by the inner nodes (non-leaves). Depending on the implementation the leaves may contain either key/value pairs or key/pointer to value pairs. For example, a database index uses the latter, unless it is an IOT (Index Organized Table) in which case the values are inlined in the leaves. This depends mainly on whether the value is insanely large wrt the key.
Then how are lookups made?
In the general case where the root node is not a leaf (it does happen, at first), the root node contains a sorted array of N keys and N+1 pointers. You binary search for the two keys S0 and S1 such that S0 <= K < S1 (where K is what you are looking for) and this gives you the pointer to the next node.
You repeat the process until you (finally) hit a leaf node, which contains a sorted list of key-values pairs and make a last binary search pass on those.
If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
Binary trees are simpler to implement. One though cookie with B+Trees is to size the number of keys/pointers in inner nodes and the number of key/values pairs in leaves nodes. Another though cookie is to decide on the low and high watermark that leads to grouping two nodes or exploding one.
Binary trees also offer memory stability: an element inserted is not moved, at all, in memory. On the other hand, inserting an element in a B+Tree or removing one is likely to lead to elements shuffling
B+Trees are tailored for small keys/large values cases. They also require that keys can be duplicated (hopefully cheaply).
Could you guide me perhaps with some link to reading material?
I hope the rough algorithm I explained helped out, otherwise feel free to ask in the comments.
What are some other uses of B+ trees besides database indexing?
In the same vein: file-system indexing also benefits.
The idea is always the same: a B+Tree is really great with small keys/large values and caching. The idea is to have all the keys (inner nodes) in your fast memory (CPU Cache >> RAM >> Disk), and the B+Tree achieves that for large collections by pushing keys to the bottom. With all inner nodes in the fast memory, you only have one slow memory access at each search (to fetch the value).
B+ trees are better than binary tree all the dbms use them,
a lookup in B+Tree is LOGF N being F the base of LOG and the fan out. The lookup is performed exactly like in a binary tree but with a bigger fan out and lower height thats why it is way better.
B+Tree are usually known for having the data in the leaf(if they are unclustered probably not), this means you dont have to make another jump to the disk to get the data, you just take it from the leaf.
B+Tree is used almost everywhere, Operating Systems use them, datawarehouse (not so much here but still), lots of applications.
B+Tree are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with low cardinality.
If you can get this book http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638 its one of the best. Its basically the bible for any database guy.

Useful data structure for the following case

Which can be the beste data structures for the following case.
1.Should have operations like search, insert and delete. Mostly searching activities will be there.Around 90% of the operations will be search and rest are delete and insert.
2 Insertion,deletion and searching will be based on the key of the objects. Each key will point to a object. The keys will be sorted.
Any suggestion for optimal data structure will be highly appreciated.
AVL tree, or at least BST.
If you want to acces often the same elements you might want to consider splay trees too.
(Should I explain why?)
Not sure by what you mean with "data structures"
I would suggest MySQL.
Read more here: WikiPedia
Self-balancing tree of sorts (AVL, RB), or a hash table.
My guess is that you want to optimize time. Overall, a red-black tree will have logarithmic-time performance in all three operations. It will probably be your best overall bet on execution time; however, red-black trees are complex to implement and require a node structure meaning they will be stored using more memory than the contained data itself requires.
You want a tree-backed Map; basically you just want a tree where the nodes are dynamically sorted ("self-balanced") by key, with your objects hanging off of each node with corresponding key.
If you would like an "optimal" data structure, that completely depends on the distribution of patterns of inputs you expect. The nice thing about a self-balancing tree is you don't really need to care too much about the pattern of inputs. If you really want the best-guess as-close-to-optimal as possible we know of, and you don't know much about the specific sequences of queries, you can use a http://en.wikipedia.org/wiki/Tango_tree which is O(log(log(N))-competitive. This grows so slowly that, for all practical purposes, you have something which performs no worse than effectively a constant factor from the best possible data structure you could have chosen.
However it's somewhat grungy to implement, you may just be better using a library for a self-balancing tree.
Python:
https://github.com/pgrafov/python-avl-tree/
Java:
If you're just Java, just use a TreeMap (red-black tree based) and ignore the implementation details. Most languages have similar data structures in their standard libraries.

What are the advantages of T-trees over B+/-trees?

I have explored the definitions of T-trees and B-/B+ trees. From papers on the web I understand that B-trees perform better in hierarchical memory, such as disk drives and cached memory.
What I can not understand is why T-trees were/are used even for flat memory?
They are advertised as space efficient alternative to AVL trees.
In the worst case, all leaf nodes of a T-tree contain just one element and all internal nodes contain the minimum amount allowed, which is close to full. This means that on average only half of the allocated space is utilized. Unless I am mistaken, this is the same utilization as the worst case of B-trees, when the nodes of a B-tree are half full.
Assuming that both trees store the keys locally in the nodes, but use pointers to refer to the records, the only difference is that B-trees have to store pointers for each of the branches. This would generally cause up to 50% overhead or less (over T-trees), depending on the size of the keys. In fact, this is close to the overhead expected in AVL trees, assuming no parent pointer, records embedded in the nodes, keys embedded in the records. Is this the expected efficiency gain that prevents us from using B-trees instead?
T-trees are usually implemented on top of AVL trees. AVL trees are more balanced than B-trees. Can this be connected with the application of T-trees?
I can give you a personal story that covers half of the answer, that is, why I wrote some Pascal code to program B+ trees some 18 years ago.
my target system was a PC with two disk drives, I had to store an index on non volatile memory and I wanted to understand better what I was learning at university. I was very dissatisfied with the performance of a commercial package, probably DBase III, or some Fox product, I can't remember.
anyhow: I needed these operations:
lookup
insertion
deletion
next item
previous item
maximum size of index was not known
so data had to reside on disk
each access to the support had high cost
reading a whole block cost the same as reading one byte
B+-trees made that small slow PC really fly through the data!
the leafs had two extra pointers so they formed a doubly linked list, for sequential searches.
In reality the difference lies in the system you use. As my tutor in university commented it : if your problem lies in memory shortage, or in hdd shortage will determine which tree and in which implementation you will use. Most probably it will be B+ tree.
Because there are hundreds of implementations, for instance with 2direction queue and one directional queues where you need to loop thought elements, and also there are multiple ways to store the index and retrieve it will determine the real cons and mins of any implementation.

Resources