How do you remove an element from a b-tree? - algorithm

I'm trying to learn about b-tree and every source I can find seem to omits the discussion about how to remove an element from the tree while preserving the b-tree properties.
Can someone explain the algorithm or point me to resource that do explain how it's done?

There's an explanation of it on the Wikipedia page. B-tree - Deletion

If you haven't got it yet, I strongly recommend Carmen & al Introduction to Algorithms 3rd Edition.
It is not described because the operations naturally stem from the B-Tree properties.
Since you have a lower-bound on the number of elements in a node, if removing your elements violates this invariant, then you need to restore it, which generally involves merging with a neighbour (or stealing some of its elements).
If you merge with a neighbour, then you need to remove an element in the parent node, which triggers the same algorithm. And you apply recursively till you get to the top.
B-Tree don't have rebalancing (at least not those I saw) so it's far less complicated that maintaining a red-black tree or an AVL tree which is probably why people didn't feel compelled to write about the removal.

About which b-trees are you talking about? With linked leaves or not? Also, there are different ways of removing an item (top-bottom, bottom-top, etc.). This paper might help: B-trees, Shadowing, and Clones (even though there are many file-system specific related stuff).

The deletion example from CLRS (2nd edition) is available here: http://ysangkok.github.io/js-clrs-btree/btree.html
Press "Init book" and then push the deletion buttons in order. That will cover all cases. Try and predict the new tree state before pushing each button, and try to recognize how the cases are all unique.

Related

Data structure: a graph that's similar to a tree - but not a tree

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not enough to be referred as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of his "children";
There are "outer" nodes without children and brothers.
How can I name such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: terms such as "father", "children" and "brother" can be used in the context of a graph or they are only reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!
I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedence for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?

What are the differences between B-tree and B*-tree, except the requirement for fullness?

I know about this question, but it's about B-tree and B+-tree. Sorry, if there's similar for B*-tree, but I couldn't find such.
So, what is the difference between these two trees? The wikipedia article about B*-trees is very short.
The only difference, that is noted there, is "non-root nodes to be at least 2/3 full instead of 1/2". But I guess there's something more.. There could be just one kind of tree - the B-tree, just with different constants (for the fullness of each non-root node), and no two different trees, if this was the only difference, right?
Also, one more thing, that made me thing about more differences:
"A B*-tree should not be confused with a B+ tree, which is one where the
leaf nodes of the tree are chained together in the form of a linked list"
So, B+-tree has something really specific - the linked list. What is the specific characteristic of B*-tree, or there isn't such?
Also, there are no any external links/references in the wikipedia's article. Are there any resources at all? Articles, tutorials, anything?
Thanks!
Edited
No difference other than min. fill factor.
Page #489
From the above book,
B*-tree is a variant of a B-tree that requires each internal node to be at least 2/3 full, rather than at least half full.
Knuth also defines the B* tree exactly like that (The art of computer programming, Vol. 3).
"The Ubiquitous B-Tree" has a whole sub-section on B*-trees. Here, Comer defines the B*-tree tree exactly as Knuth and Corment et al. do but also clarifies where the confusion comes from --B*-tree tree search algorithms and some unnamed B tree variants designed by Knuth which are now called B+-trees.
Maybe you should look at Ubiquitous B-Tree by Comer (ACM Computing Surveys, 1979).
Comer writes there something about the B*Tree (In the section B-Tree and its variants). And in that section, he also cites some more paper about that topic. That should help you to do further investigations on your own :)! (I'm not your researcher ;) )
However, I don't understand the point where you cite a part which says that the B*Tree does not have a linked list in the leaf node level. I'm pretty sure, that also those nodes are linked together.
Regarding having only one B-Tree. Actually, you have that. The other ones like B+Tree, Prefix B+Tree and so on are just variants of the standard B-Tree. Just look at the paper Ubiquitous B-Tree.

Balancing a ternary search tree

How does one go about 'balancing' a ternary search tree? Most tst implementations don't address balancing, but suggest inserting in an optimal order (which I can't control.)
The article in Dr. Dobbs about Ternary Search Trees says: D.D. Sleator and R.E. Tarjan describe theoretical balancing algorithms for ternary search trees in "Self-Adjusting Binary Search Trees" (Journal of the ACM, July 1985). You can find online versions of this paper with your favorite search engine.
One simple optimization is to make it a red-black tree, which can avoid some worst-case scenarios. TSTs are really just binary trees where the value of a given node is another TST. So, the "middle" child of a node is not really part of the tree that is being balanced at each level, as it cannot move to a different parent anyway.
This ensures that each tier of the trie is traversed in log(R) time, although you could probably do even better by taking into account the size of the subtries at each node. That looks to be a lot more complicated though!
A generalization of the binary search tree is the B-Tree, which works for fanouts anywhere from 2 and up. That's not the only way to do it, but it's a common one.
Roughly the way it works is if an insert or delete would put the tree out of balance, it steals an element or a space from a neighboring node. If even that isn't enough to keep the tree in balance, its height by will be changed to make room.
read this article:
"Self-Adjusting of Ternary Search Tries Using Conditional Rotations and Randomized Heuristics"
by
"Ghada Hany Badr∗ and B. John Oommen †"
it will help you to understanding self-adjusting and balancing TSTs.

When to choose RB tree, B-Tree or AVL tree?

As a programmer when should I consider using a RB tree, B- tree or an AVL tree?
What are the key points that needs to be considered before deciding on the choice?
Can someone please explain with a scenario for each tree structure why it is chosen over others with reference to the key points?
Take this with a pinch of salt:
B-tree when you're managing more than thousands of items and you're paging them from a disk or some slow storage medium.
RB tree when you're doing fairly frequent inserts, deletes and retrievals on the tree.
AVL tree when your inserts and deletes are infrequent relative to your retrievals.
I think B+ trees are a good general-purpose ordered container data structure, even in main memory. Even when virtual memory isn't an issue, cache-friendliness often is, and B+ trees are particularly good for sequential access - the same asymptotic performance as a linked list, but with cache-friendliness close to a simple array. All this and O(log n) search, insert and delete.
B+ trees do have problems, though - such as the items moving around within nodes when you do inserts/deletes, invalidating pointers to those items. I have a container library that does "cursor maintenance" - cursors attach themselves to the leaf node they currently reference in a linked list, so they can be fixed or invalidated automatically. Since there's rarely more than one or two cursors, it works well - but it's an extra bit of work all the same.
Another thing is that the B+ tree is essentially just that. I guess you can strip off or recreate the non-leaf nodes depending on whether you need them or not, but with binary tree nodes you get a lot more flexibility. A binary tree can be converted to a linked list and back without copying nodes - you just change the pointers then remember that you're treating it as a different data structure now. Among other things, this means you get fairly easy O(n) merging of trees - convert both trees to lists, merge them, then convert back to a tree.
Yet another thing is memory allocation and freeing. In a binary tree, this can be separated out from the algorithms - the user can create a node then call the insert algorithm, and deletes can extract nodes (detach them from the tree, but dont free the memory). In a B-tree or B+-tree, that obviously doesn't work - the data will live in a multi-item node. Writing insert methods that "plan" the operation without modifying nodes until they know how many new nodes are needed and that they can be allocated is a challenge.
Red black vs. AVL? I'm not sure it makes any big difference. My own library has a policy-based "tool" class to manipulate nodes, with methods for double-linked lists, simple binary trees, splay trees, red-black trees and treaps, including various conversions. Some of those methods were only implemented because I was bored at one time or another. I'm not sure I've even tested the treap methods. The reason I chose red-black trees rather than AVL is because I personally understand the algorithms better - which doesn't mean they're simpler, it's just a fluke of history that I'm more familiar with them.
One last thing - I only originally developed my B+ tree containers as an experiment. It's one of those experiments that never ended really, but it's not something I'd encourage others to repeat. If all you need is an ordered container, the best answer is to use the one that your existing library provides - e.g. std::map etc in C++. My library evolved over years, it took quite a while to get it stable, and I just relatively recently discovered it's technically non-portable (dependent on a bit of undefined behaviour WRT offsetof).
In memory B-Tree has the advantage when the number of items is more than 32000... Look at speedtest.pdf from stx-btree.
When choosing data structures you are trading off factors such as
speed of retrieval v speed of update
how well the structure copes with worst case operations, for example insertion of records that arrive in a sorted order
space wasted
I would start by reading the Wikipedia articles referenced by Robert Harvey.
Pragmatically, when working in languages such as Java the average programmer tends to use the collection classes provided. If in a performance tuning activity one discovers that the collection performance is problematic then one can seek alternative implementations. It's rarely the first thing a business-led development has to consider. It's extremely rare that one needs to implement such data structures by hand, there are usually libraries that can be used.

Resources