How Balanced are balanced B-Trees - b-tree

Say I have a B-Tree with nodes in a 3-4 configuration (3 elements and 4 pointers). Assuming I build up my tree legally according to the rules, is it possible for me to reach a situation where there are two nodes in a layer and one node has 4 exiting pointers and the other has only two exiting pointers?
In general, what guarantees do I have as to the balancedness of a properly used B-Tree?

The idea behind balance (in general balanced tree data structures) is that the difference in depths between any two sub-trees is zero or one (depending on the tree). In other words, the number of comparisons used to find a leaf node is always similar.
So yes, you can end up in the situation you describe, simply because the depths are the same. The number of elements in each node is not a concern to the balance (but see below).
This is perfectly legal even though there are more items in the left node than in the right (null pointers are not shown):
                 +---+---+---+
                 | 8 |   |   |
                 +---+---+---+
        ________/    |
       /             |
      |              |
+---+---+---+  +---+---+---+
| 1 | 2 | 3 |  | 9 |   |   |
+---+---+---+  +---+---+---+
However, it's very unusual to have a 3-4 BTree (some would actually say that's not a BTree at all, but some other data structure).
With BTrees, you usually have an even number of keys as maximum in each node (e.g., a 4-5 tree) so that the splitting and combining is easier. With a 4-5 tree, the decision as to which key is promoted when a node fills up is easy - it's the middle one of the five. That's not such a clear-cut matter with a 3-4 tree since it could be one of two (there is no definite middle for four elements).
It also allows you to follow the rule that your nodes should contain between n and 2n elements. In addition (for "proper" BTrees), the leaf nodes are all at the same depth, not just within one of each other.
If you added the following values to an empty BTree, you could end up with the situation you describe:
Add   Tree Structure
---   ----------------
 1    1
 2    1,2
 5    1,2,5
 6    1,2,5,6
 7        5
         / \
      1,2   6,7
 8        5
         / \
      1,2   6,7,8
 9        5
         / \
      1,2   6,7,8,9
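The insertion sequence above can be replayed with a small sketch. This is not a full B-tree implementation: it assumes at most 4 keys per node (the "4-5" tree from the answer), splits a node only after it overflows, and promotes the middle of the five keys. The names `Node`, `insert` and `leaf_depths` are made up for this illustration.

```python
import bisect

MAX_KEYS = 4  # assumption: a "4-5" tree (4 keys, 5 pointers per node)

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list means this is a leaf

def _insert(node, key):
    """Insert key below node; return (median, right_sibling) if node split."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:
        node.keys.insert(i, key)
    else:
        split = _insert(node.children[i], key)
        if split:
            median, right = split
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    # Overflow: promote the middle key and split the rest into two nodes.
    mid = len(node.keys) // 2
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split:
        median, right = split
        return Node([median], [root, right])
    return root

def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur; a single value means balanced."""
    if not node.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))

root = Node()
for k in [1, 2, 5, 6, 7, 8, 9]:
    root = insert(root, k)

print(root.keys, [c.keys for c in root.children], leaf_depths(root))
```

The two children end up with different key counts, but every leaf sits at the same depth, which is the guarantee that matters for lookups.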

Related

How can I eliminate gaps in the breadth-first ordering of a binary search tree?

A gap-less binary search tree is a self-balancing binary search tree with the gap-less property. The gap-less property states that there are no gaps in the breadth-first ordering of the tree. A gap in the breadth-first ordering is best defined through a diagram. In the image below, the areas highlighted by red dashed circles are considered gaps in the breadth-first ordering:
If this tree were restructured to eliminate the gaps, it would look like this:
If the number 7 were added to this restructured tree without re-balancing, it would look like this:
Again, after removing the gaps:
Is there a log(n) algorithm to ensure the gap-less property after insertions and deletions to trees of arbitrary sizes?
No.
To see why, consider this tree (which has the gap-less property):
    4
   / \
  2   6
 /|   |\
1 3   5 7
To insert 8, you'd need to end up with this:
      5
     / \
    3   7
   /|   |\
  2 4   6 8
 /
1
which clearly requires touching almost every node, because all of the existing nodes except 1 have a different parent afterward than they had before. Therefore, you cannot possibly guarantee better than O(n) time.
Likewise, to remove 1, you'd need to end up with this:
    5
   / \
  3   7
 /|   |
2 4   6
which has the same problem: every remaining node gets a new parent.
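To check the argument on the example, here is a rough sketch that builds the gap-less shape for a sorted list of keys and counts how many keys have a different parent before and after an update. The helper names (`left_size`, `parents`, `changed`) are invented for this illustration, and the shape rule assumed is "every level full except the last, which is filled from the left".

```python
def left_size(n):
    """Size of the left subtree of a gap-less (complete) tree with n nodes."""
    if n <= 1:
        return 0
    h = n.bit_length() - 1          # height of the tree
    half = 1 << (h - 1)             # capacity of the left half of the last level
    last = n - ((1 << h) - 1)       # number of nodes on the last level
    return (half - 1) + min(last, half)

def parents(keys, parent=None, out=None):
    """Map each key to its parent in the gap-less BST over sorted keys."""
    if out is None:
        out = {}
    if keys:
        l = left_size(len(keys))
        root = keys[l]
        out[root] = parent
        parents(keys[:l], root, out)
        parents(keys[l + 1:], root, out)
    return out

def changed(before_keys, after_keys):
    """Count keys present in both trees whose parent differs."""
    before = parents(sorted(before_keys))
    after = parents(sorted(after_keys))
    return sum(1 for k in before if k in after and before[k] != after[k])

print(changed(range(1, 8), range(1, 9)))  # inserting 8 into 1..7
print(changed(range(1, 8), range(2, 8)))  # removing 1 from 1..7
```

In both cases six nodes change parent, so the amount of restructuring grows with the size of the tree rather than with its height.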

O(1) Algorithm for Counting Left-Children in Complete Binary Tree

I have a complete binary tree (i.e. a tree where "every level, except possibly the last, is completely filled, and all nodes are as far left as possible"). This tree is stored in depth-first, left-to-right order. My problem is, given a node by index and the tree's total size, tell me how many nodes are in that node's left subtree, in O(1).
For example, suppose the tree's total size is 10. This implies the following complete tree (note: the numbers are the node index in the depth-first, left-to-right order):
        0
       / \
      1   7
     / \  |\
    2   5 8 9
   /|  /
  3 4 6
Now, given a node index I need to find how many left-children it has. For this example:
Node 0 has 6 left-children.
Node 1 has 3 left-children.
Node 2 has 1 left-child.
Node 3 has 0 left-children.
Node 4 has 0 left-children.
Node 5 has 1 left-child.
Node 6 has 0 left-children.
Node 7 has 1 left-child.
Node 8 has 0 left-children.
Node 9 has 0 left-children.
Each such query must take O(1) time and be a function only of the node index and tree size (e.g., I cannot store anything in the tree).
I feel like this should be a fairly simple problem, but so-far I haven't been able to figure it out.
Strictly speaking, this is a bit of a simplification of my problem; I actually want 1+ this value, and I'll never call the function on leaves. But the core problem is this.
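For reference, the expected numbers above can be reproduced with the sketch below. Note that it is not the O(1) solution being asked for: it walks down from the root, so it takes O(log n) steps per query. It only relies on the fact that the size of the left subtree of a complete tree follows from the subtree's size alone. The function names are made up for this example.

```python
def left_size(n):
    """Size of the left subtree of a complete binary tree with n nodes."""
    if n <= 1:
        return 0
    h = n.bit_length() - 1          # height of the tree
    half = 1 << (h - 1)             # capacity of the left half of the last level
    last = n - ((1 << h) - 1)       # number of nodes on the last level
    return (half - 1) + min(last, half)

def left_count(index, size):
    """Nodes in the left subtree of the node at a depth-first index.

    O(log size) descent, NOT the O(1) formula the question asks for.
    """
    start, n = 0, size              # current subtree: root index and node count
    while index != start:
        l = left_size(n)
        if index <= start + l:      # target lies in the left subtree
            start, n = start + 1, l
        else:                       # target lies in the right subtree
            start, n = start + 1 + l, n - 1 - l
    return left_size(n)

print([left_count(i, 10) for i in range(10)])
```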

Can an m-way B-Tree have m odd?

I read in the CLRS book that an m-way B-tree has m even. Is there a B-tree where m is odd? If there is, how can we change the code given in the book to support it?
By an m-way B-tree I assume you mean a B-tree where each internal node is allowed to have at most m children. According to CLRS's definition of a B-tree:
Nodes have lower and upper bounds on the number of keys they can contain. We express these bounds in terms of a fixed integer t ≥ 2 called the minimum degree of the B-tree: ... an internal node may have at most 2t children.
So the maximum number of children will always be even – by this definition it can not be odd.
However, this is not the only definition of a B-tree. There are many definitions with slight differences that ultimately make little difference to the overall performance, and this can cause confusion. Some B-tree definitions allow odd upper bounds and some don't. CLRS's definition does not allow odd upper bounds for the children count of internal nodes.
However, another formal definition of a B-tree is by Knuth [1998] (The Art of Computer Programming, Volume 3 (Second ed.), Addison-Wesley, ISBN 0-201-89685-0). Knuth's definition does allow odd upper bounds. While CLRS enumerates all min-max tree bounds of the form (t, 2t) for t ≥ 2, Knuth enumerates all tree bounds of the form (ceil(k/2), k) for k ≥ 2.
Knuth Order, k | (min,max) | CLRS Degree, t
---------------|-----------|---------------
       0       |     –     |       –
       1       |     –     |       –
       2       |     –     |       –
       3       |   (2,3)   |       –
       4       |   (2,4)   |     t = 2
       5       |   (3,5)   |       –
       6       |   (3,6)   |     t = 3
       7       |   (4,7)   |       –
       8       |   (4,8)   |     t = 4
       9       |   (5,9)   |       –
      10       |   (5,10)  |     t = 5
So for example, a 2-3 tree, (2,3), is a B-tree with Knuth order 3. But it is not a valid CLRS tree because it has an odd upper bound.
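The table can be regenerated from the two formulas. This is only an illustration; the function names are invented here and are not from CLRS or Knuth.

```python
def knuth_bounds(k):
    """(min, max) children of a non-root internal node for Knuth order k >= 3."""
    return ((k + 1) // 2, k)        # (k + 1) // 2 is ceil(k / 2)

def clrs_bounds(t):
    """(min, max) children of a non-root internal node for CLRS degree t >= 2."""
    return (t, 2 * t)

def clrs_degree_for(k):
    """CLRS minimum degree t matching Knuth order k, or None if there is none."""
    return k // 2 if k >= 4 and k % 2 == 0 else None

for k in range(3, 11):
    print(k, knuth_bounds(k), clrs_degree_for(k))
```

Every even Knuth order k ≥ 4 corresponds to CLRS degree t = k/2, while the odd orders have no CLRS counterpart.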
Changing the code will not be easy, as B-trees have a lot of code depending on the variable t. One of the biggest changes would be inside B-TREE-SPLIT-CHILD(x,i): you'd need to find a way to split a child with an odd number of children (an even number of keys) into nodes y and z. One of these two resulting nodes will have one more key than the other. If you're looking for code, I'd recommend looking on the Internet for an implementation of a B-tree that uses a definition similar to Knuth's (e.g. search for "Knuth Order B-tree").
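One way to sidestep the uneven split is a different strategy from the one CLRS uses: instead of splitting full nodes on the way down, let a node overflow to k keys and split it afterwards. The sketch below assumes Knuth's order k (at most k children, so at most k - 1 keys) and only shows the key arithmetic; `min_keys` and `split_overflowed` are hypothetical helper names.

```python
def min_keys(order):
    """Minimum keys in a non-root node of Knuth order `order`: ceil(order/2) - 1."""
    return (order + 1) // 2 - 1

def split_overflowed(keys, order):
    """Split a node that has overflowed to `order` keys (one too many).

    Returns (left_keys, promoted_key, right_keys).
    """
    if len(keys) != order:
        raise ValueError("expected a node holding exactly `order` keys")
    mid = order // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

print(split_overflowed([10, 20, 30, 40, 50], 5))   # odd order: even halves
print(split_overflowed([10, 20, 30, 40], 4))       # even order: halves differ by one
```

With this approach both halves satisfy the minimum key count for odd and even orders alike; for odd k the overflowed node has an odd number of keys, so there is a clear middle key to promote.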

Optimize a list of text additions and deletions

I've got a list containing positions of text additions and deletions, like this:
    Type  Position  Text/Length
1.  +     2         ab           // 'ab' was added at position 2
2.  +     1         cde          // 'cde' was added at position 1
3.  -     4         1            // a character was deleted at position 4
To make it more clear, this is what these operations will do:
    1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    ---------------------------------
    t | e | x | t |   |   |   |   |
1.  t | a | b | e | x | t |   |   |
2.  c | d | e | t | a | b | e | x | t
3.  c | d | e | a | b | e | x | t |
The number of actions can be reduced to:
    Type  Position  Text/Length
1.  -     1         1            // 't' was deleted at position 1
2.  +     1         cdeab        // 'cdeab' was added at position 1
Or:
    Type  Position  Text/Length
1.  +     1         cdeab        // 'cdeab' was added at position 1
2.  -     6         1            // 't' was deleted at position 6
These actions are to be saved in my database and in order to optimize this: how can I reduce the number of actions that are to be done to get the same result? Is there a faster way than O(n*n)?
Note that these actions are chronological, changing the order of the actions will give another result.
Not a solution, just some thoughts:
Rule 1: if two consecutive operations don't have overlapping ranges, they can be swapped (with positions adjusted)
Rule 2: two consecutive inserts or removals at the same position can be joined
Rule 3: when an insert is followed by a removal that is completely contained in the insert, they can be joined
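Rules 2 and 3 are simple enough to sketch directly. The representation is an assumption made for this example: each operation is a tuple `(type, position, text_or_length)` with 1-based positions, and `join` is a made-up name. Rule 1 (swapping with position adjustment) is not covered here.

```python
def join(a, b):
    """Try to merge two consecutive operations (a is applied first, then b).

    Returns the merged operation, or None if Rules 2 and 3 do not apply.
    """
    (kind_a, pos_a, arg_a), (kind_b, pos_b, arg_b) = a, b
    # Rule 2: two inserts at the same position; the later text lands in front.
    if kind_a == '+' and kind_b == '+' and pos_a == pos_b:
        return ('+', pos_a, arg_b + arg_a)
    # Rule 2: two removals at the same position; lengths add up.
    if kind_a == '-' and kind_b == '-' and pos_a == pos_b:
        return ('-', pos_a, arg_a + arg_b)
    # Rule 3: a removal completely contained in the preceding insert.
    if (kind_a == '+' and kind_b == '-'
            and pos_a <= pos_b and pos_b + arg_b <= pos_a + len(arg_a)):
        cut = pos_b - pos_a
        return ('+', pos_a, arg_a[:cut] + arg_a[cut + arg_b:])
    return None

print(join(('+', 1, 'ab'), ('+', 1, 'cde')))
print(join(('+', 2, 'aaa'), ('-', 3, 1)))
```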
I don't see a straightforward algorithm for the shortest solution. However, a heuristic approach using Rules 1 and 2 might be:
1. move operations "up" unless
   - you'd violate Rule 1
   - you'd move an insert before a removal
   - the position is less than that of the predecessor
2. join consecutive inserts/removals at the same position
Applied to the sample, this would mean:
+ 2 ab
+ 1 cde
- 4 1
Rule 1 (2x):
+ 2 ab
- 1 1 // position adjusted by -3
+ 1 cde
.
- 1 1
+ 1 ab // position adjusted
+ 1 cde
Rule 2:
- 1 1
+ 1 cdeab // watch correct order!
A primitive implementation will be O(N*N): basically, a bubble sort with additional stopping conditions. I'm not sure about beating that complexity, since standard algorithms are of no use here due to having to adjust the positions.
However, you might be able to improve things notably, e.g. you don't need a "full sort".
Make a binary tree representing the document before and after applying all the changes. Each node represents either original text or inserted/deleted text; the latter kind of node includes both the amount of original text to delete (possibly 0) and the string of text to insert (possibly empty).
Initially the tree has just one node, "0 to end: original text". Apply all the changes to it merging changes as you go wherever possible. Then walk the tree from beginning to end emitting the final set of edits. This is guaranteed to produce the optimal result.
Applying an insert: Find the appropriate point in the tree. If it's in the middle of or adjacent to inserted text, just change that node's text-to-insert string. Otherwise add a node.
Applying a delete: Find the starting and ending points in the tree—unlike an insert, a delete may cover a whole range of existing nodes. Modify the starting and ending nodes accordingly, and kill all the nodes in between. After you're done, check to see if you have adjacent "inserted/deleted text" nodes. If so, join them.
The only tricky bit is making sure you can find points in the tree, without updating the entire tree every time you make a change. This is done by caching, at each node, the total amount of text represented by that subtree. Then when you make a change, you only have to update these cached values on nodes directly above the nodes you changed.
This looks strictly O(n log n) to me for all input, if you bother to implement a balanced tree and use ropes for the inserted text. If you ditch the whole tree idea and use vectors and strings, it's O(n²) but might work fine in practice.
Worked example. Here is how this algorithm would apply to your example, step by step. Instead of doing complicated ascii art, I'll turn the tree on its side, show the nodes in order, and show the tree structure by indentation. I hope it's clear.
Initial state:
*: orig
I said above we would cache the amount of text in each subtree. Here I just put a * for the number of bytes because this node contains the whole document, and we don't know how long that is. You could use any large-enough number, say 0x4000000000000000L.
After inserting "ab" at position 2:
    2: orig, 2 bytes
*: insert "ab", delete nothing
    *: orig, all the rest
After inserting "cde" at position 1:
        1: orig, 1 byte
    5: insert "cde", delete nothing
        1: orig, 1 byte
*: insert "ab", delete nothing
    *: orig, all the rest
The next step is to delete a character at position 4. Pause here to see how we find position 4 in the tree.
Start at the root. Look at the first child node: that subtree contains 5 characters. So position 4 must be in there. Move to that node. Look at its first child node. This time it contains only 1 character. Not in there. This edit contains 3 characters, so it's not in here either; it's immediately after. Move to the second child node. (This algorithm is about 12 lines of code.)
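The lookup just described might look roughly like the following. This is a sketch, not the answerer's code: the `Node` class, its field names, the labels, and the use of 0-based offsets are all assumptions made for the example. Each node caches the total length of text in its subtree, so the search only follows one path from the root.

```python
class Node:
    def __init__(self, label, length, left=None, right=None):
        self.label = label          # description of the piece, for display only
        self.length = length        # text length contributed by this node itself
        self.left = left
        self.right = right
        # Cached total length of this subtree (own text plus both children).
        self.size = length + sum(c.size for c in (left, right) if c)

def find(node, pos):
    """Return (node, offset) for the 0-based text offset pos."""
    while node is not None:
        left_size = node.left.size if node.left else 0
        if pos < left_size:
            node = node.left                    # target is in the left subtree
        elif pos < left_size + node.length:
            return node, pos - left_size        # target is in this node
        else:
            pos -= left_size + node.length      # skip left subtree and this node
            node = node.right
    raise IndexError("position past the end of the document")

REST = 1 << 62  # stand-in for "all the rest" of the original text

# The tree after the two inserts, turned upright.
cde = Node('insert "cde"', 3, Node('orig-a', 1), Node('orig-b', 1))
root = Node('insert "ab"', 2, cde, Node('orig-rest', REST))

node, offset = find(root, 4)
print(node.label, offset)
```

When a node changes, only the cached `size` values on the path from that node up to the root need to be refreshed.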
After deleting 1 character at position 4, you get this...
    4: orig, 1 byte
        3: insert "cde", delete nothing
*: insert "ab", delete nothing
    *: orig, all the rest
...and then, noticing two adjacent insert nodes, you merge them. (Note that given two adjacent nodes, one is always somewhere above the other in the tree hierarchy. Merge the data into that higher node; then delete the lower one and update the cached subtree sizes in between.)
    1: orig, 1 byte
*: insert "cdeab", delete nothing
    *: orig, all the rest
The "diff" tools used in source code control systems use algorithms that produce the minimum edit needed to transform one piece of source code to another - it might be worth investigating them. I think most of them are based (eventually) on this algorithm, but it's a while since I did any reading on this subject.
I believe that this can be done considerably faster than O(n²) on average (it is likely that input can be engineered not to allow fast analysis). You can regard consecutive additions or deletions as sets. You can analyze one operation at a time, and you will have to do some conditional transformations:
If an addition follows an addition, or a set of additions, it might
    touch (one or more of) the previous addition(s): then, you can unite these additions
    not touch: you can order them (you will have to adjust the positions)
If a deletion follows an addition, or a set of additions, it might
    only delete characters from the addition: then, you can modify the addition (unless it would split an addition)
    only delete characters not from the set of additions: then, you can move the deletion to a position before the set of additions, and perhaps unite additions; after that, the set of deletions before the current set of additions might have to be applied to the additions before that
    do both: then, you can first split it into two (or more) deletions and apply the respective method
If a deletion follows a deletion, or a set of deletions, it can
    touch (one or more of) the previous deletion(s): then, you can unite these deletions
    not touch: you can order them (you will have to adjust the positions)
    in any case, you then have to apply analysis of the newly formed deletions on the previous additions
If an addition follows a deletion, no transformation is needed at this point
This is just a first rough draft. Some things may have to be done differently, e.g., it might be easier or more efficient to always apply all deletions, so that the result is always only one set of deletions followed by one set of additions.
Let's assume for simplicity that only letters a-z appear in your texts.
Initialize a list A with values a[i] = i for i = 1 to N (you will figure out yourself how big N should be).
Perform (simulate) all your operations on A. After this analyze A to find required operations:
First find the required delete operations by finding missing numbers in A (they will form groups of consecutive values; one group stands for one delete operation).
After this you can find the required insert operations by finding sequences of consecutive letters (one sequence is one insert operation).
In your example:
init A:
1 2 3 4 5 6 7 8 9 10
Step 1 (+:2:ab):
1 a b 2 3 4 5 6 7 8 9 10
Step2 (+:1:cde):
c d e 1 a b 2 3 4 5 6 7 8 9 10
Step3 (-:4:1):
c d e a b 2 3 4 5 6 7 8 9 10
Now we search for missing numbers to find deletes. In our example only one number (namely number 1) is missing,
so only 1 delete is required, so we have one delete operation:
-:1:1
(In general there may be more numbers missing, every sequence of missing numbers is one delete operation.
For example if 1, 2, 3, 5, 6, 10 are all missing numbers, then there are 3 delete operations: -:1:3, -:2:2, -:5:1. Remember that after every delete operation all indexes are decreased, you have to store total sum of former delete operations to calculate the index of current delete operation.)
Now we search for character sequences to find insert operations. In our example there is only one sequence:
cdeab at index 1, so we have one insert operation: +:1:cdeab
Hope this is clear enough.
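A compact sketch of this simulation approach follows. It assumes operations are tuples `(type, position, text_or_length)` with 1-based positions and that `n` is an upper bound on the original text length chosen by the caller; the name `simplify` is made up. Integers stand for original characters and strings for inserted ones, so the inserted text is not actually restricted to a-z here.

```python
def simplify(ops, n):
    """Replay ops on placeholders 1..n, then read off deletes and inserts."""
    a = list(range(1, n + 1))               # integers stand for original characters
    for kind, pos, arg in ops:
        if kind == '+':
            a[pos - 1:pos - 1] = list(arg)  # insert the characters of arg
        else:
            del a[pos - 1:pos - 1 + arg]    # remove arg characters

    result = []
    # Deletes: each run of missing original numbers is one delete operation.
    present = {x for x in a if isinstance(x, int)}
    removed = 0                             # characters deleted by earlier deletes
    i = 1
    while i <= n:
        if i in present:
            i += 1
            continue
        j = i
        while j <= n and j not in present:
            j += 1
        result.append(('-', i - removed, j - i))
        removed += j - i
        i = j
    # Inserts: each run of consecutive inserted characters is one insert operation.
    i = 0
    while i < len(a):
        if isinstance(a[i], int):
            i += 1
            continue
        j = i
        while j < len(a) and isinstance(a[j], str):
            j += 1
        result.append(('+', i + 1, ''.join(a[i:j])))
        i = j
    return result

print(simplify([('+', 2, 'ab'), ('+', 1, 'cde'), ('-', 4, 1)], 10))
```

The result lists all deletes first and then all inserts, each in ascending position order.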
How to reduce the number of actions: an algorithmic approach could try to sort the actions. I think that after sorting, the chance that neighbouring actions can be joined (in the manner Svante and peterchen showed) will rise. This may lead to the minimum number of actions that have to be performed.
In the following, "position-number" stands for the text insertion or deletion position.
Assuming it is possible to swap two neighbouring actions (by adjusting the position-numbers and the text/length property of these two actions), we can bring the action list into any order we like. I suggest bringing the deletion actions to the front of the action list, sorted by ascending position-number, followed by the addition actions, also sorted by ascending position-number.
The following examples should demonstrate why I think it is possible to swap any two neighbouring actions.
Swapping the following actions:
1. + 2 aaa -> taaaext
2. - 3 1 -> taaext
yields a single action:
1. + 2 aa -> taaext
Swapping the following actions:
1. + 3 aaa -> teaaaxt
2. + 1 bb -> bbteaaaxt
yields:
1. + 1 bb -> bbtext
2. + 5 aaa -> bbteaaaxt
Swapping the following actions:
1. + 1 bb -> bbtext
2. - 2 2 -> bext
yields:
1. - 1 1 -> ext
2. + 1 b -> bext
As the first example shows, in some cases a swap removes a deletion altogether. This is a beneficial side effect, and it is also the reason why I suggest moving all deletions to the front.
I hope that I didn't forget anything and have considered all the cases.
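As an illustration of the position adjustment, here is a sketch for the simplest case only: swapping two neighbouring additions. It assumes actions are tuples `(type, position, text_or_length)` with 1-based positions; `apply_ops` and `swap_inserts` are invented names, and the cases involving deletions are not handled.

```python
def apply_ops(text, ops):
    """Apply a list of actions to text, in order."""
    for kind, pos, arg in ops:
        i = pos - 1
        if kind == '+':
            text = text[:i] + arg + text[i:]
        else:
            text = text[:i] + text[i + arg:]
    return text

def swap_inserts(a, b):
    """Reorder additions a, b (a applied first) so that b's text is added first.

    Returns the new pair, or None if b lands strictly inside a's text
    (in that case the two additions should be joined instead).
    """
    (_, p, s), (_, q, t) = a, b
    if q <= p:                      # b goes in front of a's text
        return ('+', q, t), ('+', p + len(t), s)
    if q >= p + len(s):             # b goes behind a's text
        return ('+', q - len(s), t), ('+', p, s)
    return None

a, b = ('+', 3, 'aaa'), ('+', 1, 'bb')
swapped = swap_inserts(a, b)
print(swapped, apply_ops('text', [a, b]), apply_ops('text', list(swapped)))
```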

Permuting a binary tree without the use of lists

I need to find an algorithm for generating every possible permutation of a binary tree, and need to do so without using lists (this is because the tree itself carries semantics and constraints that cannot be translated into lists). I've found an algorithm that works for trees with a height of three or less, but whenever I get to greater heights, I lose one set of possible permutations per height added.
Each node carries information about its original state, so that a node can determine whether all possible permutations have been tried for that node. The node also carries information on whether or not it has been 'swapped', i.e. whether it has seen all possible permutations of its subtree. The tree is left-centered, meaning that the right node should always (except in some cases that I don't need to cover for this algorithm) be a leaf node, while the left node is always either a leaf or a branch.
The algorithm I'm using at the moment can be described sort of like this:
if the left child node has been swapped
    swap my right node with the left child node's right node
    set the left child node as 'unswapped'
    if the current node is back to its original state
        swap my right node with the lowest left node's right node
        swap the lowest left node's two child nodes
        set my left node as 'unswapped'
        set my left child node to use this as its original state
        set this node as swapped
        return null
    return this
else if the left child has not been swapped
    if the result of trying to permute left child is null
        return the permutation of this node
    else
        return the permutation of the left child node
if this node has a left node and a right node that are both leaves
    swap them
    set this node to be 'swapped'
The desired behaviour of the algorithm would be something like this:
            branch
            /    |
       branch    3
       /    |
  branch    2
  /    |
 0     1

            branch
            /    |
       branch    3
       /    |
  branch    2
  /    |
 1     0   <-- first swap

            branch
            /    |
       branch    3
       /    |
  branch    1   <-- second swap
  /    |
 2     0

            branch
            /    |
       branch    3
       /    |
  branch    1
  /    |
 0     2   <-- third swap

            branch
            /    |
       branch    3
       /    |
  branch    0   <-- fourth swap
  /    |
 1     2
and so on...
The structure is just completely unsuited for permutations, but since you know it's left-centered you might be able to make some assumptions that help you out.
I tried working it in a manner similar to yours, and I always got caught on the fact that you only have a binary piece of information (swapped or not) which isn't sufficient. For four leaves, you have 4! (24) possible combinations, but you only really have three branches (3 bits, 8 possible combinations) to store the swapped state information. You simply don't have a place to store this information.
But maybe you could write a traverser that goes through the tree and uses the number of leaves to determine how many swaps are needed, and then goes through those swaps systematically instead of just leaving it to the tree itself.
Something like
For each permutation
Encode the permutation as a series of swaps from the original
Run these swaps on the original tree
Do whatever processing is needed on the swapped tree
That might not be appropriate for your application, but you haven't given many details about why you need to do it the way you're doing it. The way you're doing it now simply won't work, since factorial (the number of permutations) grows faster than exponential (the number of states your "swapped" bits can encode). If you had 8 leaves, you would have 7 branches and 8 leaves for a total of 15 bits. There are 40320 permutations of 8 leaves, and only 32768 possible combinations of 15 bits. Mathematically, you simply cannot represent the permutations.
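The counting argument is easy to check numerically. The sketch below assumes, as in the paragraph above, one flag bit per node in a left-centered tree with n leaves (n - 1 branches plus n leaves); the function names are made up.

```python
from math import factorial

def flag_states(leaves):
    """States representable with one bit per node: (leaves - 1) branches + leaves."""
    return 2 ** (2 * leaves - 1)

def first_unrepresentable(limit=64):
    """Smallest leaf count whose permutations outnumber the flag states."""
    for n in range(1, limit):
        if factorial(n) > flag_states(n):
            return n
    return None

print(factorial(8), flag_states(8), first_unrepresentable())
```

With one bit per branch only (as in the four-leaf case mentioned earlier), the shortfall appears much sooner: 4! = 24 orderings against 2^3 = 8 states.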
What is wrong with making a list of all items in the tree, use generative means to build all possible orders (see Knuth Vol 4), and then re-map them to the tree structure?
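A minimal sketch of that suggestion, assuming the tree can be modelled as nested pairs `(left, right)` where the right element is always a leaf. It uses `itertools.permutations` from the standard library to generate the orders; `left_centered` and `all_trees` are hypothetical names, and any extra semantics the real nodes carry would have to be re-attached during the re-mapping step.

```python
from itertools import permutations

def left_centered(leaves):
    """Build a left-centered tree as nested tuples, e.g. (((0, 1), 2), 3)."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (tree, leaf)         # the new right child is always a leaf
    return tree

def all_trees(leaves):
    """Yield one left-centered tree for every ordering of the leaves."""
    for order in permutations(leaves):
        yield left_centered(order)

trees = list(all_trees([0, 1, 2, 3]))
print(len(trees), trees[0], trees[1])
```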

Resources