My problem is not with the structure that holds the Tree but with the way I am doing it, because I think this implementation will cost a lot in the long run.
I have a tree structure in which each Node contains a List of references to its children. The problem is that finding a particular child of a Node means going through the List of children, which takes linear time. I also need to store all of these as immediate children (I use the word "children" for immediate children only).
Is there any structure other than a List that I can put the children in, so that retrieval and deletion of a child is efficient, ideally logarithmic?
When I traverse the Tree, to get to the right child from the root node I have to check a condition for each child node. That check is a linear search.
I just want a technique that will improve this search for the right child in the children list during traversal.
Instead of having each node keep a regular list, have it keep either a sorted list (O(log n) lookup via binary search) or a hash map (constant-time lookup on average). In this case a sorted list is probably the best choice, since you can easily iterate over the elements and it saves space.
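As a rough sketch of the sorted-list idea, assuming each child can be identified by a comparable key (the `Node` class and its method names here are made up for illustration):

```python
import bisect


class Node:
    """Tree node that keeps its immediate children sorted by key."""

    def __init__(self, key):
        self.key = key
        self._keys = []      # sorted keys of the immediate children
        self._children = []  # child nodes, parallel to _keys

    def add_child(self, child):
        i = bisect.bisect_left(self._keys, child.key)
        if i < len(self._keys) and self._keys[i] == child.key:
            self._children[i] = child  # replace a child with the same key
        else:
            self._keys.insert(i, child.key)
            self._children.insert(i, child)

    def find_child(self, key):
        """Binary search: O(log n) comparisons."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._children[i]
        return None

    def remove_child(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]
            del self._children[i]
            return True
        return False

    def children(self):
        return list(self._children)
```

Note that with an array-backed list, only the lookup is logarithmic: inserting or deleting still shifts elements, which is linear in the worst case. If you don't need ordered iteration, replacing the two lists with a plain `dict` keyed by child key gives average constant-time lookup, insertion and deletion.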
Is there a balanced BST structure that also keeps track of subtree size in each node?
In Java, TreeMap is a red-black tree, but doesn't provide subtree size in each node.
Previously, I wrote a BST that keeps track of the subtree size of each node, but it's not balanced.
The questions are:
Is it possible to implement such a tree while keeping O(lg(n)) efficiency for the basic operations?
If yes, are there any 3rd-party libraries that provide such an implementation?
A Java implementation would be great, but other languages (e.g. C, Go) would also be helpful.
BTW:
The subtree size should be tracked in each node,
so that the size can be read without traversing the subtree.
Possible application:
Keep track of the rank of items whose value (that the rank depends on) might change on the fly.
The Weight Balanced Tree (also called the Adams Tree, or Bounded Balance tree) keeps the subtree size in each node.
This also makes it possible to find the Nth element, from the start or end, in log(n) time.
My implementation in Nim is on github. It has properties:
Generic (parameterized) key,value map
Insert (add), lookup (get), and delete (del) in O(log(N)) time
Key-ordered iterators (inorder and revorder)
Lookup by relative position from beginning or end (getNth) in O(log(N)) time
Get the position (rank) by key in O(log(N)) time
Efficient set operations using tree keys
Map extensions to set operations with optional value merge control for duplicates
There are also implementations in Scheme and Haskell available.
That's called an "order statistic tree": https://en.wikipedia.org/wiki/Order_statistic_tree
It's pretty easy to add the size to any kind of balanced search tree (red-black, AVL, B-tree, etc.), or you can use a balancing algorithm that works with the size directly, like weight-balanced trees (see DougCurrie's answer) or (better) size-balanced trees: https://cs.wmich.edu/gupta/teaching/cs4310/lectureNotes_cs4310/Size%20Balanced%20Tree%20-%20PEGWiki%20sourceMayNotBeFullyAuthentic%20but%20description%20ok.pdf
Unfortunately, I don't think there are any standard-library implementations, but you can find open source if you look for it. You may want to roll your own.
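To illustrate what "adding the size" looks like, here is a minimal sketch in Python. It is deliberately a plain (unbalanced) BST, so it is not any of the balanced trees mentioned above; it only shows the size bookkeeping, plus how a rotation (the building block of most balancing schemes) keeps the sizes correct. All names are made up for the example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Plain BST insert that refreshes sizes on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, k):
    """Return the k-th smallest key (0-based) using subtree sizes."""
    while node:
        left = size(node.left)
        if k < left:
            node = node.left
        elif k == left:
            return node.key
        else:
            k -= left + 1
            node = node.right
    raise IndexError("rank out of range")


def rank(node, key):
    """Return how many keys in the tree are strictly smaller than key."""
    r = 0
    while node:
        if key < node.key:
            node = node.left
        elif key > node.key:
            r += size(node.left) + 1
            node = node.right
        else:
            return r + size(node.left)
    return r


def rotate_left(x):
    """Left rotation; only the two rotated nodes need their size fixed."""
    y = x.right
    x.right = y.left
    y.left = x
    y.size = x.size
    x.size = 1 + size(x.left) + size(x.right)
    return y
```

Both `select` and `rank` follow a single root-to-leaf path, so in a balanced tree they cost O(log n). Because a rotation only has to touch the sizes of the two nodes involved, the augmentation does not change the asymptotic cost of rebalancing.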
In the B-tree insertion algorithm, I see that to handle inserting an element into a leaf that already holds 2t-1 elements, we need to split nodes in the tree. What I don't understand is why the insertion algorithm, while descending the tree to the target position, splits every node with 2t-1 elements, even when it seems useless. For example:
[figure: example]
I understand that there is a case where several nodes above the leaf hold 2t-1 elements, so moving the median up into them is a problem. But why not give a pinpoint solution for that case, instead of splitting every time?
Correct me if I said something wrong.
We split the full nodes on the way down to the target position because we don't know if we will need to "go back up." You can do it the way you are thinking, where we go down to the target node, split it, and then insert the median of the split into the parent, recursively splitting nodes as needed. But this requires us to go from the root, down to the target, and back up, potentially all the way to the root again. This might be undesirable, e.g. if accessing the nodes twice would be too expensive. In that case, it may be better to go in one pass straight down, where you split any full nodes to anticipate the need for more space.
For a demonstration, you can try inserting 10 into the trees in the middle and on the bottom of your drawing. The tree on the bottom, unsplit, needs to be split all the way to the root in the same way as the middle tree, because the two-pass algorithm didn't leave any space. In the middle tree, inserting 10 still causes a split, but it doesn't extend all the way up because the top two layers of the tree are very spacious.
There is an important caveat, though. Let t be the minimum number of children per node. For the two-pass algorithm, the maximum number of children a node can have needs to be at least u = 2t - 1. If it is less, like 2t - 2, then splitting a full node (2t - 3 elements), even with the additional element to insert, will not be able to make two non-deficient nodes. The one-pass algorithm requires a higher maximum, u = 2t. This is because the two-pass algorithm always has an element on hand to cancel exactly one deficiency. The one-pass algorithm does not have this ability, as it sometimes splits nodes unnecessarily, so it can't stick the element it's holding into one of the deficiencies. It might not belong there.
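For reference, here is a minimal sketch of the one-pass variant in Python, assuming distinct keys and the textbook layout where a node holds between T - 1 and 2T - 1 keys (so at most u = 2T children). The names `BNode`, `split_child` and `insert` are illustrative, not taken from any particular library.

```python
import bisect

T = 2  # minimum degree (assumed for the example): T-1 .. 2T-1 keys per node


class BNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


def split_child(parent, i):
    """Split the full child parent.children[i]; its median moves up."""
    child = parent.children[i]
    right = BNode(leaf=child.leaf)
    median = child.keys[T - 1]
    right.keys = child.keys[T:]
    child.keys = child.keys[:T - 1]
    if not child.leaf:
        right.children = child.children[T:]
        child.children = child.children[:T]
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, right)


def insert(root, key):
    """Insert key in a single downward pass; return the (possibly new) root."""
    if len(root.keys) == 2 * T - 1:
        # A full root is split first, which is how the tree grows in height.
        new_root = BNode(leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    node = root
    while not node.leaf:
        i = bisect.bisect_right(node.keys, key)
        if len(node.children[i].keys) == 2 * T - 1:
            # Split pre-emptively: node itself is known not to be full here.
            split_child(node, i)
            if key > node.keys[i]:
                i += 1
        node = node.children[i]
    bisect.insort(node.keys, key)  # the leaf is guaranteed to have room
    return root


def inorder(node):
    if node.leaf:
        return list(node.keys)
    out = []
    for i, k in enumerate(node.keys):
        out += inorder(node.children[i]) + [k]
    return out + inorder(node.children[-1])
```

The loop never looks at a node again after leaving it, which is the whole point: every node the descent passes through is made non-full before stepping into it, so a median pushed up from below always has somewhere to go.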
I've implemented B-trees several times, and have never split nodes on the way down.
Usually I do the insert recursively, such that node->insert(key,data) can return a new key to insert in the parent. The parent calls insert on the child node, and if the child splits it returns a new key to the parent. If the parent splits, it returns a key to its parent, and so on.
I've found that the insert implementation can stay pretty clean this way.
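A sketch of that recursive shape in Python (not the answerer's actual code; `MAX_KEYS`, `BNode` and `tree_insert` are names invented here, and the node stores keys only, without the data payload):

```python
import bisect

MAX_KEYS = 4  # hypothetical capacity; a node splits once it exceeds this


class BNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty for leaves

    def insert(self, key):
        """Insert key below this node.

        Returns None, or (median, right_sibling) if this node had to split.
        """
        i = bisect.bisect_left(self.keys, key)
        if not self.children:
            self.keys.insert(i, key)
        else:
            promoted = self.children[i].insert(key)
            if promoted is not None:
                median, right = promoted
                self.keys.insert(i, median)
                self.children.insert(i + 1, right)
        if len(self.keys) <= MAX_KEYS:
            return None
        # Overflow: keep the left half here, hand the rest to the caller.
        mid = len(self.keys) // 2
        median = self.keys[mid]
        right = BNode(self.keys[mid + 1:], self.children[mid + 1:])
        self.keys = self.keys[:mid]
        self.children = self.children[:mid + 1]
        return median, right


def tree_insert(root, key):
    """Insert into the tree and return the (possibly new) root."""
    promoted = root.insert(key)
    if promoted is None:
        return root
    median, right = promoted
    return BNode([median], [root, right])


def inorder(node):
    if not node.children:
        return list(node.keys)
    out = []
    for child, key in zip(node.children, node.keys):
        out += inorder(child) + [key]
    return out + inorder(node.children[-1])
```

The call stack does the "going back up" for you: each level only has to handle the optional `(median, right_sibling)` result of the level below it, and a split that reaches the top is turned into a new root by `tree_insert`.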
In a standard implementation of the Rope data structure using splay trees, the nodes would be ordered according to a rank statistic measuring the position of each one from the start of the string, so the keys normally found in binary search tree would be irrelevant, would they not?
I ask because the keys shown in the graphic below (thanks Wikipedia!) are letters, which would presumably become non-unique once the number of nodes exceeded the length of the chosen alphabet. Wouldn't it be better to use integers or avoid using keys altogether?
Separately, can anyone point me to a good implementation of the logic to recompute rank statistics after each operation?
Presumably, if the index for a split falls within the substring attached to a particular node, say, between "Hel" and "llo_" on the node E above, you would remove the substring from E, split it and reattach it as two children of E. Correct?
Finally, after a certain number of such operations, the tree could, I suppose, end up with as many leaves as letters. What would be the best way to keep track of that and prune the tree (by combining substrings) as necessary?
Thanks!
For what it's worth, you can implement a Rope using Splay Trees by attaching a substring to each node of the binary search tree (not just to the leaf nodes as shown above).
The rank of each node is its size plus the size of its left subtree. But when recomputing ranks during splay operations, you need to remember to walk down the node.left.right branch, too.
If each node records a reference to the substring it represents (rather than the actual substring itself), everything runs faster. That way, when a split operation falls within an existing node, you just need to modify the node's attributes to reflect the right part of the substring you want to split, then add another node to represent the left part and merge it with the left subtree.
Done as above, each node records (in addition to its left, right and parent attributes etc.) its rank, its size (in characters) and the location of the first character it represents in the string you're trying to modify. That way, you never actually modify the initial string: you just do your operations on bits of the tree and reproduce the final string when you're ready by walking the tree in order.
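A minimal sketch of that node layout in Python, with the splay rotations and parent pointers left out. The field and function names (`start`, `length`, `total`, `split_node`) are made up for the example, and caching the whole subtree's character count in `total` is an assumption of this sketch; the rank described above is derived from it.

```python
def total(node):
    return node.total if node else 0


def update(node):
    """Recompute the cached character count of the subtree rooted at node."""
    node.total = total(node.left) + node.length + total(node.right)


class RopeNode:
    def __init__(self, start, length, left=None, right=None):
        self.start = start    # index of this node's first character in the source
        self.length = length  # number of characters this node itself represents
        self.left = left
        self.right = right
        self.total = 0
        update(self)


def rank(node):
    """Size of the node plus the size of its left subtree."""
    return total(node.left) + node.length


def append_rightmost(subtree, piece):
    """Attach piece as the last node (in order) of subtree."""
    if subtree is None:
        return piece
    subtree.right = append_rightmost(subtree.right, piece)
    update(subtree)
    return subtree


def split_node(node, offset):
    """Split node's own substring at offset (0 < offset < node.length).

    node keeps the right part; a new node for the left part is merged
    into node's left subtree. No characters are copied.
    """
    assert 0 < offset < node.length
    left_piece = RopeNode(node.start, offset)
    node.start += offset
    node.length -= offset
    node.left = append_rightmost(node.left, left_piece)
    update(node)
    return left_piece


def to_string(node, source):
    """Reproduce the text by walking the tree in order."""
    if node is None:
        return ""
    piece = source[node.start:node.start + node.length]
    return to_string(node.left, source) + piece + to_string(node.right, source)
```

In a real splay-based rope, `update` would also be called on the nodes touched by each rotation, in the same way as after the split here.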
The idea of deleting a node in BST is:
If the node has no child, delete it and update the parent's pointer to this node as null
If the node has one child, replace the node with its child by updating the node's parent's pointer to point to that child
If the node has two children, find the predecessor of the node and replace the node with its predecessor; also update the predecessor's parent's pointer to point to the predecessor's only child (which can only be a left child)
The last case can also be done using the successor instead of the predecessor.
It's said that if we use the predecessor in some cases and the successor in others (giving them equal priority), we can get better empirical performance.
Now the question is: how is it done? Based on what strategy? And how does it affect performance? (I guess by performance they mean time complexity.)
What I think is that we have to choose the predecessor or the successor so as to get a more balanced tree, but I don't know how to choose which one to use.
One solution is to choose one of them at random (with fair randomness), but isn't it better to base the strategy on the tree structure? The question is WHEN to choose WHICH.
The thing is, this is a fundamental problem: finding a correct removal algorithm for a BST. People have been trying to solve it for 50 years (just like in-place merge), and they haven't found anything better than the usual algorithm (removal via the predecessor or successor). So what is wrong with the classic algorithm? This kind of removal unbalances the tree. After enough random add/remove operations you get an unbalanced tree with height on the order of sqrt(n). And it doesn't matter which you chose, removing the successor or the predecessor (or choosing randomly between the two): the result is the same.
So, what to choose? I'm guessing that random (successor or predecessor) deletion will postpone the unbalancing of your tree. But if you want a properly balanced tree, you have to use red-black trees or something like that.
As you said, it's a question of balance, so in general the method that disturbs the balance the least is preferable. You can keep some metrics to measure the level of balance (e.g., the difference between the maximal and minimal leaf height, the average height, etc.), but I'm not sure whether the overhead is worth it. Also, there are self-balancing data structures (red-black trees, AVL trees, etc.) that mitigate this problem by rebalancing after each deletion. If you want to use the basic BST, I suppose the best strategy without a priori knowledge of the tree structure and the deletion sequence would be to toggle between the two methods on each deletion.
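A small sketch of the toggling strategy in Python, assuming distinct keys (the class and attribute names are made up for the example). The flag is only flipped when a two-child deletion actually happens:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


class BST:
    def __init__(self):
        self.root = None
        self._use_successor = True  # flipped after each two-child deletion

    def insert(self, key):
        self.root = self._insert(self.root, key)

    def _insert(self, node, key):
        if node is None:
            return Node(key)
        if key < node.key:
            node.left = self._insert(node.left, key)
        elif key > node.key:
            node.right = self._insert(node.right, key)
        return node

    def delete(self, key):
        self.root = self._delete(self.root, key)

    def _delete(self, node, key):
        if node is None:
            return None
        if key < node.key:
            node.left = self._delete(node.left, key)
        elif key > node.key:
            node.right = self._delete(node.right, key)
        elif node.left is None:   # no child, or only a right child
            return node.right
        elif node.right is None:  # only a left child
            return node.left
        elif self._use_successor:
            self._use_successor = False
            succ = node.right
            while succ.left:
                succ = succ.left
            node.key = succ.key
            # succ has no left child, so this removal hits a simple case
            node.right = self._delete(node.right, succ.key)
        else:
            self._use_successor = True
            pred = node.left
            while pred.right:
                pred = pred.right
            node.key = pred.key
            # pred has no right child, so this removal hits a simple case
            node.left = self._delete(node.left, pred.key)
        return node

    def inorder(self):
        out = []

        def walk(node):
            if node:
                walk(node.left)
                out.append(node.key)
                walk(node.right)

        walk(self.root)
        return out
```

Swapping the boolean flag for a coin flip gives the randomized variant discussed in the question. Neither version guarantees balance; for that you still need one of the self-balancing trees.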
For a B+ tree insertion why would you traverse down the tree then back upwards splitting the parents?
Wikipedia suggests this method of insertion:
Perform a search to determine what bucket the new record should go into.
If the bucket is not full (at most b - 1 entries after the insertion), add the record.
Otherwise, split the bucket.
Allocate new leaf and move half the bucket's elements to the new bucket.
Insert the new leaf's smallest key and address into the parent.
If the parent is full, split it too.
Add the middle key to the parent node.
Repeat until a parent is found that need not split.
If the root splits, create a new root which has one key and two pointers.
Why would you traverse down the tree and then go back up performing the splits? Why not split the nodes as you encounter them on the way down?
To me, the proposed method performs twice the work and requires more bookkeeping as well.
Can anyone explain why this is the preferred method for insertion as opposed to splitting on the way down and what the disadvantages are for inserting during the traversal?
You have to backtrack up the tree because you don't actually know whether a split is required at the lowest level until you get there.
It's all there in the phrase "If the bucket is not full, ...".
You should also be aware that it's nowhere near twice the work. Since you're remembering all sorts of stuff on the way down (node pointers, indexes within the node, and so on), there's not as much calculation or searching on the way back up.
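To make the "remembering stuff on the way down" concrete, here is a rough Python sketch of a B+ tree insert that records the path during the descent and then walks back up only while nodes overflow. It assumes distinct keys, stores keys only (no record addresses), and omits the leaf-to-leaf links; `ORDER`, `Node` and `insert` are names invented for the example.

```python
import bisect

ORDER = 4  # hypothetical: a node may hold at most ORDER - 1 keys


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # used by internal nodes only


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    # Way down: remember each parent and the child index that was taken.
    path = []
    node = root
    while not node.leaf:
        i = bisect.bisect_right(node.keys, key)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, key)

    # Way back up: no searching, just pop the remembered positions.
    while len(node.keys) > ORDER - 1:
        mid = len(node.keys) // 2
        right = Node(node.leaf)
        sep = node.keys[mid]
        if node.leaf:
            # Leaf split: the separator is copied up and stays in the right leaf.
            right.keys = node.keys[mid:]
            node.keys = node.keys[:mid]
        else:
            # Internal split: the middle key moves up and leaves this level.
            right.keys = node.keys[mid + 1:]
            right.children = node.children[mid + 1:]
            node.keys = node.keys[:mid]
            node.children = node.children[:mid + 1]
        if not path:
            # The root itself split: create a new root with one key, two pointers.
            root = Node(leaf=False)
            root.keys = [sep]
            root.children = [node, right]
            break
        parent, i = path.pop()
        parent.keys.insert(i, sep)
        parent.children.insert(i + 1, right)
        node = parent
    return root


def leaf_keys(node):
    """Collect all keys stored in the leaves, left to right."""
    if node.leaf:
        return list(node.keys)
    out = []
    for child in node.children:
        out += leaf_keys(child)
    return out
```

In the common case the target leaf has room, the second loop never runs, and the insert costs one descent plus one in-leaf insertion.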