size for sub-trees after split in AVL tree - data-structures

I am trying to implement a split method for an AVL tree (given a node with key X, split the tree), but since I keep a size field only on the AVL tree class itself, I can't find a way to determine the size of each sub-tree after the split. I considered giving every node its own size field, but that solution is too complicated for now because I would have to edit a lot of the code I have already written.
I would be glad for a solution (if one exists under those conditions) for finding the size of each sub-tree after the split in no more than O(log n) time. Thank you!

It is not possible to determine the size of each side of the split in sub-linear time without attaching additional data (such as a subtree size) to each node.
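If per-node sizes were available, the split sizes would fall out of the search path itself. Below is a minimal Java sketch of that idea; the Node class and field names are illustrative and not taken from your code.

```java
class AvlSplitSize {
    static class Node {
        int key;
        int size = 1;        // number of nodes in the subtree rooted here
        Node left, right;
    }

    static int size(Node n) { return n == null ? 0 : n.size; }

    // Size of the "left" tree produced by a split at key x (all keys <= x),
    // computed by walking the search path and summing the sizes of the subtrees
    // that fall entirely to the left of that path: O(log n) in an AVL tree.
    static int leftSplitSize(Node root, int x) {
        int count = 0;
        for (Node cur = root; cur != null; ) {
            if (x < cur.key) {
                cur = cur.left;
            } else {                          // cur.key <= x: this node and its left subtree go left
                count += size(cur.left) + 1;
                cur = cur.right;
            }
        }
        return count;
    }
    // The right tree's size is then the total size minus leftSplitSize(root, x).
}
```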

Related

Is there a balanced BST where each node maintains the subtree size?

Is there a balanced BST structure that also keeps track of subtree size in each node?
In Java, TreeMap is a red-black tree, but it doesn't provide the subtree size in each node.
Previously, I wrote a BST that keeps track of the subtree size of each node, but it isn't balanced.
The questions are:
Is it possible to implement such a tree while keeping the basic operations at O(lg n)?
If yes, are there any third-party libraries that provide such an implementation?
A Java implementation is great, but other languages (e.g. C, Go) would also be helpful.
BTW:
The subtree size should be maintained in each node,
so that the size can be obtained without traversing the subtree.
Possible application:
Keeping track of the rank of items whose value (which the rank depends on) might change on the fly.
The Weight Balanced Tree (also called the Adams Tree, or Bounded Balance tree) keeps the subtree size in each node.
This also makes it possible to find the Nth element, from the start or end, in log(n) time.
My implementation in Nim is on github. It has properties:
Generic (parameterized) key,value map
Insert (add), lookup (get), and delete (del) in O(log(N)) time
Key-ordered iterators (inorder and revorder)
Lookup by relative position from beginning or end (getNth) in O(log(N)) time
Get the position (rank) by key in O(log(N)) time
Efficient set operations using tree keys
Map extensions to set operations with optional value merge control for duplicates
There are also implementations in Scheme and Haskell available.
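For illustration, here is a minimal Java sketch (not the Nim implementation linked above) of how a per-node size field makes the Nth-element lookup an O(log N) walk; the class and field names are assumptions made for this example.

```java
class NthLookup {
    static class Node {
        int key;
        int size = 1;            // nodes in the subtree rooted here
        Node left, right;
    }

    static int size(Node n) { return n == null ? 0 : n.size; }

    // Return the node holding the n-th smallest key (0-based), or null if n is out of range.
    static Node getNth(Node root, int n) {
        Node cur = root;
        while (cur != null) {
            int leftSize = size(cur.left);
            if (n < leftSize) {
                cur = cur.left;              // the answer lies in the left subtree
            } else if (n == leftSize) {
                return cur;                  // this node is the n-th element
            } else {
                n -= leftSize + 1;           // skip the left subtree and this node
                cur = cur.right;
            }
        }
        return null;
    }
}
```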
That's called an "order statistic tree": https://en.wikipedia.org/wiki/Order_statistic_tree
It's pretty easy to add the size to any kind of balanced binary tree (red-black, AVL, B-tree, etc.), or you can use a balancing algorithm that works with the size directly, like weight-balanced trees (see DougCurrie's answer) or (better) size-balanced trees: https://cs.wmich.edu/gupta/teaching/cs4310/lectureNotes_cs4310/Size%20Balanced%20Tree%20-%20PEGWiki%20sourceMayNotBeFullyAuthentic%20but%20description%20ok.pdf
Unfortunately, I don't think there are any standard-library implementations, but you can find open source if you look for it. You may want to roll your own.
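As a hedged Java sketch of what "adding the size to any balanced binary tree" amounts to in practice: the size fields are recomputed bottom-up after each structural change, such as a rotation. The names here are made up for this example and match no particular library.

```java
class SizeMaintenance {
    static class Node {
        int key;
        int size = 1;
        Node left, right;
    }

    static int size(Node n) { return n == null ? 0 : n.size; }

    // Recompute a node's size from its children; called after any structural change.
    static void pull(Node n) {
        n.size = 1 + size(n.left) + size(n.right);
    }

    // A standard left rotation with size maintenance: only x and y change shape,
    // so only their sizes need recomputing, in bottom-up order (x before y).
    static Node rotateLeft(Node x) {
        Node y = x.right;
        x.right = y.left;
        y.left = x;
        pull(x);
        pull(y);
        return y;
    }
}
```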

Implementing the Rope data structure using binary search trees (splay trees)

In a standard implementation of the Rope data structure using splay trees, the nodes would be ordered according to a rank statistic measuring the position of each one from the start of the string, so the keys normally found in a binary search tree would be irrelevant, would they not?
I ask because the keys shown in the graphic below (thanks Wikipedia!) are letters, which would presumably become non-unique once the number of nodes exceeded the length of the chosen alphabet. Wouldn't it be better to use integers or avoid using keys altogether?
Separately, can anyone point me to a good implementation of the logic to recompute rank statistics after each operation?
Presumably, if the index for a split falls within the substring attached to a particular node, say, between "Hel" and "llo_" on the node E above, you would remove the substring from E, split it and reattach it as two children of E. Correct?
Finally, after a certain number of such operations, the tree could, I suppose, end up with as many leaves as letters. What would be the best way to keep track of that and prune the tree (by combining substrings) as necessary?
Thanks!
For what it's worth, you can implement a Rope using Splay Trees by attaching a substring to each node of the binary search tree (not just to the leaf nodes as shown above).
The rank of each node is its size plus the size of its left subtree. But when recomputing ranks during splay operations, you need to remember to walk down the node.left.right branch, too.
If each node records a reference to the substring it represents (rather than the actual substring itself), everything runs faster. That way, when a split operation falls within an existing node, you just need to modify the node's attributes to reflect the right part of the substring you want to split, then add another node to represent the left part and merge it with the left subtree.
Done as above, each node records (in addition to its left, right and parent attributes, etc.) its rank, its size (in characters) and the location of the first character it represents in the string you're trying to modify. That way, you never actually modify the initial string: you just do your operations on bits of the tree and reproduce the final string when you're ready by walking it in order.
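For what it's worth, here is a minimal Java sketch of the per-node bookkeeping described above; the fields and the splitWithin helper are illustrative assumptions, not a reference Rope implementation.

```java
class RopeSketch {
    // Each node views a slice of the original, never-modified string.
    static class RopeNode {
        int offset, length;      // slice of the original string this node represents
        int weight;              // this node's length plus the total length of its left subtree
        RopeNode left, right, parent;

        RopeNode(int offset, int length) {
            this.offset = offset;
            this.length = length;
            this.weight = length;    // no left subtree yet
        }
    }

    // Splitting at position k inside this node's slice: shrink the node so it
    // represents only the right part, and return a new node for the left part.
    // The caller then merges the returned node into the left subtree and
    // recomputes the weights along the affected path.
    static RopeNode splitWithin(RopeNode node, int k) {
        RopeNode leftPart = new RopeNode(node.offset, k);
        node.offset += k;            // node now starts where the cut was made
        node.length -= k;
        return leftPart;
    }
}
```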

Is there only one correct answer in heapsort?

If I start with an empty heap representing a priority queue, insert numbers into it in order, and then draw the result as a binary tree, is there only one strictly correct answer? I have tried different Java heap generators and they all give me different answers.
If you mean the sorted representation, then it's obviously unique.
If you mean the binary tree representation, then yes, it's unique too: a heap is a complete binary tree, i.e. one in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible.
After any heap operation, be it insert, delete-max, build-heap, sift-up or sift-down, the heap remains in a predictable state: we know exactly what its binary tree representation will look like.
Can you give more details about how you got different answers?
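For concreteness, a small self-contained Java example: inserting the same keys in the same order with the textbook sift-up always produces the same array, and therefore the same complete-tree picture. The values here are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

class MaxHeapDemo {
    static void insert(List<Integer> heap, int key) {
        heap.add(key);
        int i = heap.size() - 1;
        while (i > 0) {                        // sift up while larger than the parent
            int parent = (i - 1) / 2;
            if (heap.get(i) <= heap.get(parent)) break;
            int tmp = heap.get(i);
            heap.set(i, heap.get(parent));
            heap.set(parent, tmp);
            i = parent;
        }
    }

    public static void main(String[] args) {
        List<Integer> heap = new ArrayList<>();
        for (int key : new int[]{3, 9, 2, 7, 5}) insert(heap, key);
        System.out.println(heap);              // always prints [9, 7, 2, 3, 5]
    }
}
```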

Fair deletion of nodes in Binary Search Tree

The idea of deleting a node in BST is:
If the node has no child, delete it and set its parent's pointer to it to null.
If the node has one child, replace the node with its child by updating the node's parent's pointer to point to that child.
If the node has two children, find the node's predecessor, replace the node with it, and update the predecessor's parent's pointer to point to the predecessor's only possible child (which can only be a left child).
The last case can also be done with a successor instead of a predecessor!
It's said that if we use the predecessor in some cases and the successor in others (giving them equal priority), we get better empirical performance.
Now the questions are: how is that done, based on what strategy, and how does it affect performance? (I guess that by performance they mean time complexity.)
What I think is that we have to choose the predecessor or successor so as to keep the tree more balanced, but I don't know how to decide which one to use.
One solution is to choose one of them at random (fair randomness), but isn't it better to base the strategy on the tree structure? The question remains: WHEN to choose WHICH?
The thing is that this is a fundamental problem: finding a correct removal algorithm for a BST. People have been trying to solve it for 50 years (just like in-place merge) and have found nothing better than the usual algorithm (removal via the predecessor/successor). So, what is wrong with the classic algorithm? It unbalances the tree: after a number of random add/remove operations you end up with an unbalanced tree of height sqrt(n). And it doesn't matter what you choose, removing the successor or the predecessor (or choosing randomly between the two), the result is the same.
So, what should you choose? My guess is that randomly choosing between successor and predecessor will postpone the unbalancing of your tree. But if you want a perfectly balanced tree, you have to use red-black trees or something like that.
As you said, it's a question of balance, so in general the method that disturbs the balance the least is preferable. You could keep some metrics to measure the level of balance (e.g., the difference between the maximal and minimal leaf heights, the average height, etc.), but I'm not sure whether the overhead is worth it. Also, there are self-balancing data structures (red-black trees, AVL trees, etc.) that mitigate this problem by rebalancing after each deletion. If you want to use a basic BST, I suppose the best strategy without a priori knowledge of the tree structure and the deletion sequence would be to toggle between the two methods on each deletion.
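To make the toggling idea concrete, here is a hedged Java sketch of a plain (unbalanced) BST delete that alternates between predecessor and successor in the two-child case; all names are illustrative.

```java
class BstNode {
    int key;
    BstNode left, right;
    BstNode(int key) { this.key = key; }
}

class FairDelete {
    private boolean usePredecessor = false;   // flipped on every two-child delete

    BstNode delete(BstNode root, int key) {
        if (root == null) return null;
        if (key < root.key) {
            root.left = delete(root.left, key);
        } else if (key > root.key) {
            root.right = delete(root.right, key);
        } else if (root.left == null) {
            return root.right;                // zero or one child: splice the node out
        } else if (root.right == null) {
            return root.left;
        } else {
            usePredecessor = !usePredecessor; // alternate to spread the imbalance
            if (usePredecessor) {
                BstNode pred = root.left;     // largest key in the left subtree
                while (pred.right != null) pred = pred.right;
                root.key = pred.key;
                root.left = delete(root.left, pred.key);
            } else {
                BstNode succ = root.right;    // smallest key in the right subtree
                while (succ.left != null) succ = succ.left;
                root.key = succ.key;
                root.right = delete(root.right, succ.key);
            }
        }
        return root;
    }
}
```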

What are the advantages of storing all elements in the leaf nodes?

I'm reading Advanced Data Structures by Peter Brass.
At the beginning of the chapter on search trees, he states that there are two models of search trees: one where the nodes contain the actual objects (the values, if the tree is used as a dictionary), and another where all objects are stored in the leaves and the internal nodes are used only for comparisons.
What are the advantages of the second model over the first one?
One of the big advantages of a binary tree where data is only in the leaf nodes is that you can partition based on elements that are not in your dataset.
For example, if my possible dataset is 0 to 1 million but the vast majority of items are at either the high end or the low end and not in the middle, I may still want my first comparison to be against 500,000, even though that number is not in my dataset. If every node had to hold data, I could not do this. While not normally needed in theory, I've run into many cases where partitioning on a value outside my data simplified the implementation.
B+ trees are an example of a case where all key/value pairs are stored in the leaf nodes. The primary advantage here is that, since all items are in the leaf nodes, the leaf nodes can be linked together to form a linked list, which allows rapid in-order traversal. If you access a particular element, you can always find the next element in the sequence without visiting any parents, because the leaf nodes are linked together. Filesystems and database storage systems take advantage of this structure for range searches and similar queries.
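As a rough illustration (not a real B+ tree: fan-out, node splitting and the internal index are all omitted), here is a Java sketch of why linked leaves make range scans cheap once the starting leaf has been found.

```java
import java.util.ArrayList;
import java.util.List;

class Leaf {
    List<Integer> keys = new ArrayList<>();   // keys stored in ascending order
    Leaf next;                                // link to the neighbouring leaf
}

class LeafScan {
    // Collect every key in [from, to], starting from the leaf that contains `from`,
    // by following the leaf links instead of going back up through parents.
    static List<Integer> rangeScan(Leaf start, int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (Leaf leaf = start; leaf != null; leaf = leaf.next) {
            for (int key : leaf.keys) {
                if (key > to) return result;  // past the end of the range
                if (key >= from) result.add(key);
            }
        }
        return result;
    }
}
```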
Let's say you are building a tree over some objects using some complex criterion, for example one calculated from multiple properties. Sometimes you can't change the object to store the calculated value, and computing the criterion is expensive. So you compute it only once and store the objects in the leaves based on the result. Then, when your tree is complete, you can find the required object much faster because you don't have to recompute the criterion for every tree node on your path.
Well, storing information objects in the internal nodes (in this case we are talking about a trie) is useful for fast retrieval of information: faster than storing things in an array/hash table, where the worst-case access is O(n); in a trie it is O(m), where m is the length of the key.
Look here:
https://en.wikipedia.org/wiki/Trie
In a search tree these operations can be much more complicated (see AVL trees, O(log n)), so they can be slower and are more complex to implement.
Which data structure should you choose?
Well, that depends on what you want to do.
