Splitting a red-black tree destructively? - data-structures

I would like to implement a priority queue using a red-black tree. Using a binary heap is O(log n) worst case for deletion, and I will be removing many keys from the queue at once, so I want O(log n) worst case for the bulk deletion, rather than O(m log n) worst case, where m is the number of keys being removed at once. I will probably only be removing a minority of the keys.
I will not need the old tree anymore. How can I split a red-black tree destructively (which apparently can somehow be done in O(log n)) to accomplish this, while maintaining the black height invariant?

There is an implementation of the algorithm you need in the archive at
https://github.com/CGAL/cgal/releases/download/releases%2FCGAL-5.0.3/CGAL-5.0.3.zip
in the file include/CGAL/Multiset.h, starting at line 2617. The function is Multiset<Type, Compare, Allocator>::split.
The algorithm is also described in the paper at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.4875&rep=rep1&type=pdf
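To make the idea of a destructive split concrete, here is a minimal sketch. It is not the red-black algorithm from CGAL or the paper: it swaps in a treap, because a treap split fits in a few lines and needs no recoloring. The names Node, split, merge, insert and in_order are hypothetical. The point it illustrates is the same: only the nodes on one root-to-leaf path are relinked, so removing every key below a threshold costs time proportional to the tree height rather than to the number of keys removed.

```python
import random


class Node:
    """Treap node: BST order on key, max-heap order on a random priority."""

    def __init__(self, key):
        self.key = key
        self.priority = random.random()
        self.left = None
        self.right = None


def split(root, key):
    """Destructively split a treap into (keys < key, keys >= key).

    The original tree is consumed: its nodes are relinked into the two
    results. Only one root-to-leaf path is visited.
    """
    if root is None:
        return None, None
    if root.key < key:
        # root and its left subtree belong to the "small" side.
        small, large = split(root.right, key)
        root.right = small
        return root, large
    # root and its right subtree belong to the "large" side.
    small, large = split(root.left, key)
    root.left = large
    return small, root


def merge(a, b):
    """Join two treaps where every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.priority > b.priority:
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b


def insert(root, key):
    """Insert key by splitting around it and merging the three pieces."""
    small, large = split(root, key)
    return merge(merge(small, Node(key)), large)


def in_order(root):
    """Return the keys in sorted order (used here only to inspect results)."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for k in [50, 20, 70, 10, 40, 60, 30]:
    root = insert(root, k)

# Bulk-remove every key below 35 in a single split.
removed, root = split(root, 35)
```

After the split, removed holds the keys 10, 20 and 30, and root holds the rest. A red-black version follows the same path-based idea but must also rejoin the pieces while restoring the black-height invariant, which is what the CGAL code handles.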

Related

delete subtree from bst and balance the tree in logn time

Is it possible to perform m insert and delete operations on a balanced binary search tree such that a delete operation removes a node and the whole subtree below it, and then rebalances the tree? The whole process should be done in amortized O(log n) time per step.
Short answer: Yes it is possible.
What you are describing is a self-balancing binary tree, like an AVL tree or a Red-Black tree. Both take O(log n) for deletion, which includes reordering of the nodes. Here is a link to a page describing such trees and how they work in much more detail than I can, including illustrations. You can also check out the Wikipedia page on AVL trees; it has a decent explanation as well as animations of the insertions. Here is a short version of what you were most interested in:
Deletion in an AVL tree is O(log n), on average and in the worst case. The rebalancing that follows is done with rotations and takes at most O(log n), often only O(1). Rotations are well explained in both sources.
The Wikipedia page also includes some code there, if you need to implement it.
EDIT:
For removing a subtree, you will still be able to do the same thing. Here is a link to a very good explanation of this. Short version: deleting the subtree can be done in O(log n) (keep in mind that the deletion, regardless of the number of nodes deleted, is still O(log n) as long as you do not rebalance the tree at the same time), and then the tree rebalances itself using rotations. This can also change the root of your tree. Removing a whole subtree is of course going to create a bigger height difference than deleting a single node at the bottom of the tree. Still, the tree can rebalance itself with rotations by finding the first imbalanced node and then applying the AVL rebalancing scheme. Because only rotations are used, this should still all be O(log n). Here you will find how the tree rebalances itself after a deletion that creates a height imbalance.
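As a rough illustration of the rotations mentioned above, here is a minimal sketch of a single left rotation with height bookkeeping. The Node class and the helper names are hypothetical, and this shows one rotation only, not the full AVL rebalancing scheme.

```python
def height(node):
    """Height of a subtree; an empty subtree has height 0."""
    return node.height if node is not None else 0


class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.height = 1 + max(height(left), height(right))


def update_height(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    """Left height minus right height; AVL requires this to be -1, 0 or 1."""
    return height(node.left) - height(node.right)


def rotate_left(x):
    """Lift x's right child y above x and return y as the new subtree root."""
    y = x.right
    x.right = y.left
    y.left = x
    # x is now below y, so its height must be recomputed first.
    update_height(x)
    update_height(y)
    return y


# A right-leaning chain 1 -> 2 -> 3 violates the AVL balance condition.
chain = Node(1, right=Node(2, right=Node(3)))
before = balance_factor(chain)
root = rotate_left(chain)
after = balance_factor(root)
```

Here the chain starts with a balance factor of -2, and after one rotation the node with key 2 is the root with a balance factor of 0.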

Why do we sort via Heaps instead of Binary Search Trees?

A heap can be constructed from a list in O(n logn) time, because inserting an element into a heap takes O(logn) time and there are n elements.
Similarly, a binary search tree can be constructed from a list in O(n logn) time, because inserting an element into a BST takes on average logn time and there are n elements.
Traversing a heap from min-to-max takes O(n logn) time (because we have to pop n elements, and each pop requires an O(logn) sink operation). Traversing a BST from min-to-max takes O(n) time (literally just inorder traversal).
So, it appears to me that constructing both structures takes equal time, but BSTs are faster to iterate over. So, why do we use "Heapsort" instead of "BSTsort"?
Edit: Thank you to Tobias and lrlreon for your answers! In summary, below are the points why we use heaps instead of BSTs for sorting.
Construction of a heap can actually be done in O(n) time, not O(nlogn) time. This makes heap construction faster than BST construction.
Additionally, arrays can be easily transformed into heaps in-place, because heaps are always complete binary trees. BSTs can't be easily implemented as an array, since BSTs are not guaranteed to be complete binary trees. This means that BSTs require additional O(n) space allocation to sort, while Heaps require only O(1).
All operations on heaps are guaranteed to be O(logn) time. BSTs, unless balanced, may have O(n) operations. Heaps are dramatically simpler to implement than Balanced BSTs are.
If you need to modify a value after creating the heap, all you need to do is apply the sink or swim operations. Modifying a value in a BST is much more conceptually difficult.
There are multiple reasons I can imagine you would want to prefer a (binary) heap over a search tree:
Construction: A binary heap can actually be constructed in O(n) time by applying the heapify operations bottom-up from the smallest to the largest subtrees.
Modification: All operations of the binary heap are rather straightforward:
Inserted an element at the end? Sift it up until the heap condition holds
Swapped the last element to the beginning? Sift it down until the heap condition holds
Changed the key of an entry? Sift it up or down depending on the direction of the change
Conceptual simplicity: Due to its implicit array representation, a binary heap can be implemented by anyone who knows the basic indexing scheme (2i+1, 2i+2 are the children of i) without considering many difficult special cases.
If you look at these operations in a binary search tree, in theory they are also quite simple, but the tree has to be stored explicitly, e.g. using pointers, and most of the operations require the tree to be rebalanced to preserve the O(log n) height, which requires complicated rotations (red-black trees) or splitting/merging nodes (B-trees).
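The sift operations and the indexing scheme listed above can be sketched as follows. This is a minimal min-heap on a Python list; the function names sift_up, sift_down, push and pop_min are hypothetical.

```python
def sift_up(heap, i):
    """Move heap[i] up until its parent is no larger than it."""
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent


def sift_down(heap, i):
    """Move heap[i] down until neither child is smaller than it."""
    n = len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2  # children of index i
        smallest = i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest


def push(heap, value):
    """Insert at the end, then sift up."""
    heap.append(value)
    sift_up(heap, len(heap) - 1)


def pop_min(heap):
    """Swap the last element to the front, remove the old root, sift down."""
    heap[0], heap[-1] = heap[-1], heap[0]
    smallest = heap.pop()
    if heap:
        sift_down(heap, 0)
    return smallest


heap = []
for v in [9, 4, 7, 1, 8, 2]:
    push(heap, v)
drained = [pop_min(heap) for _ in range(6)]
```

Popping all six values returns them in ascending order. There is no rebalancing step anywhere, because the array layout keeps the tree complete by construction.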
EDIT: Storage: As Irleon pointed out, to store a BST you also need more storage, as at least two child pointers need to be stored for every entry in addition to the value itself, which can be a large storage overhead especially for small value types. At the same time, the heap needs no additional pointers.
To answer your question about sorting: a BST takes O(n) time to traverse in-order, but the construction process takes O(n log n) operations which, as mentioned before, are much more complex.
At the same time, Heapsort can actually be implemented in-place by building a max-heap from the input array in O(n) time and then repeatedly swapping the maximum element to the back and shrinking the heap. You can think of Heapsort as Selection sort with a helpful data structure that lets you find the next maximum in O(log n) time.
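The in-place procedure described above can be sketched like this. It is a minimal version under the usual 2i+1, 2i+2 indexing; the function names are hypothetical.

```python
def sift_down(a, i, n):
    """Sink a[i] within the prefix a[:n] until neither child is larger."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest


def heapsort(a):
    """Sort the list a in ascending order, in place."""
    n = len(a)
    # Bottom-up construction: leaves are already heaps, so start at the
    # last internal node and work back to the root. This phase is O(n).
    for i in range(n // 2 - 1, -1, -1):
        sift_down(a, i, n)
    # Repeatedly move the current maximum to the back and shrink the heap.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)


values = [5, 3, 8, 1, 9, 2]
heapsort(values)
```

The only storage used besides the input list is a handful of index variables, which is the O(1) extra space mentioned in the question's summary.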
If the sorting method consists of storing the elements in a data structure and then extracting them in sorted order, then, although both approaches (heap and BST) have the same asymptotic complexity O(n log n), the heap tends to be faster. The reason is that the heap is always a perfectly balanced tree and its operations are always O(log n), deterministically, not just on average. With BSTs, insertion and deletion tend to take more time than with a heap, no matter which balancing approach is used. In addition, a heap is usually implemented with an array storing the level-order traversal of the tree, without the need to store any pointers. Thus, if you know the number of elements, which is usually the case, the extra storage required for a heap is less than that used for a BST.
In the case of sorting an array, there is a very important reason to prefer a heap over a BST: you can use the same array for storing the heap, with no need for additional memory.

Priority queue with O(1) delete-max, insert-min, and find-min and O(log n) insertion and deletion

Is it possible for a monotonic priority queue to have:
O(1) for finding and deleting the item with the highest priority,
O(1) for inserting an item, assuming the given priority is lower than that of every other item,
O(log n) for inserting and deleting an item without that assumption?
I know how to do this with a linked list if insertion and deletion are allowed to be O(n). I was also thinking of a skip list. However, in the worst case, inserting into and deleting from a skip list is O(n).
Decrease-key is not required.
In an amortized sense, red-black trees have this property. In the worst-case, you can use one of many finger-tree designs, like Fleischer's "A Simple Balanced Search Tree with O(1) Worst Case Update Time."
I wrote a long overview of how these things work.

LSM Tree lookup time

What's the worst case time complexity in a log-structured merge tree for a simple search query (like querying a single WHERE clause)?
Is it O(log N)? O(N*Log N)? Something else?
How about for a multiple query, like searching for multiple WHERE clauses in a key-value database?
The wikipedia page on LSM trees is currently lacking this info.
And I'm trying to make sense of the original paper.
I have been wondering the same.
If you have a series of trees, getting smaller by a constant factor each time, and you need to search them all for a single key, the cost seems to be O(log(N)^2).
Say the first (binary) tree takes log_2(N) branches to reach a node. The second might be half the size, and take (log_2(N) - 1) branches to find a node. The smallest tree will be some O(1) constant in size and there are roughly log_2(N) trees total. Summing the series gives O(log_2(N)^2).
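Written out, the sum in the previous paragraph looks as follows. This assumes, as above, that tree k holds about N/2^k keys, so a lookup in it costs about log_2(N) - k branches, and that there are L = log_2(N) trees.

```latex
\sum_{k=0}^{L-1} (L - k) \;=\; \sum_{j=1}^{L} j \;=\; \frac{L(L+1)}{2} \;=\; O(\log^2 N),
\qquad L = \log_2 N
```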
However, I'm wondering if there is some more clever scheme where arbitrary single-key lookups, insertions or deletions have amortized cost O(log(N)), but haven't been able to find an answer (yet).
For a simple search indexed by an LSM tree, it is O(log n). This is because the biggest tree in the LSM tree is a B-tree, which is O(log n), and the other trees are subsets of B-trees or, in the case of in-memory trees, more efficient trees, which are no worse than O(log n). The number of trees is a constant, so it doesn't affect the order of the search time.

Implementation of priority queue by AVL Tree data structure

Priority queue:
Basic operations: insertion and deletion (delete the minimum element).
Goal: to provide an efficient running time (order of growth) for the operations above.
Possible implementations of a priority queue:
Linked list: insertion takes O(n) at the end and O(1) at the head. Deletion (finding the minimum and deleting it) takes O(n).
BST: insertion and deletion of the minimum take O(log n) on average and O(n) in the worst case.
AVL tree: insertion, deletion, and searching take O(log n) in all cases.
My confusion is this:
Why don't we use an AVL tree to implement a priority queue? Why do we go for a binary heap, when we know that an AVL tree supports insertion, deletion, and searching in O(log n) in the worst case?
Complexity isn't everything, there are other considerations for actual performance.
For most purposes, most people don't even use an AVL tree as a balanced tree (Red-Black trees are more common as far as I've seen), let alone as a priority queue.
This is not to say that AVL trees are useless, I quite like them. But they do have a relatively expensive insert. What AVL trees are good for (beating even Red-Black trees) is doing lots and lots of lookups without modification. This is not what you need for a priority queue.
As a separate consideration -- never mind your O(log n) insert for a binary heap, a Fibonacci heap has O(1) insert and O(log n) delete-minimum (both amortized). There are a lot of data structures to choose from with slightly different trade-offs, so you wouldn't expect to see everyone just pick the first thing that satisfies your (quite brief) criteria.
A binary heap is not a binary search tree (BST). A BST that is severely unbalanced, or has degenerated into a list, will indeed take O(n) time. Heap operations are always O(log(n)) or better. IIRC Sedgewick claimed O(1) average-time for array-based heaps.
Why not AVL? Because it maintains too much order in the structure. Too much order means too much effort goes into maintaining that order. The less order we can get away with, the better; it will usually translate to faster operations. For example, RBTs are better than AVL trees. RBTs, red-black trees, are almost balanced trees: they save operations while still ensuring O(log(n)) time.
But any search tree is a totally ordered structure, so heaps are generally better, because they only ensure that the minimal element is on top. They are only partially ordered.
Because in a binary heap the minimum element is the root.
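For a concrete view of the last point, here is a small sketch using Python's standard heapq module, which keeps a binary min-heap in a plain list. The task names and priorities are made up for illustration.

```python
import heapq

tasks = []
for priority, name in [(3, "write"), (1, "plan"), (2, "review")]:
    heapq.heappush(tasks, (priority, name))

# The minimum is always at index 0, so peeking costs O(1).
peek = tasks[0]

# Each pop removes the current minimum in O(log n).
order = [heapq.heappop(tasks)[1] for _ in range(3)]
```

The rest of the list is only partially ordered, which is all a priority queue needs.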
