Suppose I have a balanced BST (binary search tree). Each tree node contains a special field count, which counts all descendants of that node + the node itself. They call this data structure order statistics binary tree.
This data structure supports two operations of O(logN):
rank(x) -- number of elements that are less than x
findByRank(k) -- find the node with rank == k
Now I would like to add a new operation median() to find the median. Can I assume this operation is O(1) if the tree is balanced?
Unless the tree is complete, the median might be a leaf node. So in general case the cost will be O(logN). I guess there is a data structure with requested properties and with a O(1) findMedian operation (Perhaps a skip list + a pointer to the median node; I'm not sure about findByRank and rank operations though) but a balanced BST is not one of them.
If the tree is complete (i.e. all levels completely filled), yes you can.
In a balanced order statistics tree, finding the median is O(log N). If it is important to find the median in O(1) time, you can augment the data structure by maintaining a pointer to the median. The catch, of course, is that you would need to update this pointer during each Insert or Delete operation. Updating the pointer would take O(log N) time, but since those operations already take O(log N) time, the extra work of updating the median pointer does not change their big-O cost.
As a practical matter, this only makes sense if you do a lot of "find median" operations compared to the number of insertions/deletions.
If desired, you could reduce the cost of updating the median pointer during Insert/Delete to O(1) by using a (doubly) threaded binary tree, but Insert/Delete would still be O(log N).
Related
A heap can be constructed from a list in O(n logn) time, because inserting an element into a heap takes O(logn) time and there are n elements.
Similarly, a binary search tree can be constructed from a list in O(n logn) time, because inserting an element into a BST takes on average logn time and there are n elements.
Traversing a heap from min-to-max takes O(n logn) time (because we have to pop n elements, and each pop requires an O(logn) sink operation). Traversing a BST from min-to-max takes O(n) time (literally just inorder traversal).
So, it appears to me that constructing both structures takes equal time, but BSTs are faster to iterate over. So, why do we use "Heapsort" instead of "BSTsort"?
Edit: Thank you to Tobias and lrlreon for your answers! In summary, below are the points why we use heaps instead of BSTs for sorting.
Construction of a heap can actually be done in O(n) time, not O(nlogn) time. This makes heap construction faster than BST construction.
Additionally, arrays can be easily transformed into heaps in-place, because heaps are always complete binary trees. BSTs can't be easily implemented as an array, since BSTs are not guaranteed to be complete binary trees. This means that BSTs require additional O(n) space allocation to sort, while Heaps require only O(1).
All operations on heaps are guaranteed to be O(logn) time. BSTs, unless balanced, may have O(n) operations. Heaps are dramatically simpler to implement than Balanced BSTs are.
If you need to modify a value after creating the heap, all you need to do is apply the sink or swim operations. Modifying a value in a BST is much more conceptually difficult.
There are multiple reasons I can imagine you would want to prefer a (binary) heap over a search tree:
Construction: A binary heap can actually be constructed in O(n) time by applying the heapify operations bottom-up from the smallest to the largest subtrees.
Modification: All operations of the binary heap are rather straightforward:
Inserted an element at the end? Sift it up until the heap condition holds
Swapped the last element to the beginning? Swift it down until the heap condition holds
Changed the key of an entry? Sift it up or down depending on the direction of the change
Conceptual simplicity: Due to its implicit array representation, a binary heap can be implemented by anyone who knows the basic indexing scheme (2i+1, 2i+2 are the children of i) without considering many difficult special cases.
If you look at these operations in a binary search tree, in theory
they are also quite simple, but the tree has to be stored explicitly, e.g. using pointers, and most of the operations require the tree to be
rebalanced to preserve the O(log n) height, which requires complicated rotations (red black-trees) or splitting/merging
nodes (B-trees)
EDIT: Storage: As Irleon pointed out, to store a BST you also need more storage, as at least two child pointers need to be stored for every entry in addition to the value itself, which can be a large storage overhead especially for small value types. At the same time, the heap needs no additional pointers.
To answer your question about sorting: A BST takes O(n) time to traverse in-order, the construction process takes O(n log n) operations which, as mentioned before, are much more complex.
At the same time Heapsort can actually be implemented in-place by building a max-heap from the input array in O(n) time and and then repeatedly swapping the maximum element to tbe back and shrinking the heap. You can think of Heapsort as Insertion sort with a helpful data structure that lets you find the next maximum in O(log n) time.
If the sorting method consists of storing the elements in a data structure and after extracting in a sorted way, then, although both approaches (heap and bst) have the same asymptotic complexity O(n log n), the heap tends to be faster. The reason is the heap always is a perfectly balanced tree and its operations always are O(log n), in a determistic way, not on average. With bst's, depending on the approah for balancing, insertion and deletion tend to take more time than the heap, no matter which balancing approach is used. In addition, a heap is usually implemented with an array storing the level traversal of the tree, without the need of storing any kind of pointers. Thus, if you know the number of elements, which usually is the case, the extra storage required for a heap is less than the used for a bst.
In the case of sorting an array, there is a very important reason which it would rather be preferable a heap than a bst: you can use the same array for storing the heap; no need to use additional memory.
Since a Balanced BST would take O(log(n)) time is extracting the max (by extracting I mean both find and delete Max element).
On the other hand Max-heap would also take O(log(n)) time in extracting the max element.
Does anyone of them has cutting edge over the other in Extract-Max operation?
When thinking of data structures, you should take into account their complexities, both of time and space. Here space is the same, so let's focus on time:
Balanced BST:
balanced BST maintains h = O(lg n) ⇒ all operations run in O(lg n) time.
Max-Heap:
Find max O(1), Delete max O(lg n)
which means that their time complexities are also the same.
Also by reading this answer, you draw the same conclusion:
...a max-heap or a binary tree is a good fit.
However, note that balancing an already constructed BST is an O(n) operation (Balancing a BST). So if I had to do that too, then I would go for a max heap.
For advanced usage, read Which data structure to use for accessing min/max in constant-time?
Sources: 1, 2.
Let S be a dynamic set of integers. Let n = |S|. Describe a data structure on S to
support the following operations on S with the required performance guarantees:
• Insert a new element to S in O(log n) time.
• Delete an element from S in O(log n) time.
• Report the k smallest elements of S in O(k) time, for any k satisfying 1 ≤ k ≤ n.
Your structure must consume O(n) space at all times.
could i simply build an AVL tree and then preform in order traversal printing out first 3 elements?
Use a priority queue, which uses O(n) space. Insert and delete are both O(log n) operations. Find-min is O(1), followed by remove-min which is O(k), which you can repeat three times.
Your suggested solution does not work in O(k) time. Starting an in-order traversal takes O(log n) time. Even when you stop after you find one element, you still need to get to the left-most leaf first.
I can think of two solutions:
Use a skip-list, it has O(log n) insertion and deletion and the underlying list is sorted. However, the time complexity of the insertion and deletion is average and not worst-case.
Use a modified AVL-tree, keep a dedicated pointer to the left-most node and update it on insertions and deletions. Also nodes must have pointers to their parent node, so that you could do in-order traversal starting at the left-most node. Or use a threaded tree.
ive decided to store and updated left most node pointer at the root, and pointers to predecessor and successor at every node
thus on every insertion it costs O(logn) for insertions and O(logn)+O(logn) to find both predecessor and successor, O(logn)+O(logn)+O(logn) = O(logn)
these pointer also allow for the update of effected node pointers after insertion in O(1) to travel to predecessor or successor respectively and update there pointers also in O(1)
having pointer to left most node at the root means travelling there is O(1) and then traversing the successor and printing first k nodes in O(k) time.
What's the worst case time complexity in a log-structured merge tree for a simple search query (like querying a single WHERE clause)?
Is it O(log N)? O(N*Log N)? Something else?
How about for a multiple query, like searching for multiple WHERE clauses in a key-value database?
The wikipedia page on LSM trees is currently lacking this info.
And I'm trying to make sense of the original paper.
I have been wondering the same.
If you have a series of trees, getting smaller by a constant factor each time, and you need to search them all for a single key, the cost seems to be O(log(N)^2).
Say the first (binary) tree takes log_2(N) branches to reach a node. The second might be half the size, and take (log_2(N) - 1) branches to find a node. The smallest tree will be some O(1) constant in size and there are roughly log_2(N) trees total. Summing the series gives O(log_2(N)^2).
However, I'm wondering if there is some more clever scheme where arbitrary single-key lookups, insertions or deletions have amortized cost O(log(N)), but haven't been able to find an answer (yet).
For a simple search indexed by a LSM tree, it is O(log n). This is because the biggest tree in the LSM tree is a B tree, which is O(log n), and the other trees are subsets of B trees or in the case of in memory trees, more efficient trees, which are no worse than O(log n). The number of trees is a constant, so it doesn't affect the order of the search time.
We know that heaps and red-black tree both have these properties:
worst-case cost for searching is lgN;
worst-case cost for insertion is lgN.
So, since the implementation and operation of red-black trees is difficult, why don't we just use heaps instead of red-black trees? I am confused.
You can't find an arbitrary element in a heap in O(log n). It takes O(n) to do this. You can find the first element (the smallest, say) in a heap in O(1) and extract it in O(log n). Red-black trees and heaps have quite different uses, internal orderings, and implementations: see below for more details.
Typical use
Red-black tree: storing dictionary where as well as lookup you want elements sorted by key, so that you can for example iterate through them in order. Insert and lookup are O(log n).
Heap: priority queue (and heap sort). Extraction of minimum and insertion are O(log n).
Consistency constraints imposed by structure
Red-black tree: total ordering: left child < parent < right child.
Heap: dominance: parent < children only.
(note that you can substitute a more general ordering than <)
Implementation / Memory overhead
Red-black tree: pointers used to represent structure of tree, so overhead per element. Typically uses a number of nodes allocated on free store (e.g. using new in C++), nodes point to other nodes. Kept balanced to ensure logarithmic lookup / insertion.
Heap: structure is implicit: root is at position 0, children of root at 1 and 2, etc, so no overhead per element. Typically just stored in a single array.
Red Black Tree:
Form of a binary search tree with a deterministic balancing strategy. This Balancing guarantees good performance and it can always be searched in O(log n) time.
Heaps:
We need to search through every element in the heap in order to determine if an element is inside. Even with optimization, I believe search is still O(N). On the other hand, It is best for finding min/max in a set O(1).