How to remove elements from a binary heap? - algorithm

As I understand, binary heap does not support removing random elements. What if I need to remove random elements from a binary heap?
Obviously, I can remove an element and re-arrange the entire heap in O(N). Can I do better?

Yes and no.
The problem is a binary heap does not support search for an arbitrary element. Finding it is itself O(n).
However, if you have a pointer to the element (and not only its value), you can swap the element with the rightmost leaf, remove this leaf, and then re-heapify the relevant sub-heap (by sifting down the newly placed element as much as needed). This results in O(log n) removal, but requires a pointer to the actual element you are looking for.

Amit is right in his answer but here is one more nuance:
the item moved into the removed item's position (the right-most leaf) may need to be bubbled up (compare with the parent and move up until the parent is larger than or equal to it).
In other cases it needs to be bubbled down (compare with the children and move down until all children are smaller than or equal to it). Which of the two applies depends on how the moved item compares with its new neighbours.
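To make the up-or-down point concrete, here is a minimal sketch of removal by index from a max-heap stored in a std::vector<int>. The function names (sift_up, sift_down, remove_at) are made up for illustration and do not come from any library; 0-origin indexing is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Move a[i] up while it is larger than its parent (max-heap order).
void sift_up(std::vector<int>& a, std::size_t i) {
    while (i > 0) {
        std::size_t p = (i - 1) / 2;
        if (a[p] >= a[i]) break;
        std::swap(a[p], a[i]);
        i = p;
    }
}

// Move a[i] down while one of its children is larger than it.
void sift_down(std::vector<int>& a, std::size_t i) {
    const std::size_t n = a.size();
    for (;;) {
        std::size_t largest = i;
        std::size_t l = 2 * i + 1;
        std::size_t r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) break;
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

// Remove the element at index i: overwrite it with the last leaf,
// shrink the array, then restore the heap order in whichever
// direction is needed. At most one of the two calls moves anything.
void remove_at(std::vector<int>& a, std::size_t i) {
    assert(i < a.size());
    a[i] = a.back();
    a.pop_back();
    if (i < a.size()) {
        sift_up(a, i);
        sift_down(a, i);
    }
}
```

For example, removing index 3 (value 10) from [100, 20, 90, 10, 15, 80, 85] moves 85 into a subtree whose parent is 20, so the moved item has to travel up, not down.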

Depends on what is meant by "random element." If it means that the heap contains elements [e1, e2, ..., eN] and one wants to delete some ei (1 <= i <= N), then this is possible.
If you are using a binary heap implementation from some library, it might be that it doesn't provide you with the API that you need. In that case, you should look for another library that has it.
If you were to implement it yourself, you would need two additional procedures:
A procedure deleteAtIndex(heap, i) that deletes the node at index i by moving the last element of the heap array to position i, decrementing the element count, and finally sifting the new ith element down or up to restore the heap invariant. The most common use of this procedure is to "pop" the heap by calling deleteAtIndex(heap, 1), assuming 1-origin indexing. This operation runs in O(log n) (though, to be complete, I'll note that the upper bound can be improved to O(log(log n)) depending on some assumptions about your elements' keys).
A procedure deleteElement(heap, e) that deletes the element e (your arbitrary element). Your heap algorithm would maintain an array ElementIndex such that ElementIndex[e] returns the current index of element e: calling deleteAtIndex(heap, ElementIndex[e]) will then do what you want. It also runs in O(log n) because the array access takes constant time.
Since binary heaps are often used in algorithms that merely pop the highest (or lowest) priority element (rather than deleting arbitrary elements), I imagine that some libraries might leave out these deletion procedures to save space (the extra ElementIndex array mentioned above).
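The two procedures described above could look roughly like the following sketch. It assumes unique int elements whose value is also their priority, uses 0-origin indexing (so "pop" is delete_at_index(0) rather than index 1), and replaces the ElementIndex array with a std::unordered_map. The class name IndexedMaxHeap and all member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

class IndexedMaxHeap {
    std::vector<int> heap_;
    std::unordered_map<int, std::size_t> index_;  // plays the role of ElementIndex

    // Store e at slot i and record its new position.
    void place(std::size_t i, int e) { heap_[i] = e; index_[e] = i; }

    void swap_nodes(std::size_t i, std::size_t j) {
        int a = heap_[i], b = heap_[j];
        place(i, b);
        place(j, a);
    }
    void sift_up(std::size_t i) {
        while (i > 0 && heap_[(i - 1) / 2] < heap_[i]) {
            swap_nodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap_.size() && heap_[l] > heap_[largest]) largest = l;
            if (r < heap_.size() && heap_[r] > heap_[largest]) largest = r;
            if (largest == i) return;
            swap_nodes(i, largest);
            i = largest;
        }
    }

public:
    std::size_t size() const { return heap_.size(); }
    int top() const { assert(!heap_.empty()); return heap_.front(); }

    void push(int e) {
        assert(index_.count(e) == 0);  // assumption: elements are unique
        heap_.push_back(e);
        index_[e] = heap_.size() - 1;
        sift_up(heap_.size() - 1);
    }

    // deleteAtIndex: move the last element into slot i, then fix up or down.
    void delete_at_index(std::size_t i) {
        assert(i < heap_.size());
        index_.erase(heap_[i]);
        int last = heap_.back();
        heap_.pop_back();
        if (i < heap_.size()) {
            place(i, last);
            sift_up(i);
            sift_down(i);
        }
    }

    // deleteElement: constant-time position lookup, then deleteAtIndex.
    bool delete_element(int e) {
        auto it = index_.find(e);
        if (it == index_.end()) return false;
        delete_at_index(it->second);
        return true;
    }
};
```

The extra map is exactly the space cost mentioned above: every swap inside the heap also has to update two position entries.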

Related

How do I further optimize this Data Structure?

I was recently asked to build a data structure that supports four operations, namely,
Push: Add an element to the DS.
Pop: Remove the last pushed element.
Find_max: Find the maximum element out of the currently stored elements.
Pop_max: Remove the maximum element from the DS.
The elements are integers.
Here is the solution I suggested:
Take a stack.
Store a pair of elements in it. The pair should be (element, max_so_far), where element is the element at that index and max_so_far is the maximum valued element seen so far.
While pushing an element onto the stack, check the max_so_far of the topmost stack element. If the current number is greater than that, set the current pair's max_so_far to the current element's value; otherwise store the previous max_so_far. This means that pushing is simply an O(1) operation.
For pop, simply pop an element out of the stack. Again, this operation is O(1).
For Find_max, return the value of the max_so_far of the topmost element in the stack. Again, O(1).
Popping the max element would involve going through the stack and explicitly removing the max element and pushing back the elements on top of it, after allotting new max_so_far values. This would be linear.
I was asked to improve it, but I couldn't.
In terms of time complexity, the overall time can be improved if all operations happen in O(logn), I guess. How to do that, is something I'm unable to get.
One approach would be to store pointers to the elements in a doubly-linked list, and also in a max-heap data structure (sorted by value).
Each element would store its position in the doubly-linked list and also in the max-heap.
In this case all of your operations would require O(1) time in the doubly-linked list, plus O(log(n)) time in the heap data structure.
One way to get O(log n)-time operations is to combine two data structures, in this case a doubly linked list and a priority queue (a pairing heap is a good choice). We have a node structure like
struct Node {
    Node *previous, *next;          // doubly linked list
    Node **back, *child, *sibling;  // pairing heap
    int value;
} list_head, *heap_root;
Now, to push, we push in both structures. To find_max, we return the value of the root of the pairing heap. To pop or pop_max, we pop from the appropriate data structure and then use the other node pointers to delete in the other data structure.
Usually, when you need to find elements by quality A (value), and also by quality B (insert order), then you start eyeballing a data structure that actually has two data structures inside that reference each other, or are otherwise interleaved.
For instance: two maps whose keys are quality A and quality B, and whose values are a shared pointer to a struct that contains iterators back to both maps, plus the value. Then finding an element via either quality is O(log n), and erasure is ~O(log n) to remove the two iterators from their maps.
#include <map>
#include <memory>
#include <string>

// `mountain` is assumed to be defined elsewhere.
struct hybrid {
    struct entry {
        std::map<std::string, std::shared_ptr<entry>>::iterator name_iter;
        std::map<int, std::shared_ptr<entry>>::iterator height_iter;
        mountain value;
    };
    std::map<std::string, std::shared_ptr<entry>> name_map;
    std::map<int, std::shared_ptr<entry>> height_map;
    // at() throws if the key is missing instead of inserting a null pointer
    mountain& find_by_name(const std::string& s) { return name_map.at(s)->value; }
    mountain& find_by_height(int h) { return height_map.at(h)->value; }
    void erase_by_name(const std::string& s) {
        std::shared_ptr<entry> v = name_map.at(s);  // the copy keeps the entry alive during both erases
        name_map.erase(v->name_iter);
        height_map.erase(v->height_iter);
    }
};
However, in your case, you can do even better than this O(logn), since you only need "the most recent" and "the next highest". To make "pop highest" fast, you need a fast way to detect the next highest, which means that needs to be precalculated at insert. To find the "height" position relative to the rest, you need a map of some sort. To make "pop most recent" fast, you need a fast way to detect the next most recent, but that's trivially calculated. I'd recommend creating a map or heap of nodes, where keys are the value for finding the max, and the values are a pointer to the next most recent value. This gives you O(logn) insert, O(1) find most recent, O(1) or O(logn) find maximum value (depending on implementation), and ~O(logn) erasure by either index.
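As a rough illustration of the "two structures that reference each other" idea applied to the original push / pop / find_max / pop_max problem, here is a sketch that swaps in two balanced trees (std::map and std::set) for the linked list and heap discussed above. Each pushed element gets a hypothetical insertion id, and the two containers are linked through that id instead of through pointers. The class name MaxStack and its members are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <utility>

class MaxStack {
    std::map<long, int> by_order_;             // insertion id -> value
    std::set<std::pair<int, long>> by_value_;  // (value, insertion id)
    long next_id_ = 0;

public:
    std::size_t size() const { return by_order_.size(); }

    // O(log n): insert into both structures.
    void push(int v) {
        by_order_.emplace(next_id_, v);
        by_value_.emplace(v, next_id_);
        ++next_id_;
    }

    // O(log n): remove the most recently pushed element.
    int pop() {
        assert(!by_order_.empty());
        auto last = std::prev(by_order_.end());
        int v = last->second;
        by_value_.erase(std::make_pair(v, last->first));
        by_order_.erase(last);
        return v;
    }

    // O(1) on typical implementations: largest (value, id) pair.
    int find_max() const {
        assert(!by_value_.empty());
        return by_value_.rbegin()->first;
    }

    // O(log n): remove the maximum; ties go to the most recent push.
    int pop_max() {
        assert(!by_value_.empty());
        auto top = std::prev(by_value_.end());
        int v = top->first;
        by_order_.erase(top->second);
        by_value_.erase(top);
        return v;
    }
};
```

This trades the O(1) push and pop of the plain stack for O(log n) everywhere, which is the improvement the question was asking about.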
One more way to do this:
Create a max-heap of the elements. This lets us read the maximum element in O(1) and remove it in O(log n).
Along with this, we can maintain a pointer to the last pushed element. As far as I know, deleting an arbitrary element from a heap can be done in O(log n) once its position is known.

Why does a Binary Heap have to be a Complete Binary Tree?

The heap property says:
If A is a parent node of B then the key of node A is ordered with
respect to the key of node B with the same ordering applying across
the heap. Either the keys of parent nodes are always greater than or
equal to those of the children and the highest key is in the root node
(this kind of heap is called max heap) or the keys of parent nodes are
less than or equal to those of the children and the lowest key is in
the root node (min heap).
But why in this wiki, the Binary Heap has to be a Complete Binary Tree? The Heap Property doesn't imply that in my impression.
According to the wikipedia article you provided, a binary heap must conform to both the heap property (as you discussed) and the shape property (which mandates that it is a complete binary tree). Without the shape property, one would lose the runtime advantage that the data structure provides (i.e. the completeness ensures that there is a well defined way to determine the new root when an element is removed, etc.)
Every item in the array has a position in the binary tree, and this position is calculated from the array index. The positioning formula ensures that the tree is 'tightly packed'.
For example, consider the binary tree (figure not shown) that is represented by the array
[1, 2, 3, 17, 19, 36, 7, 25, 100].
Notice that the array is ordered as if you're starting at the top of the tree, then reading each row from left-to-right.
If you add another item to this array, it will represent the slot below the 19 and to the right of the 100. If this new number is less than 19, then values will have to be swapped around, but nonetheless, that is the slot that will be filled by the 10th item of the array.
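A small sketch of that insertion, assuming the array above is a min-heap stored in a std::vector<int> with 0-origin indexing. The helper name min_heap_push is made up for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Append the value in the next free slot (which keeps the tree complete),
// then swap it with its parent while it is smaller than that parent.
void min_heap_push(std::vector<int>& heap, int value) {
    heap.push_back(value);
    std::size_t i = heap.size() - 1;
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (heap[parent] <= heap[i]) break;
        std::swap(heap[parent], heap[i]);
        i = parent;
    }
}
```

Pushing 10 into the array above first places it in the 10th slot (index 9, whose parent is index 4, the 19). Since 10 is less than 19 the two are swapped, and the next parent (2) is already smaller, so the result is [1, 2, 3, 17, 10, 36, 7, 25, 100, 19].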
Another way to look at it: try constructing a binary heap which isn't a complete binary tree. You literally cannot.
You can only guarantee O(log(n)) insertion and (root) deletion if the tree is complete. Here's why:
If the tree is not complete, then it may be unbalanced and in the worst case, simply a linked list, requiring O(n) to find a leaf, and O(n) for insertion and deletion. With the shape requirement of completeness, you are guaranteed O(log(n)) operations since it takes constant time to find a leaf (last in array), and you are guaranteed that the tree is no deeper than log2(N), meaning the "bubble up" (used in insertion) and "sink down" (used in deletion) will require at most log2(N) modifications (swaps) of data in the heap.
This being said, you don't absolutely have to have a complete binary tree, but without one you lose these runtime guarantees. In addition, as others have mentioned, having a complete binary tree makes it easy to store the tree in array format, forgoing an object-reference representation.
The point that 'complete' makes is that in a heap all interior (not leaf) nodes have two children, except where there are no children left -- all the interior nodes are 'complete'. As you add to the heap, the lowest level of nodes is filled (with childless leaf nodes), from the left, before a new level is started. As you remove nodes from the heap, the right-most leaf at the lowest level is removed (and pushed back in at the top). The heap is also perfectly balanced (hurrah!).
A binary heap can be looked at as a binary tree, but the nodes do not have child pointers, and insertion (push) and deletion (pop or from inside the heap) are quite different to those procedures for an actual binary tree.
This is a direct consequence of the way in which the heap is organised. The heap is held as a vector with no gaps between the nodes. The parent of the i'th item in the heap is item (i - 1) / 2 (assuming a binary heap, and assuming the top of the heap is item 0). The left child of the i'th item is (i * 2) + 1, and the right child one greater than that. When there are n nodes in the heap (items 0 to n - 1), a node has no left child if (i * 2) + 1 is n or more, and no right child if (i * 2) + 2 is.
The heap is a beautiful thing. Its one flaw is that you do need a vector large enough for all entries... unlike a real binary tree, you cannot allocate a node at a time. So if you have a heap for an indefinite number of items, you have to be ready to extend the underlying vector as and when needed -- or run some fragmented structure which can be addressed as if it was a vector.
FWIW: when stepping down the heap, I find it convenient to step to the right child -- (i + 1) * 2 -- if that is < n then both children are present, if it is == n only the left child is present, otherwise there are no children.
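The index arithmetic from this answer can be written down directly. This is only a sketch with hypothetical helper names, assuming the top of the heap is item 0 and n is the number of nodes.

```cpp
#include <cassert>
#include <cstddef>

// Parent of item i; only meaningful for i > 0 (item 0 is the root).
std::size_t parent(std::size_t i) {
    assert(i > 0);
    return (i - 1) / 2;
}

std::size_t left_child(std::size_t i) { return i * 2 + 1; }
std::size_t right_child(std::size_t i) { return i * 2 + 2; }

// Number of children of item i in a heap of n nodes, using the
// "step to the right child" test: (i + 1) * 2 is the right child's index.
int child_count(std::size_t i, std::size_t n) {
    std::size_t r = (i + 1) * 2;
    if (r < n) return 2;   // right child exists, so the left one does too
    if (r == n) return 1;  // only the left child (index n - 1) exists
    return 0;              // leaf
}
```

With n = 10, item 4 has exactly one child (index 9), while with n = 9 it is a leaf.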
Maintaining a binary heap as a complete binary tree gives multiple advantages, such as:
1. A heap is a complete binary tree, so its height is the minimum possible, i.e. log(size of tree). Insertion and the build-heap operation depend on the height, so a minimal height reduces their time complexity.
2. All the items of a complete binary tree are stored contiguously in an array, so random access is possible and the layout is cache-friendly.
In order for a binary tree to be considered a heap, it must meet two criteria: 1) it must have the heap property; 2) it must be a complete tree.
It is possible for a structure to have either of these properties and not have the other, but we would not call such a data structure a heap. You are right that the heap property does not entail the shape property. They are separate constraints.
The underlying structure of a heap is an array in which every node corresponds to an index. If the tree were not complete, one of the indices would have to be left empty, which is not possible because the heap is coded in such a way that each node is an index. I have given a link below so that you can see how the heap structure is built:
http://www.sanfoundry.com/java-program-implement-min-heap/
Hope it helps
I find that all answers so far either do not address the question or are, essentially, saying "because the definition says so" or use a similar circular argument. They are surely true but (to me) not very informative.
To me it became immediately obvious that the heap must be a complete tree when I remembered that you insert a new element not at the root (as you do in a binary search tree) but, rather, at the bottom right.
Thus, in a heap, a new element propagates from the bottom up - it is "moved up" within the tree till it finds a suitable place.
In a binary search tree a newly inserted element moves the other way round - it is inserted at the root and it "moves down" till it finds its place.
The fact that each new element in a heap starts as the bottom right node means that the heap is going to be a complete tree at all times.

Deleting in a heap, why does this implementation switch the values of the last element, not just replace it?

(USC CSCI 303 Homework 4) Problem 7 (6.5-7):
The operation Heap-Delete(A, i) deletes the item in node i from heap A. Give an implementation of Heap-Delete that runs in O(lg n) time for an n-element max-heap.
Here's the pseudo-code and description of the reference solution:
Heap-Delete(A, i)
    A[i] ↔ A[length(A)]
    length(A) ← length(A) - 1
    Heapify(A, i)
The algorithm deletes the element at node i, and replaces it with the last element. Then the algorithm runs Heapify from the node i.
Wouldn't it be better if "↔" were "←" instead? Or is the swap really necessary?
I got this from
http://www-scf.usc.edu/~csci303/cs303hw4solutions.pdf (Page 4)
It is not really necessary. Perhaps the intent is to return that element, in which case you need to store it somewhere, before being overwritten.
Perhaps this is done so that Heap-Delete() can be called repeatedly during execution of a heapsort. In heapsort, the largest item is taken off the heap, stashed in the last item of the heap, and the heap size is decreased by one. Then the heap is re-heapified, and you proceed to the next largest item. The end result, when the heap is size 1, is that the array on which the heap is/was based is sorted from smallest to largest. Writing Heap-Delete() in this way allows Heap-Sort() to be very simple:
while (HeapSize>0)
HeapDelete(0)
Now it doesn't look like your Heap-Delete() keeps track of heap size outside of length, which might be evidence against my theory. But this is my best guess anyway.
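A sketch of that heapsort theory, assuming a max-heap in a std::vector<int> with 0-origin indexing and an explicit heap_size kept separately from the array length (which the reference pseudo-code does not do). The function names mirror the pseudo-code but are otherwise hypothetical. Only the root is ever deleted here, so sifting down is enough.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sift a[i] down within the first heap_size items (max-heap order).
void heapify(std::vector<int>& a, std::size_t heap_size, std::size_t i) {
    for (;;) {
        std::size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < heap_size && a[l] > a[largest]) largest = l;
        if (r < heap_size && a[r] > a[largest]) largest = r;
        if (largest == i) return;
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

// Swap (not overwrite) so the deleted item survives just past the heap.
void heap_delete(std::vector<int>& a, std::size_t& heap_size, std::size_t i) {
    std::swap(a[i], a[heap_size - 1]);
    --heap_size;
    heapify(a, heap_size, i);
}

std::vector<int> heap_sort(std::vector<int> a) {
    std::size_t heap_size = a.size();
    // Build a max-heap bottom-up.
    for (std::size_t i = heap_size / 2; i-- > 0;) heapify(a, heap_size, i);
    // Each deletion parks the current maximum at the end of the shrinking heap.
    while (heap_size > 1) heap_delete(a, heap_size, 0);
    return a;
}
```

With "←" instead of the swap, the deleted maximum would simply be overwritten, and the array would not end up sorted.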
Here only a pointer to the root node is given, so even if you want to delete the last node (for those who say it takes O(1) to delete a leaf node),
you have to traverse log(n) levels to reach the leaf; only then is the answer b.

Data Structure for fast position lookup

I am looking for a data structure that logically represents a sequence of elements keyed by unique ids (for the purpose of simplicity let's consider them to be strings, or at least hashable objects). Each element can appear only once, there are no gaps, and the first position is 0.
The following operations should be supported (demonstrated with single-letter strings):
insert(id, position) - add the element keyed by id into the sequence at offset position. Naturally, the position of each element later in the sequence is now incremented by one. Example: [S E L F].insert(H, 1) -> [S H E L F]
remove(position) - remove the element at offset position. Decrements the position of each element later in the sequence by one. Example: [S H E L F].remove(2) -> [S H L F]
lookup(id) - find the position of element keyed by id. [S H L F].lookup(H) -> 1
The naïve implementation would be either a linked list or an array. Both would give O(n) lookup, remove, and insert.
In practice, lookup is likely to be used the most, with insert and remove happening frequently enough that it would be nice not to be linear (which a simple combination of hashmap + array/list would get you).
In a perfect world it would be O(1) lookup, O(log n) insert/remove, but I actually suspect that wouldn't work from a purely information-theoretic perspective (though I haven't tried it), so O(log n) lookup would still be nice.
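For reference, the simple hashmap + array baseline mentioned in the question could look like this sketch: O(1) expected lookup, but O(n) insert and remove because every later position has to be rewritten. The class name PositionIndex is hypothetical, and ids are assumed to be strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class PositionIndex {
    std::vector<std::string> seq_;                       // position -> id
    std::unordered_map<std::string, std::size_t> pos_;   // id -> position

    // Re-record the position of every element from index `from` onward.
    void renumber(std::size_t from) {
        for (std::size_t i = from; i < seq_.size(); ++i) pos_[seq_[i]] = i;
    }

public:
    void insert(const std::string& id, std::size_t position) {
        assert(position <= seq_.size());
        assert(pos_.count(id) == 0);  // each id may appear only once
        seq_.insert(seq_.begin() + static_cast<std::ptrdiff_t>(position), id);
        renumber(position);
    }

    void remove(std::size_t position) {
        assert(position < seq_.size());
        pos_.erase(seq_[position]);
        seq_.erase(seq_.begin() + static_cast<std::ptrdiff_t>(position));
        renumber(position);
    }

    // Throws std::out_of_range if the id is not present.
    std::size_t lookup(const std::string& id) const { return pos_.at(id); }
};
```

The renumber loop is the linear cost that the answers below try to avoid.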
A combination of trie and hash map allows O(log n) lookup/insert/remove.
Each node of trie contains id as well as counter of valid elements, rooted by this node and up to two child pointers. A bit string, determined by left (0) or right (1) turns while traversing the trie from its root to given node, is part of the value, stored in the hash map for corresponding id.
Remove operation marks trie node as invalid and updates all counters of valid elements on the path from deleted node to the root. Also it deletes corresponding hash map entry.
Insert operation should use the position parameter and counters of valid elements in each trie node to search for new node's predecessor and successor nodes. If in-order traversal from predecessor to successor contains any deleted nodes, choose one with lowest rank and reuse it. Otherwise choose either predecessor or successor, and add a new child node to it (right child for predecessor or left one for successor). Then update all counters of valid elements on the path from this node to the root and add corresponding hash map entry.
Lookup operation gets a bit string from the hash map and uses it to go from trie root to corresponding node while summing all the counters of valid elements to the left of this path.
All this allows O(log n) expected time for each operation if the sequence of inserts/removes is random enough. If not, the worst case complexity of each operation is O(n). To get it back to O(log n) amortized complexity, watch the sparsity and balancing factors of the tree: if there are too many deleted nodes, re-create a new perfectly balanced and dense tree; if the tree is too imbalanced, rebuild the most imbalanced subtree.
Instead of hash map it is possible to use some binary search tree or any dictionary data structure. Instead of bit string, used to identify path in the trie, hash map may store pointer to corresponding node in trie.
Another alternative to using a trie in this data structure is an indexable skip list.
O(log N) time for each operation is acceptable, but not perfect. It is possible, as explained by Kevin, to use an algorithm with O(1) lookup complexity in exchange for larger complexity of other operations: O(sqrt(N)). But this can be improved.
If you choose some number of memory accesses (M) for each lookup operation, other operations may be done in O(M*N^(1/M)) time. The idea of such algorithm is presented in this answer to related question. Trie structure, described there, allows easily converting the position to the array index and back. Each non-empty element of this array contains id and each element of hash map maps this id back to the array index.
To make it possible to insert element to this data structure, each block of contiguous array elements should be interleaved with some empty space. When one of the blocks exhausts all available empty space, we should rebuild the smallest group of blocks, related to some element of the trie, that has more than 50% empty space. When total number of empty space is less than 50% or more than 75%, we should rebuild the whole structure.
This rebalancing scheme gives O(M*N^(1/M)) amortized complexity only for random and evenly distributed insertions/removals. Worst case complexity (for example, if we always insert at leftmost position) is much larger for M > 2. To guarantee O(M*N^(1/M)) worst case we need to reserve more memory and to change rebalancing scheme so that it maintains invariant like this: keep empty space reserved for whole structure at least 50%, keep empty space reserved for all data related to the top trie nodes at least 75%, for next level trie nodes - 87.5%, etc.
With M=2, we have O(1) time for lookup and O(sqrt(N)) time for other operations.
With M=log(N), we have O(log(N)) time for every operation.
But in practice small values of M (like 2 .. 5) are preferable. This may be treated as O(1) lookup time and allows this structure (while performing typical insert/remove operation) to work with up to 5 relatively small contiguous blocks of memory in a cache-friendly way with good vectorization possibilities. Also this limits memory requirements if we require good worst case complexity.
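The two special cases quoted above (M = 2 and M = log N) follow from substituting into the M*N^(1/M) bound. A short check, taking logarithms base 2:

```latex
\begin{align*}
M = 2:&\quad M \cdot N^{1/M} = 2\,N^{1/2} = O(\sqrt{N}) \\
M = \log_2 N:&\quad N^{1/\log_2 N} = 2^{\log_2 N / \log_2 N} = 2
  \;\Rightarrow\; M \cdot N^{1/M} = 2\log_2 N = O(\log N)
\end{align*}
```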
You can achieve everything in O(sqrt(n)) time, but I'll warn you that it's going to take some work.
Start by having a look at a blog post I wrote on ThriftyList. ThriftyList is my implementation of the data structure described in Resizable Arrays in Optimal Time and Space along with some customizations to maintain O(sqrt(n)) circular sublists, each of size O(sqrt(n)). With circular sublists, one can achieve O(sqrt(n)) time insertion/removal by the standard insert/remove-then-shift in the containing sublist followed by a series of push/pop operations across the circular sublists themselves.
Now, to get the index at which a query value falls, you'll need to maintain a map from value to sublist/absolute-index. That is to say, a given value maps to the sublist containing the value, plus the absolute index at which the value falls (the index at which the item would fall were the list non-circular). From these data, you can compute the relative index of the value by taking the offset from the head of the circular sublist and summing with the number of elements which fall behind the containing sublist. To maintain this map requires O(sqrt(n)) operations per insert/delete.
Sounds roughly like Clojure's persistent vectors - they provide O(log_32 n) cost for lookup and update. For smallish values of n, O(log_32 n) is as good as constant....
Basically they are array mapped tries.
Not quite sure on the time complexity for remove and insert - but I'm pretty sure that you could get a variant of this data structure with O(log n) removes and inserts as well.
See this presentation/video: http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
Source code (Java): https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentVector.java
