Data structure to extract median and 2nd smallest element in O(lgn) - data-structures

I need to find a data structure that meets these requirements:
can build it from a list of n items in O(n)
inserting an item takes O(lg n)
extracting the median takes O(lg n)
extracting the 2nd smallest item takes O(lg n)
For the first three requirements, this works: keep the n/2 smallest items in a max heap and the n/2 largest in a min heap. The roots of those heaps will be the lower/upper median.
But I'm stuck with the 4th requirement. Any ideas?

Keep the n/2 largest items in a min heap. For the n/2 smallest items maintain a pair of max and min heaps. Heaps in this pair are augmented with index of the same element in the paired heap, so that any heap modification updates indexes in the paired heap for all moved items.
Paired heap explanation
Both heaps contain exactly the same set of items. Along with each item there is an additional index field. When a heap is modified, some items may change position. If an item is moved from index x to index y, the corresponding item in the paired heap must be notified: it is easily located via the moved item's index field, and its own index field is then updated from x to y. This way every heap item knows exactly where its pair is located. Keeping corresponding items in both heaps in sync allows you, while extracting the largest item from the max heap or the 2nd smallest item from the min heap, to extract the corresponding item from the paired heap as well. And keeping the heaps in sync does not increase any of the complexity requirements.
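Here's a rough sketch of that index linking over array-based binary heaps (the Slot/LinkedHeap names and the std::function comparator are mine, just for illustration):

#include <functional>
#include <vector>

// Two array-based binary heaps over the same items. Each slot stores the
// item's value plus the current index of its twin slot in the other heap.
struct Slot { int value; int twin; };

struct LinkedHeap {
    std::vector<Slot> a;                  // 0-based binary heap
    std::function<bool(int, int)> before; // min-heap: a < b; max-heap: a > b
    LinkedHeap* other = nullptr;          // wire up: minH.other = &maxH; etc.

    // Write slot s into position i and notify its twin of the new index.
    void place(int i, Slot s) {
        a[i] = s;
        other->a[s.twin].twin = i;
    }
    void siftUp(int i) {
        Slot s = a[i];
        while (i > 0 && before(s.value, a[(i - 1) / 2].value)) {
            place(i, a[(i - 1) / 2]);     // parent moves down; its twin is told
            i = (i - 1) / 2;
        }
        place(i, s);
    }
    void siftDown(int i) {
        Slot s = a[i];
        for (;;) {
            int c = 2 * i + 1;
            if (c >= (int)a.size()) break;
            if (c + 1 < (int)a.size() && before(a[c + 1].value, a[c].value)) ++c;
            if (!before(a[c].value, s.value)) break;
            place(i, a[c]);               // child moves up; its twin is told
            i = c;
        }
        place(i, s);
    }
};

Extraction composes these pieces: the 2nd smallest item overall is the smaller of the min heap root's two children; remove it, then use its twin field to delete the same item from the max heap (overwrite it with the heap's last slot and sift), all in O(lg n).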

Related

What is the Time Complexity for sorting all elements of ⌈logn⌉ sorted lists of ⌊n/logn⌋ elements each?

Suppose there are ⌈logn⌉ sorted lists of ⌊n/logn⌋ elements each. The time complexity of producing a sorted list of all these elements is: (Hint: use a heap data structure)
A. O(n log log n)
B. Θ(n log n)
C. Ω(n log n)
D. Ω(n^(3/2))
My Understanding:
There are log n lists, each containing n/log n elements, so we can apply the min-heap procedure to each list;
this can be done in O(n/log n) per list. Now we have log n lists that each satisfy the min-heap property. How do I take it further from here? I am really confused. Please help me visualize it.
[I assume we're sorting into increasing order]
Build a heap of the smallest (ie: first) element of each list, (and for each, along with the value, keep a record of which list it came from at which index). Repeatedly remove the smallest element of this heap, and then insert the next element in the list it came from (if that list hasn't already been consumed). This gives you the sorted list of all the elements.
This heap has ⌈log n⌉ elements, so the initial cost of building it is O(log n), and each remove and insert takes O(log log n) time. So overall, the cost of this sort is O(log n + n log log n) = O(n log log n).
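In C++ terms, the merge might look like this (mergeSorted and Entry are illustrative names; a std::priority_queue with std::greater is a min-heap):

#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge sorted lists with a min-heap holding one (value, list, index) entry
// per list; with ceil(log n) lists, every push/pop costs O(log log n).
std::vector<int> mergeSorted(const std::vector<std::vector<int>>& lists) {
    using Entry = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t l = 0; l < lists.size(); ++l)
        if (!lists[l].empty()) heap.push({lists[l][0], l, 0});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, l, i] = heap.top();
        heap.pop();
        out.push_back(v);
        if (i + 1 < lists[l].size())      // advance within the source list
            heap.push({lists[l][i + 1], l, i + 1});
    }
    return out;
}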

Which algorithm should use when get the biggest n elements from a large list?

In my project, there is a very large list.
The most common operation on this list is to get the biggest n elements.
n is fixed, or changes only rarely, throughout the whole lifetime. Which algorithm should I use in order to do this efficiently?
That is: what should I do when inserting, updating, or deleting an element in the list, and what should I do when getting the top n elements from the list?
There is a solution (maybe not that good):
After inserting, updating, or deleting an element, sort the list with quicksort or another sorting algorithm. Because the list is very large, this step may be too slow.
When getting top n elements, get the first n elements from the list.
Is there any better solution?
So you have a list of n items, and you want to pick the k largest. One way to do this is with a min-heap of size k. The resulting algorithm is O(n log k).
Start by creating an empty min-heap of the first k items. Then, for each following item in the list, if it's larger than the smallest item on the heap, remove the smallest item on the heap and replace it with the new item. When you're done, the largest k items will be on the heap. Pseudo code looks like this:
// assume an array a[], with length n.
// k is the number of largest items you want.
heap = new min-heap
// add the first k items to the heap
for (i = 0; i < k; ++i)
    heap.add(a[i])
// for each remaining item, replace the heap's minimum when the item is larger
for (i = k; i < n; ++i)
    if (a[i] > heap.peek())
        heap.removeMin()
        heap.add(a[i])
// at this point, the largest k items are on the min-heap
This technique works well when k is a small percentage of n. In that case, it requires little memory. The algorithm has a worst-case running time of O(n log k), but the actual cost is highly dependent on the order of items in the list. The worst case is when the array is sorted in ascending order; the best case is when it is sorted in descending order. In the average case, far fewer than 50% of the items get added to and removed from the heap.
Another algorithm, Quickselect, has average complexity O(n), but is slower than the heap selection method when k is a small percentage (1 or 2%) of n. Quickselect also modifies the existing list, which might not be something you want.
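For what it's worth, the C++ standard library ships a quickselect-style partition as std::nth_element; a minimal sketch of using it for the k largest:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Select the k largest by partitioning: after nth_element with a descending
// comparator, the first k positions hold the k largest (in no set order).
std::vector<int> topK(std::vector<int> a, std::size_t k) {
    std::nth_element(a.begin(), a.begin() + k, a.end(), std::greater<int>());
    a.resize(k);
    return a;
}

Taking the vector by value keeps the caller's list unmodified, at the cost of one copy.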
See my blog post, https://blog.mischel.com/2011/10/25/when-theory-meets-practice/, for more details.
You can do a few things here to speed up your average time by maintaining a heap rather than rebuilding it for every query.
If the number of items you want will always be less than 30, then maintain a heap of 30 items all the time. If the user just wants the top 10, then you can pick just those from the heap.
When an item is added to the list, check to see if it is larger than the smallest item on the heap. If it is, replace the smallest item.
When an item is deleted, mark the heap as dirty.
When asked for the top k items, if the heap is dirty then you have to rebuild it. Otherwise, you can just copy the contents of the heap to a scratch array, sort it, and return the k items that were asked for. Of course, once you rebuild the heap, clear the dirty flag.
The result, then, is that you can maintain the heap at little cost: potentially updating it whenever a new item is added, but only if it is larger than one of the top 30 (or whatever your max is) items. The only time you have to rebuild is when asked for the top k items after a deletion.
Come to think of it, you only have to mark the heap as dirty if the item you delete is greater than or equal to the smallest item on the heap. Also, if the heap is marked as dirty, then you can forego any further update on insertion or deletion because you have to rebuild the heap anyway the next time you get a query.
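A sketch of that bookkeeping (CachedTopK and its member names are illustrative, and it assumes the full list remains accessible for rebuilds):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Min-heap of the best `cap` items plus a dirty flag; std::greater makes
// heap.front() the smallest of the cached top items.
struct CachedTopK {
    std::size_t cap = 30;                  // the largest k ever requested
    std::vector<int> heap;
    bool dirty = false;
    const std::vector<int>* all = nullptr; // the full list, for rebuilds

    void onInsert(int x) {
        if (dirty) return;                 // a rebuild is pending anyway
        if (heap.size() < cap) {
            heap.push_back(x);
            std::push_heap(heap.begin(), heap.end(), std::greater<int>());
        } else if (x > heap.front()) {     // beats the smallest cached item
            std::pop_heap(heap.begin(), heap.end(), std::greater<int>());
            heap.back() = x;
            std::push_heap(heap.begin(), heap.end(), std::greater<int>());
        }
    }
    void onDelete(int x) {                 // only matters if x was a top item
        if (!heap.empty() && x >= heap.front()) dirty = true;
    }
    void rebuild() {                       // O(N log cap) scan of the list
        heap.clear();
        dirty = false;
        for (int x : *all) onInsert(x);
    }
    std::vector<int> top(std::size_t k) {  // requires k <= cap
        if (dirty) rebuild();
        std::vector<int> out = heap;
        std::sort(out.begin(), out.end(), std::greater<int>());
        if (out.size() > k) out.resize(k);
        return out;
    }
};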
A (balanced) binary search tree is your best friend. Insertions, deletions, and search for the k-th are all O(log n).
If the data resides in external memory, then a B-tree or similar.
If n << size(list), then use a hashtable for the main elements, and a companion data structure to store the n biggest elements. The companion data structure is updated during insertion and deletion, and it is used to answer queries for the biggest elements.
If n is 30, a sorted array is sufficient as the companion structure.
Disclaimer: this approach performs poorly if the biggest elements are often removed, since deleting a biggest element requires a sequential scan of the whole hashtable.
In C++ STL, your best bet is to use a std::set.
Every time you add an element, it is kept in order.
Then you can extract the last n elements of the std::set.
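For example (note that std::set silently drops duplicates, so std::multiset is the safer pick when values can repeat):

#include <cstddef>
#include <set>
#include <vector>

// The container keeps itself ordered on insert; the n biggest elements are
// simply the last n, read off in reverse.
std::vector<int> biggestN(const std::multiset<int>& s, std::size_t n) {
    std::vector<int> out;
    for (auto it = s.rbegin(); it != s.rend() && out.size() < n; ++it)
        out.push_back(*it);
    return out;
}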

Implementing a smart-list

I've been asked to devise a data structure called clever-list, which holds items with real-number keys and supports the following operations:
Insert(x) - inserts a new element into the list. Should be O(log n).
Remove min/max - removes and returns the min/max element of the list. Should be O(log n).
Transform - changes which object remove min/max returns (if it was min, then to max, and vice versa). Should be O(1).
Random sample(k) - returns randomly selected k elements from the list(k bigger than 0 and smaller than n). Should be in O(min(k log k, n + (n-k) log (n-k))).
Assumptions about the structure:
The data structure won't hold more than 3n elements at any stage.
We cannot assume that n = O(1).
We can use a Random() method which returns a real number in [0,1) and runs in O(1) time.
I managed to implement the first three methods using a min-max fine heap. However, I don't have a clue how to do the random sample(k) method within this time bound. All I could find is reservoir sampling, which runs in O(n) time.
Any suggestions?
You can do all of that with a min-max heap implemented in an array, including the random sampling.
For the random sampling, pick a random index from 0 to n-1. That's the index of the item you want to remove. Copy that item, then replace the item at that index with the last item in the array, and reduce the count. Now, either bubble that item up or sift it down as required.
If it's on a min level and the item is smaller than its parent, then bubble it up. If it's larger than its smallest child, sift it down. If it's on a max level, you reverse the logic.
That random sampling is O(k log n). That is, you'll remove k items from a heap of n items. It's the same complexity as k calls to delete-min.
Additional info
If you don't have to remove the items from the list, then you can do a naive random sampling in O(k) by selecting k indexes from the array. However, there is a chance of duplicates. To avoid duplicates, you can do this:
When you select an item at random, swap it with the last item in the array and reduce the count by 1. When you've selected all the items, they're in the last k positions of the array. This is clearly an O(k) operation. You can copy those items to be returned by the function. Then, set count back to the original value and call your MakeHeap function, which can build a heap from an arbitrary array in O(n). So your operation is O(k + n).
The MakeHeap function is pretty simple:
for (int i = count/2; i >= 0; --i)
{
    SiftDown(i);
}
Another option would be, when you do a swap, to save the swap operation on a stack. That is, save the from and to indexes. To put the items back, just run the swaps in reverse order (i.e. pop from the stack, swap the items, and continue until the stack is empty). That's O(k) for the selection, O(k) for putting it back, and O(k) extra space for the stack.
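A sketch of that select-and-undo idea (sampleK and the rng parameter are illustrative names):

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Partial Fisher-Yates with an undo stack: select k random items into the
// tail of the array, copy them out, then replay the swaps in reverse so the
// heap's array is restored exactly as it was.
std::vector<int> sampleK(std::vector<int>& a, std::size_t k, std::mt19937& rng) {
    std::vector<std::pair<std::size_t, std::size_t>> undo;
    std::size_t count = a.size();
    for (std::size_t i = 0; i < k; ++i) {
        std::size_t j = std::uniform_int_distribution<std::size_t>(0, count - 1)(rng);
        std::swap(a[j], a[count - 1]);    // move the pick into the tail
        undo.push_back({j, count - 1});
        --count;
    }
    std::vector<int> out(a.end() - k, a.end());
    while (!undo.empty()) {               // each swap is its own inverse
        std::swap(a[undo.back().first], a[undo.back().second]);
        undo.pop_back();
    }
    return out;
}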
Another way to do it, of course, is to do the removals as I suggested, and once all the removals are done you re-insert the items into the heap. That's O(k log n) to remove and O(k log n) to add.
You could, by the way, do the random sampling in O(k) best case by using a hash table to hold the randomly selected indexes. You just generate random indexes and add them to the hash table (which won't accept duplicates) until the hash table contains k items. The problem with that approach is that, at least in theory, the algorithm could fail to terminate.
If you store the numbers in an array, and use a self-balancing binary tree to maintain a sorted index of them, then you can do all the operations with the time complexities given. In the nodes of the tree, you'll need pointers into the number array, and in the array you'll need a pointer back into the node of the tree where that number belongs.
Insert(x) adds x to the end of the array, and then inserts it into the binary tree.
Remove min/max follows the left/right branches of the binary tree to find the min or max, then removes it. You need to swap the last number in the array into the hole produced by the removal. This is when you need the back pointers from the array back into the tree.
Transform toggles a bit that tells remove min/max which end to operate on.
Random sample either picks k or (n-k) unique ints in the range 0..n-1 (depending on whether 2k < n). The random sample is either the elements at those k locations in the number array, or the elements at all but those (n-k) locations.
Creating a set of k unique ints in the range 0..n-1 can be done in O(k) time, assuming that (uninitialized) memory can be allocated in O(1) time.
First, assume that you have a way of knowing if memory is uninitialized or not. Then, you could have an uninitialized array of size n, and do the usual k-steps of a Fisher-Yates shuffle, except every time you access an element of the array (say, index i), if it's uninitialized, then you can initialize it to value i. This avoids initializing the entire array which allows the shuffle to be done in O(k) time rather than O(n) time.
Second, obviously it's not possible in general to know if memory is uninitialized or not, but there's a trick you can use (at the cost of doubling the amount of memory used) that lets you implement a sparse array in uninitialized memory. It's described in depth on Russ Cox's blog here: http://research.swtch.com/sparse
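The core of that post is the two-array sparse set; a compact sketch (using std::vector for brevity, which zero-initializes, so the real trick needs genuinely uninitialized allocation to beat O(n) setup):

#include <cstddef>
#include <vector>

// The two-array "sparse set": sparse[] may hold garbage, but index i is a
// member only if the dense/sparse cross-check below holds, so garbage is
// never trusted.
struct SparseSet {
    std::vector<std::size_t> dense, sparse;
    std::size_t n = 0;                    // number of members
    explicit SparseSet(std::size_t cap) : dense(cap), sparse(cap) {}

    bool contains(std::size_t i) const {
        return sparse[i] < n && dense[sparse[i]] == i;
    }
    void insert(std::size_t i) {          // O(1); ignores duplicates
        if (!contains(i)) { dense[n] = i; sparse[i] = n; ++n; }
    }
};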
This gives you an O(k) way of randomly selecting k numbers. If k is large (ie: > n/2) you can do the selection of (n-k) numbers instead of k numbers, but you still need to return the non-selected numbers to the user, which is always going to be O(k) if you copy them out, so the faster selection gains you nothing.
A simpler approach, if you don't mind giving out access to your internal data structure, is to do k or n-k steps of the Fisher-Yates shuffle on the underlying array (depending on whether k < n/2, and being careful to update the corresponding nodes in the tree to maintain their values), and then return either a[0..k-1] or a[k..n-1]. In this case, the returned value is only valid until the next operation on the data structure. This method is O(min(k, n-k)).

O(klogk) time algorithm to find kth smallest element from a binary heap

We have an n-node binary heap which contains n distinct items (smallest item at the root). For k <= n, find an O(k log k) time algorithm to select the kth smallest element from the heap.
O(k log n) is obvious, but I couldn't figure out an O(k log k) one. Maybe we can use a second heap, not sure.
Well, your intuition was right that we need an extra data structure to achieve O(k log k): if we simply perform operations on the original heap, the log n term will remain in the resulting complexity.
Guessing from the target complexity O(k log k), I feel like creating and maintaining a heap of size k to help achieve the goal. As you may be aware, building a heap of size k top-down takes O(k log k), which really reminds me of our goal.
The following is my try (not necessarily elegant or efficient) in an attempt to attain O(klogk):
We create a new min heap, initializing its root to be the root of the original heap.
We update the new min heap by deleting its current root and inserting the two children (in the original heap) of the node just deleted. We repeat this process k-1 times.
The resulting heap will consist of k nodes, the root of which is the kth smallest element in the original heap.
Notes: Nodes in the new heap should store indexes of their corresponding nodes in the original heap, rather than the node values themselves. In each iteration of step 2, we add a net of one node to the new heap (one deleted, two inserted), so k-1 iterations leave a heap of size k. During the ith iteration, the node deleted is the ith smallest element in the original heap.
Time Complexity: each iteration takes O(3 log k) time to delete one element from, and insert two into, the new heap. After k-1 iterations, this is O(3k log k) = O(k log k).
Hope this solution inspires you a bit.
Assuming that we're using a min-heap, so that a root node is always smaller than its child nodes.
Create a sorted list toVisit, which contains the nodes which we will traverse next. This is initially just the root node.
Create an array smallestNodes. Initially this is empty.
While the length of smallestNodes < k:
    Remove the smallest node from toVisit
    Add that node to smallestNodes
    Add that node's children to toVisit
When you're done, the kth smallest node is in smallestNodes[k-1].
Depending on the implementation of toVisit, you can get insertion in O(log k) time and removal of the smallest element in constant time (since you're only removing the topmost node). That makes O(k log k) total.
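Both answers describe the same frontier search; here's a sketch over an array-based min-heap (kthSmallest is an illustrative name, C++17):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Frontier search over an array-based min-heap h (0-based; children of i
// are at 2i+1 and 2i+2). Assumes h is non-empty and 1 <= k <= h.size().
int kthSmallest(const std::vector<int>& h, int k) {
    using Node = std::pair<int, std::size_t>;   // (value, index in h)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    pq.push({h[0], 0});
    for (int i = 1; i < k; ++i) {               // pop the i-th smallest...
        std::size_t idx = pq.top().second;
        pq.pop();
        if (2 * idx + 1 < h.size()) pq.push({h[2 * idx + 1], 2 * idx + 1});
        if (2 * idx + 2 < h.size()) pq.push({h[2 * idx + 2], 2 * idx + 2});
    }
    return pq.top().first;                      // ...leaving the k-th on top
}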

When would I want to use a heap?

Besides the obvious answer of a Priority Queue, when would a heap be useful in my programming adventures?
Use it whenever you need quick access to the largest (or smallest) item, because that item will always be the first element in the array or at the root of the tree.
However, the remainder of the array is kept partially unsorted. Thus, instant access is only possible to the largest (smallest) item. Insertions are fast, so it's a good way to deal with incoming events or data and always have access to the earliest/biggest.
Useful for priority queues, schedulers (where the earliest item is desired), etc...
A heap is a tree in which each parent node's value is larger than the values of all its descendant nodes (for a max-heap; smaller for a min-heap).
If you think of a heap as a binary tree stored in linear order by depth, with the root node first (then the children of that node next, then the children of those nodes next); then the children of a node at index N are at 2N+1 and 2N+2. This property allows quick access-by-index. And since heaps are manipulated by swapping nodes, this allows for in-place sorting.
Heaps are structures meant to allow quick access to the min or the max.
But why would you want that? You could just check every entry on add to see if it's the smallest or the biggest. This way you always have the smallest or the biggest in constant time O(1).
The answer is because heaps allow you to pull the smallest or the biggest and quickly know the NEXT smallest or biggest. That's why it's called a Priority Queue.
Real world example (not very fair world, though):
Suppose you have a hospital in which patients are attended to based on their age. The oldest are always attended to first, no matter when they joined the queue.
You can't just keep track of the oldest one, because once you pull them out, you don't know who the next oldest is. In order to solve this hospital problem, you implement a max heap. This heap is, by definition, partially ordered. This means you cannot sort the patients by their age, but you know that the oldest one is always at the top, so you can pull a patient out in constant time O(1) and re-balance the heap in O(log n) time.
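A toy version of that queue, using std::priority_queue (a max-heap by default) keyed on age; the names are made up:

#include <queue>
#include <string>

// Max-heap of patients keyed on age: top() is always the oldest.
struct Patient { int age; std::string name; };
bool operator<(const Patient& a, const Patient& b) { return a.age < b.age; }

void hospitalDemo() {
    std::priority_queue<Patient> er;
    er.push({34, "Ana"});        // O(log n) insert
    er.push({81, "Bo"});
    er.push({62, "Cy"});
    Patient next = er.top();     // the oldest patient (Bo), O(1)
    er.pop();                    // remove and re-balance, O(log n)
    (void)next;
}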
More sophisticated example:
Suppose you have a sequence of integers and you want to keep track of the median. The median is the number that is in the middle of an ordered array.
Example:
[1, 2, 5, 7, 23, 27, 31]
In the above case, 7 is the median because the array containing the smaller numbers [1, 2, 5] is the same size as the one containing the bigger numbers [23, 27, 31]. Normally, if the array has an even number of elements, the median is the arithmetic average of the 2 elements in the middle, e.g. (5 + 7)/2.
Now, how do you keep track of the median? By having 2 heaps: a max heap containing the numbers smaller than the current median and a min heap containing the numbers bigger than the current median. If these heaps are kept balanced, they will contain the same number of elements, or one will have at most 1 element more than the other.
When you add a new element to the sequence, if the number is smaller than the current median, you add it to the max heap of smaller numbers; otherwise, you add it to the min heap of bigger numbers. If the heaps become unbalanced (one heap has more than 1 element more than the other), you pull an element from the bigger heap and add it to the smaller one. Now they're balanced.
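A sketch of those paired heaps with std::priority_queue (RunningMedian is an illustrative name):

#include <functional>
#include <queue>
#include <vector>

// Two balanced heaps: `low` (max-heap) holds the smaller half, `high`
// (min-heap) the larger half; their sizes never differ by more than one.
struct RunningMedian {
    std::priority_queue<int> low;
    std::priority_queue<int, std::vector<int>, std::greater<int>> high;

    void add(int x) {
        if (low.empty() || x <= low.top()) low.push(x);
        else high.push(x);
        if (low.size() > high.size() + 1) { high.push(low.top()); low.pop(); }
        else if (high.size() > low.size() + 1) { low.push(high.top()); high.pop(); }
    }
    double median() const {               // assumes at least one element added
        if (low.size() == high.size()) return (low.top() + high.top()) / 2.0;
        return low.size() > high.size() ? low.top() : high.top();
    }
};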
The characteristic of a heap is that it keeps data semi-ordered; thus, it is a good tradeoff between the cost of maintaining a complete order and the cost of searching through random chaos. That characteristic is exploited by many algorithms, such as selection, sorting, and classification.
Another useful characteristic of a heap is that it can be created in-place from an array!
Also good for selection algorithms (finding the min or max)
Anytime you sort a temporary list, you should consider heaps.
You can use a minHeap or maxHeap when you want to access the smallest and largest elements respectively.
