Partial sorting to find the kth largest/smallest elements - algorithm

Source : Wikipedia
A streaming, single-pass partial sort is also possible using heaps or
other priority queue data structures. First, insert the first k
elements of the input into the structure. Then make one pass over the
remaining elements, add each to the structure in turn, and remove the
largest element. Each insertion operation also takes O(log k) time,
resulting in O(n log k) time overall.
How is this different from the case where we first heapify the complete input array in O(n) time and then extract the minimum of the heap k times.
I don't understand the part where it says to make one pass over the remaining elements, add each to the structure in turn, and remove the largest element. Isn't this the same as the method described in 1) ?

The suggested method is streaming. It doesn't need to have all the items in memory to run the heapify algorithm, given it O(k) space complexity (but it only finds the top-k items).
A more explicit description of the algorithm (see also the reference WP gives) is
given a stream of items:
make a heap of the first k elements in the stream,
for each element after the first k:
push it onto the heap,
extract the largest (or smallest) element from the heap and discard it,
finally return the k values left in the heap.
By construction, the heap never grows to more than k + 1 elements. The items can be streamed in from disk, over a network, etc., which is not possible with the heapify algorithm.

Related

existence of a certain data structure

I'm wondering, can there exists such a data stucture under the following criterions and times(might be complicated)?
if we obtain an unsorted list L an build a data structure out of it like this:
Build(L,X) - under O(n) time, we build the structure S from an unsorted list of n elements
Insert (y,S) under O(lg n) we insert z into the structure S
DEL-MIN(S) - under O(lg n) we delete the minimal element from S
DEL-MAX(S) - under O(lg n) we delete the maximal element from S
DEL-MId(S) - under O(lg n) we delete the upper medial(ceiling function) element from S
the problem is that the list L is unsorted. can such a data structure exist?
DEL-MIN and DEL-MAX are easy: keep a min-heap and max-heap of all the elements. The only trick is that you have to keep indices of the value in the heap so that when (for example) you remove the max, you can also find it and remove it in the min-heap.
For DEL-MED, you can keep a max-heap of the elements less than the median and a min-heap of the elements greater than or equal to the median. The full description is in this answer: Data structure to find median. Note that in that answer the floor-median is returned, but that's easily fixed. Again, you need to use the cross-indexing trick to refer to the other datastructures as in the first part. You will also need to think how this handles repeated elements if that's possible in your problem formulation. (If necessary, you can do it by storing repeated elements as (count, value) in your heap, but this complicates rebalancing the heaps on insert/remove a little).
Can this all be built in O(n)? Yes -- you can find the median of n things in O(n) time (using the median-of-median algorithm), and heaps can be built in O(n) time.
So overall, the datastructure is 4 heaps (a min-heap of all the elements, a max-heap of all the elements, a max-heap of the floor(n/2) smallest elements, a min-heap of the ceil(n/2) largest elements. All with cross-indexes to each other.

Find nth Smallest element from Binary Search Tree

How to find nth Smallest element from Binary Search Tree
Constraints are :
time complexity must be O(1)
No extra space should be used
I have already tried 2 approaches.
Doing inorder traversal and finding nth element - Time complexity O(n)
Maintaining no. of small elements than current node and finding element with m small elements - Time complexity O(log n)
The only way I could think about is to change the data structure that holds the BST in memory. Should be simple if you actually consider every nodes as structure themselves (value, left_child and right_child) instead of storing them in a unordered array, you can store them in a ordered array. Thus the nth smallest element would be the nth element in your array. The extra computation will be at insertion and deletion. But it still would be more effective if you use for example a C++ set (log(n) for both insertion and deletion).
It mainly depends on your use case.
If you do not use data structure for handling the tree (based on array position) I don't think you cannot do it in something better than log(n).

Implementing a smart-list

I've been asked to devise a data structure called clever-list which holds items with real key numbers and offers the next operations:
Insert(x) - inserts a new element to the list. Should be in O(log n).
Remove min/max - removes and returns the min/max element in the list. Should be in O(log n) time.
Transform - changes the return object of remove min/max (if was min then to max, and the opposite). Should be in O(1).
Random sample(k) - returns randomly selected k elements from the list(k bigger than 0 and smaller than n). Should be in O(min(k log k, n + (n-k) log (n-k))).
Assumptions about the structure:
The data structure won't hold more then 3n elements at any stage.
We cannot assume that n=O(1).
We can use Random() method which return a real number between [0,1) and preforms in O(1) time.
I managed to implement the first three methods, using a min-max fine heap. However, I don't have a clue about the random sample(k) method in this time limit. All I could find is "Reservoir sampling", which operates in O(n) time.
Any suggestions?
You can do all of that with a min-max heap implemented in an array, including the random sampling.
For the random sampling, pick a random number from 0 to n. That's the index of the item you want to remove. Copy that item and then replace the item at that index with the last item in the array, and reduce the count. Now, either bubble that item up or sift it down as required.
If it's on a min level and the item is smaller than its parent, then bubble it up. If it's larger than its smallest child, sift it down. If it's on a max level, you reverse the logic.
That random sampling is O(k log n). That is, you'll remove k items from a heap of n items. It's the same complexity as k calls to delete-min.
Additional info
If you don't have to remove the items from the list, then you can do a naive random sampling in O(k) by selecting k indexes from the array. However, there is a chance of duplicates. To avoid duplicates, you can do this:
When you select an item at random, swap it with the last item in the array and reduce the count by 1. When you've selected all the items, they're in the last k positions of the array. This is clearly an O(k) operation. You can copy those items to be returned by the function. Then, set count back to the original value and call your MakeHeap function, which can build a heap from an arbitrary array in O(n). So your operation is O(k + n).
The MakeHeap function is pretty simple:
for (int i = count/2; i >= 0; --i)
{
SiftDown(i);
}
Another option would be, when you do a swap, to save the swap operation on a stack. That is, save the from and to indexes. To put the items back, just run the swaps in reverse order (i.e. pop from the stack, swap the items, and continue until the stack is empty). That's O(k) for the selection, O(k) for putting it back, and O(k) extra space for the stack.
Another way to do it, of course, is to do the removals as I suggested, and once all the removals are done you re-insert the items into the heap. That's O(k log n) to remove and O(k log n) to add.
You could, by the way, do the random sampling in O(k) best case by using a hash table to hold the randomly selected indexes. You just generate random indexes and add them to the hash table (which won't accept duplicates) until the hash table contains k items. The problem with that approach is that, at least in theory, the algorithm could fail to terminate.
If you store the numbers in an array, and use a self-balancing binary tree to maintain a sorted index of them, then you can do all the operations with the time complexities given. In the nodes of the tree, you'll need pointers into the number array, and in the array you'll need a pointer back into the node of the tree where that number belongs.
Insert(x) adds x to the end of the array, and then inserts it into the binary tree.
Remove min/max follows the left/right branches of the binary tree to find the min or max, then removes it. You need to swap the last number in the array into the hole produced by the removal. This is when you need the back pointers from the array back into the tree.
Transform toggles a bit for the remove min/max operation
Random sample either picks k or (n-k) unique ints in the range 0...n-1 (depending whether 2k < n). The random sample is either the elements at the k locations in the number array, or it's the elements at all but the (n-k) locations in the number array.
Creating a set of k unique ints in the range 0..n can be done in O(k) time, assuming that (uninitialized) memory can be allocated in O(1) time.
First, assume that you have a way of knowing if memory is uninitialized or not. Then, you could have an uninitialized array of size n, and do the usual k-steps of a Fisher-Yates shuffle, except every time you access an element of the array (say, index i), if it's uninitialized, then you can initialize it to value i. This avoids initializing the entire array which allows the shuffle to be done in O(k) time rather than O(n) time.
Second, obviously it's not possible in general to know if memory is uninitialized or not, but there's a trick you can use (at the cost of doubling the amount of memory used) that lets you implement a sparse array in uninitialized memory. It's described in depth on Russ Cox's blog here: http://research.swtch.com/sparse
This gives you an O(k) way of randomly selecting k numbers. If k is large (ie: > n/2) you can do the selection of (n-k) numbers instead of k numbers, but you still need to return the non-selected numbers to the user, which is always going to be O(k) if you copy them out, so the faster selection gains you nothing.
A simpler approach, if you don't mind giving out access to your internal data-structure, is to do k or n-k steps of the Fisher-Yates shuffle on the underlying array (depending whether k < n/2, and being careful to update the corresponding nodes in the tree to maintain their values), and then return either a[0..k-1] or a[k..n-1]. In this case, the returned value will only be valid until the next operation on the datastructure. This method is O(min(k, n-k)).

What is the best way to implement a double-ended priority queue?

I would like to implement a double-ended priority queue with the following constraints:
needs to be implemented in a fixed size array..say 100 elements..if new elements need to be added after the array is full, the oldest needs to be removed
need maximum and minimum in O(1)
if possible insert in O(1)
if possible remove minimum in O(1)
clear to empty/init state in O(1) if possible
count of number of elements in array at the moment in O(1)
I would like O(1) for all the above 5 operations but its not possible to have O(1) on all of them in the same implementation. Atleast O(1) on 3 operations and O(log(n)) on the other 2 operations should suffice.
Will appreciate if any pointers can be provided to such an implementation.
There are many specialized data structures for this. One simple data structure is the min-max heap, which is implemented as a binary heap where the layers alternate between "min layers" (each node is less than or equal to its descendants) and "max layers" (each node is greater than or equal to its descendants.) The minimum and maximum can be found in time O(1), and, as in a standard binary heap, enqueues and dequeues can be done in time O(log n) time each.
You can also use the interval heap data structure, which is another specialized priority queue for the task.
Alternatively, you can use two priority queues - one storing elements in ascending order and one in descending order. Whenever you insert a value, you can then insert elements into both priority queues and have each store a pointer to the other. Then, whenever you dequeue the min or max, you can remove the corresponding element from the other heap.
As yet another option, you could use a balanced binary search tree to store the elements. The minimum and maximum can then be found in time O(log n) (or O(1) if you cache the results) and insertions and deletions can be done in time O(log n). If you're using C++, you can just use std::map for this and then use begin() and rbegin() to get the minimum and maximum values, respectively.
Hope this helps!
A binary heap will give you insert and remove minimum in O(log n) and the others in O(1).
The only tricky part is removing the oldest element once the array is full. For this, keep another array:
time[i] = at what position in the heap array is the element
added at time i + 100 * k.
Every 100 iterations, you increment k.
Then, when the array fills up for the first time, you remove heap[ time[0] ], when it fills up for the second time you remove heap[ time[1] ], ..., when it fills up for the 100th time, you wrap around and remove heap[ time[0] ] again etc. When it fills up for the kth time, you remove heap[ time[k % 100] ] (100 is your array size).
Make sure to also update the time array when you insert and remove elements.
Removal of an arbitrary element can be done in O(log n) if you know its position: just swap it with the last element in your heap array, and sift down the element you have swapped in.
If you absolutely need max and min to be O(1) then what you can do is create a linked list, where you constantly keep track of min, max, and size, and then link all the nodes to some sort of tree structure, probably a heap. Min, max, and size would all be constant, and since finding any node would be in O(log n), insert and remove are log n each. Clearing would be trivial.
If your queue is a fixed size, then O-notation is meaningless. Any O(log n) or even O(n) operation is essentially O(1) because n is fixed, so what you really want is an algorithm that's fast for the given dataset. Probably two parallel traditional heap priority queues would be fine (one for high, one for low).
If you know more about what kind of data you have, you might be able to make something more special-purpose.

Fastest method for Queue Implementation in Java

The task is to implement a queue in java with the following methods:
enqueue //add an element to queue
dequeue //remove element from queue
peekMedian //find median
peekMinimum //find minimum
peakMaximum //find maximum
size // get size
Assume that ALL METHODS ARE CALLED In EQUAL FREQUENCY, the task is to have the fastest implementation.
My Current Approach:
Maintain a sorted array, in addition to the queue, so enqueue and dequeue are take O(logn) and peekMedian, peekMaximum, peekMinimum all take O(1) time.
Please suggest a method that will be faster, assuming all methods are called in equal frequency.
Well, you are close - but there is still something missing, since inserting/deleting from a sorted array is O(n) (because at probability 1/2 the inserted element is at the first half of the array, and you will have to shift to the right all the following elements, and there are at least n/2 of these, so total complexity of this operation is O(n) on average + worst case)
However, if you switch your sorted DS to a skip list/ balanced BST - you are going to get O(logn) insertion/deletion and O(1) minimum/maximum/median/size (with caching)
EDIT:
You cannot get better then O(logN) for insertion (unless you decrease the peekMedian() to Omega(logN)), because that will enable you to sort better then O(NlogN):
First, note that the median moves one element to the right for each "high" elements you insert (in here, high means >= the current max).
So, by iteratively doing:
while peekMedian() != MAX:
peekMedian()
insert(MAX)
insert(MAX)
you can find the "higher" half of the sorted array.
Using the same approach with insert(MIN) you can get the lowest half of the array.
Assuming you have o(logN) (small o notation, better then Theta(logN) insertion and O(1) peekMedian(), you got yourself a sort better then O(NlogN), but sorting is Omega(NlogN) problem.
=><=
Thus insert() cannot be better then O(logN), with median still being O(1).
QED
EDIT2: Modifying the median in insertions:
If the tree size before insertion is 2n+1 (odd) then the old median is at index n+1, and the new median is at the same index (n+1), so if the element was added before the old median - you need to get the preceding node of the last median - and that's the new median. If it was added after it - do nothing, the old median is the new one as well.
If the list is even (2n elements), then after the insertion, you should increase an index (from n to n+1), so if the new element was added before the median - do nothing, if it was added after the old median, you need to set the new median as the following node from the old median.
note: In here next nodes and preceding nodes are those that follow according to the key, and index means the "place" of the node (smallest is 1st and biggest is last).
I only explained how to do it for insertion, the same ideas hold for deletion.
There is a simpler and perhaps better solution. (As has been discussed, the sorted array makes enqueue and dequeue both O(n), which is not so good.)
Maintain two sorted sets in addition to the queue. The Java library provides e.g. SortedSet, which are balanced search trees. The "low set" stores the first ceiling (n/2) elements in sorted order. The second "high set" has the last floor(n/2).
NB: If duplicates are allowed, you'll have to use something like Google's TreeMultiset instead of regular Java sorted sets.
To enqueue, just add to the queue and the correct set. If necessary, re-establish balance between the sets by moving one element: either the greatest element in the low set to the upper set or the least element in the high set to the low. Dequeuing needs the same re-balance operation.
Finding the median if n is odd is just looking up the max element in the low set. If n is even, find the max element in the low set and min in the high set and average them.
With the native Java sorted set implementation (balanced tree), this will be O(log n) for all operations. It will be very easy to code. About 60 lines.
If you implement your own sifting heaps for the low and high sets, then you'll have O(1) for the find median operation while all other ops will remain O(log n).
If you go on and implement your own Fibonacci heaps for the low and high sets, then you'll have O(1) insert as well.

Resources