Find median of the last K numbers from a data stream - algorithm

Problem: An odd number K is given. At the input, we gradually receive a sequence of N numbers. After reading each number (except the first K-1), print the median of the last K numbers.
I solved it using two heaps (a MaxHeap and a MinHeap, e.g. like here) and added a function remove_element(value) to both heaps. remove_element(value) iterates over the heap looking for the value to remove, removes it, and then rebalances the tree, so it works in O(K + log K).
My solution then works like this: while iterating over the stream data, I add each new element to the heaps and delete the old one that has just fallen out of the window (the (K+1)-th most recent element). So the whole algorithm works in O(N*(K + log K + log K)) = O(N*K).
I'm wondering: 1) did I estimate the time complexity of the algorithm correctly, 2) is it possible to speed this algorithm up, and 3) if not (i.e. the algorithm is optimal), how can I prove its optimality?
As for the third question, if my O(N*K) estimate is correct, I think optimality could be argued from the idea that in order to output a median we need to look at all K elements of the window, for each of the N requests.
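For reference, here is a minimal sketch of such a remove_element on top of Python's heapq (a list-based binary min-heap; for the max-heap you would store negated values). The linear scan plus re-heapify matches the O(K + log K) = O(K) estimate above; the function name and the heapq choice are mine:

import heapq

def remove_element(heap, value):
    # Remove one occurrence of `value` from a heapq-style list in O(K).
    # Assumes the value is actually present in the heap.
    idx = heap.index(value)      # linear scan: O(K)
    heap[idx] = heap[-1]         # overwrite the hole with the last element
    heap.pop()                   # shrink the heap by one
    heapq.heapify(heap)          # restore the heap property: O(K)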

For a better approach, make the heaps be heaps of (value, i) pairs. Now you don't need to remove stale values promptly: you can simply keep track of how many non-stale values are on each side, and throw away stale values when they surface at the top of a heap.
And if more than half of a heap is stale, then do garbage collection to remove stale values.
If you do this well, you should be able to solve this problem with O(k) extra memory in time O(n log(k)).
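Here is a sketch of this lazy-deletion scheme in Python with heapq. It marks stale entries with a counter of pending deletions keyed by value instead of storing (value, i) pairs, which is an equivalent way of deferring removals; the counters small_size/large_size are the "how many non-stale values are on each side" bookkeeping described above. Without the garbage-collection step the heaps may grow to O(n) entries, so this version runs in O(n log n) time and O(n) memory rather than O(n log k) time and O(k) memory; all names are mine:

import heapq
from collections import Counter

def window_medians(stream, k):
    # Return the median of the last k numbers after each number from the k-th
    # one on (k is assumed odd).
    small, large = [], []        # small: max-heap of the lower half (negated values)
                                 # large: min-heap of the upper half
    pending = Counter()          # value -> number of lazy deletions still to apply
    small_size = large_size = 0  # number of NON-stale elements on each side

    def prune(heap, sign):
        # Throw away stale values sitting on top of `heap` (sign is -1 for `small`).
        while heap and pending[sign * heap[0]] > 0:
            pending[sign * heap[0]] -= 1
            heapq.heappop(heap)

    def rebalance():
        nonlocal small_size, large_size
        if small_size > large_size + 1:      # small may hold at most one extra element
            heapq.heappush(large, -heapq.heappop(small))
            small_size -= 1; large_size += 1
            prune(small, -1)                 # small's top changed, refresh it
        elif small_size < large_size:
            heapq.heappush(small, -heapq.heappop(large))
            large_size -= 1; small_size += 1
            prune(large, 1)

    def insert(x):
        nonlocal small_size, large_size
        if not small or x <= -small[0]:
            heapq.heappush(small, -x); small_size += 1
        else:
            heapq.heappush(large, x); large_size += 1
        rebalance()

    def erase(x):
        nonlocal small_size, large_size
        pending[x] += 1
        if x <= -small[0]:                   # x logically belongs to the lower half
            small_size -= 1
            if x == -small[0]:
                prune(small, -1)
        else:
            large_size -= 1
            if large and x == large[0]:
                prune(large, 1)
        rebalance()

    window = []
    medians = []
    for i, x in enumerate(stream):
        window.append(x)
        insert(x)
        if i >= k:
            erase(window[i - k])             # element that slid out of the window
        if i >= k - 1:
            medians.append(-small[0])        # k is odd, so the median is small's top
    return medians

For example, window_medians([1, 3, 5, 7, 9, 11, 13], 3) returns [3, 5, 7, 9, 11].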

Related

Sorting a unique array in less than O(nlogn)

The question goes like this:
Assuming I have an array of real numbers X = [x1, ..., xn] and a natural constant k such that for every i, X[i] < X[i+k], write a sorting algorithm with time complexity better than O(n log n). For the purpose of the question I am allowed to use quicksort, counting sort, radix sort, bucket sort, heaps and so on.
All I've figured out so far is that if I take sublists by the remainder of the indices (after dividing by k), those sublists are sorted. But merging them in the right complexity seems impossible. I also tried using min-heaps after realizing the i-th smallest value must be within the first k*i places, but that took me O(n^2), which is too much.
I'd appreciate any guidance/help/references.
Thank you!
Something to note here is that you essentially have k sorted arrays that you could merge directly. That is, the sequence [x[i], x[i+k], x[i+2k], x[i+3k], ...] is sorted. As is [x[i+1], x[i+k+1], x[i+2k+1], ...].
So you have to merge k sorted sequences. The basic idea is:
Initialize a priority queue of size k with the first item from each sorted sequence. That is, add x[0], x[1], x[2], x[3] ... x[k-1] to the priority queue. In addition, save the index of the item in the priority queue structure.
Remove the first item from the queue and add the next item from that sequence. So if the first item you remove from the queue is x[3], then you'll want to add x[3+k] to the queue.
You'll of course have to make sure that you're not attempting to add anything from beyond the end of the array.
Continue in that fashion until the queue is empty.
Complexity is O(n log k). Basically, every item is added to and removed from a priority queue of size k. Both addition and removal are, worst case, O(log k).
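Here is a sketch of that merge in Python, with heapq as the priority queue; the queue stores (value, index) pairs so that, after popping, we know which position to advance. The function name is mine:

import heapq

def sort_k_displaced(x, k):
    # Merge the k sorted subsequences x[i], x[i+k], x[i+2k], ... in O(n log k).
    n = len(x)
    result = []
    pq = [(x[i], i) for i in range(min(k, n))]   # first item of each subsequence
    heapq.heapify(pq)                            # O(k)
    while pq:
        value, idx = heapq.heappop(pq)           # O(log k)
        result.append(value)
        if idx + k < n:                          # don't run past the end of the array
            heapq.heappush(pq, (x[idx + k], idx + k))
    return result

Python's built-in heapq.merge does the same job if you materialize the subsequences first, e.g. list(heapq.merge(*(x[i::k] for i in range(k)))).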

Algorithmic complexity of group average clustering

I've been reading lately about various hierarchical clustering algorithms such as single-linkage clustering and group average clustering. In general, these algorithms don't tend to scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.
It is also claimed that group-average clustering can be implemented in O(N^2 logN) time. This is what my question is about.
I simply do not see how this is possible.
Explanation after explanation, such as:
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html
http://nlp.stanford.edu/IR-book/completelink.html#averagesection
https://en.wikipedia.org/wiki/UPGMA#Time_complexity
... are claiming that group average hierarchical clustering can be done in O(N^2 logN) time by using priority queues. But when I read the actual explanation or pseudo-code, it always appears to me that it is nothing better than O(N^3).
Essentially, the algorithm is as follows:
For an input sequence of size N:
    Create a distance matrix of N x N            # this is O(N^2) time
    For each row in the distance matrix:
        Create a priority queue (binary heap) of all distances in the row
Then:
    For i in 0 to N-1:
        Find the min element among all N priority queues    # O(N)
        Let k = the row index of the min element
        For each element e in the kth row:
            Merge the min element with its nearest neighbor
            Update the corresponding values in the distance matrix
            Update the corresponding value in priority_queue[e]
So it's that last step that, to me, would seem to make this an O(N^3) algorithm. There's no way to "update" an arbitrary value in the priority queue without scanning the queue in O(N) time - assuming the priority queue is a binary heap. (A binary heap gives you constant access to the min element and log N insertion/deletion, but you can't simply find an element by value in better than O(N) time.) And since we'd scan a priority queue for each row element, for each row, we get O(N^3).
The priority queue is sorted by a distance value - but the algorithm in question calls for deleting the element in the priority queue which corresponds to k, the row index in the distance matrix of the min element. Again, there's no way to find this element in the queue without an O(N) scan.
So, I assume I'm probably wrong since everyone else is saying otherwise. Can someone explain how this algorithm is somehow not O(N^3), but in fact, O(N^2 logN) ?
I think you are saying that the problem is that in order to update an entry in a heap you have to find it, and finding it takes time O(N). What you can do to get round this is to maintain an index that gives, for each item i, its location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant you then need to modify two entries in heapPos[i] to keep the index correct, but this is just a constant factor on the work done in the heap.
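Here is a sketch of such a position-indexed heap in Python (the class and method names are mine). Every swap also updates pos[], which is the constant-factor overhead mentioned above, and update()/delete() locate their target in O(1) and then sift in O(log n), with no scanning:

class IndexedMinHeap:
    # Binary min-heap over items 0..n-1 that remembers each item's position,
    # so an item's priority can be changed in O(log n).

    def __init__(self, priorities):
        self.prio = list(priorities)
        self.heap = list(range(len(self.prio)))   # heap array of item ids
        self.pos = list(range(len(self.prio)))    # pos[item] = index of item in self.heap
        for i in reversed(range(len(self.heap) // 2)):
            self._sift_down(i)                    # O(n) build

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i]] = i                # keep the index in step with every swap
        self.pos[self.heap[j]] = j

    def _sift_up(self, i):
        while i > 0 and self.prio[self.heap[i]] < self.prio[self.heap[(i - 1) // 2]]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest, l, r = i, 2 * i + 1, 2 * i + 2
            if l < n and self.prio[self.heap[l]] < self.prio[self.heap[smallest]]:
                smallest = l
            if r < n and self.prio[self.heap[r]] < self.prio[self.heap[smallest]]:
                smallest = r
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def min_item(self):
        return self.heap[0], self.prio[self.heap[0]]

    def update(self, item, new_priority):
        # Change an item's priority in O(log n); pos[] replaces the O(N) scan.
        self.prio[item] = new_priority
        i = self.pos[item]
        self._sift_up(i)
        self._sift_down(i)

    def delete(self, item):
        # Remove an item in O(log n), e.g. when its cluster has been merged away.
        i = self.pos[item]
        last = len(self.heap) - 1
        self._swap(i, last)
        self.heap.pop()
        self.pos[item] = -1
        if i < last:
            self._sift_up(i)
            self._sift_down(i)

In the clustering algorithm above, each row's queue would hold the distances to the other clusters, update() is the "update the corresponding value in priority_queue[e]" step, and both run in O(log N), giving O(N^2 log N) overall.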
If you store the positions in the heap (which adds another O(n) memory) you can update the heap without scanning, on the changed positions only. These updates are restricted to two paths on the heap (one removal, one update) and execute in O(log n). Alternatively, you could binary-search by the old priority, which will likely be in O(log n), too (but slower, above approach is O(1)).
So IMHO you can indeed implement this in O(n^2 log n). But the implementation will still use a lot of memory (O(n^2)), and anything that needs O(n^2) memory does not scale: you usually run out of memory before you run out of time when you use O(n^2) memory...
Implementing these data structures is quite tricky, and when not done well they may end up being slower than a theoretically worse approach. Fibonacci heaps are a classic example: they have nice properties on paper, but their constant costs are too high to pay off.
No, because the distance matrix is symmetrical.
If the first entry in row 0 is to column 5 with a distance of 1, and that is the lowest in the system, then the first entry in row 5 must be the complementary entry to column 0, also with a distance of 1.
In fact you only need a half matrix.

Is my interpretation for the complexity O(n + k log n) in this algorithm correct?

So we were given a problem: given an array of n elements, we need to extract the k smallest elements from it. Our solution should use heaps and the complexity should be O(n + k log n). I think I may have figured out the solution, but I'd like to be sure about it.
I'd say that the array must first be built into a heap using a typical buildHeap() function, which starts at half the length of the array and calls a minHeapify() function to ensure each parent is no greater than its children. That would be O(n) all in all. Since we need to extract k times, we would use an extractMin() function, which removes the minimum value and minHeapify()s what remains to keep the min-heap property. Each extractMin() would be O(log n), and since it is done k times, this supports the overall complexity of O(n + k log n).
Does this check out? Someone told me it should also be sorted with a heapSort() function, but this didn't make sense to me, because heapSort() would add an O(n log n) to the overall complexity, and you're still able to extract the min without sorting...
Yes, you are right. You don't need heapSort(); heapify() (and minHeapify() after each extraction) is enough to keep your heap ordered.
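Here is a sketch of that reasoning in Python, using heapq (heapify is the O(n) buildHeap, heappop is the O(log n) extractMin); it assumes k <= len(arr):

import heapq

def k_smallest(arr, k):
    # Extract the k smallest elements in O(n + k log n).
    heap = list(arr)
    heapq.heapify(heap)                                 # buildHeap: O(n)
    return [heapq.heappop(heap) for _ in range(k)]      # k extractMin calls: O(k log n)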

Finding the max and min of a BIT in linear or sub-linear time

I have to perform a series of range updates on an array, i.e., adding or subtracting some constant over a range. After that I have to find the RANGE of the final array, i.e., (max - min). Initially the numbers are 1 to n.
I'm using a Binary Indexed Tree. Each update is O(log N). I want to know if there is a way to find this RANGE (or the max and min) in O(n) or less time. Conventionally, it takes O(n log n).
You need direct indexed access to the array elements since you need to address them for doing the incremental updates.
You also need to maintain a min-heap and max-heap.
When you update an element, you also need to update the corresponding entries in the two heaps, so alongside the array you need to store pointers to each element's position in both heaps.
Creating the original heaps is O(n) and any modification is O(lg n).
Why not just sort the array once? Then adding or subtracting a constant from the whole array still gives the same ordering, as does multiplying by a positive number. Maybe there's more to the picture though.
This question is almost 2 years old, hence I am not sure if this answer is going to help much. Anyway...
I have never used a BIT to answer minimum or maximum queries. And here there are range updates, which change a lot of numbers all at once, so the maximums and minimums also change. As far as I know, BITs are used for point queries, range sum queries, and the like, not much else.
In general, segment trees are the better option for minimum and maximum queries. After performing all updates, you can find those values in O(lg n) time. However, during updates you must maintain the min/max values at each affected node, which can be done using lazy propagation; the update cost is O(lg n).
To sum up, if m lg n < n for your application (m being the number of operations), you can go with a segment tree, albeit with more space.
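Here is a sketch of the segment-tree approach in Python (class and method names are mine). Lazy propagation makes "add a constant to a range" an O(lg n) operation, and because such an update shifts a node's min and max by the same constant, both are easy to maintain. For the range of the final array you only need the values stored at the root, so the final query is O(1):

class RangeAddMinMax:
    # Segment tree over positions 1..n, initialised with the values 1..n as in
    # the question. Supports range add in O(lg n) and min/max of the whole
    # array in O(1) (reading the root).

    def __init__(self, n):
        self.n = n
        self.mn = [0] * (4 * n)
        self.mx = [0] * (4 * n)
        self.lazy = [0] * (4 * n)
        self._build(1, 1, n)

    def _build(self, node, lo, hi):
        if lo == hi:
            self.mn[node] = self.mx[node] = lo     # initial value at position lo is lo
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid)
        self._build(2 * node + 1, mid + 1, hi)
        self._pull(node)

    def _pull(self, node):
        self.mn[node] = min(self.mn[2 * node], self.mn[2 * node + 1])
        self.mx[node] = max(self.mx[2 * node], self.mx[2 * node + 1])

    def _apply(self, node, delta):
        # Adding the same constant below `node` shifts its min and max by delta.
        self.mn[node] += delta
        self.mx[node] += delta
        self.lazy[node] += delta

    def _push(self, node):
        if self.lazy[node]:
            self._apply(2 * node, self.lazy[node])
            self._apply(2 * node + 1, self.lazy[node])
            self.lazy[node] = 0

    def update(self, l, r, delta, node=1, lo=None, hi=None):
        # Add delta to every element with (1-based) index in [l, r].
        if lo is None:
            lo, hi = 1, self.n
        if r < lo or hi < l:
            return
        if l <= lo and hi <= r:
            self._apply(node, delta)
            return
        self._push(node)
        mid = (lo + hi) // 2
        self.update(l, r, delta, 2 * node, lo, mid)
        self.update(l, r, delta, 2 * node + 1, mid + 1, hi)
        self._pull(node)

    def range(self):
        # max - min of the whole array.
        return self.mx[1] - self.mn[1]

For example, t = RangeAddMinMax(5) followed by t.update(2, 4, 10) turns the array into [1, 12, 13, 14, 5], and t.range() returns 13.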

Looking for a data container with O(1) indexing and O(log(n)) insertion and deletion

I'm not sure if it's possible but it seems a little bit reasonable to me, I'm looking for a data structure which allows me to do these operations:
insert an item with O(log n)
remove an item with O(log n)
find/edit the k'th-smallest element in O(1), for arbitrary k (O(1) indexing)
Of course, editing won't result in any change in the order of the elements. What makes this somewhat plausible is that I'm going to insert elements one by one in increasing order. So if, for example, I insert for the fifth time, I'm sure all four elements before this one are smaller than it and all the elements inserted after it are going to be larger.
I don't know if the requested time complexities are possible for such a data container. But here are a couple of approaches which almost achieve them.
First one is tiered vector with O(1) insertion and indexing, but O(sqrt N) deletion. Since you expect only about 10000 elements in this container and sqrt(10000)/log(10000) = 7, you get almost the required performance here. Tiered vector is implemented as an array of ring-buffers, so deleting an element requires moving all elements, following it in the ring-buffer, and moving one element from each of the following ring-buffers to the one, preceding it; indexing in this container means indexing in the array of ring-buffers and then indexing inside the ring-buffer.
It is possible to create a different container, very similar to a tiered vector, with exactly the same complexities, but working a little bit faster because it is more cache-friendly. Allocate an N-element array to store the values, and a sqrt(N)-element array of index corrections (initialized with zeros). I'll show how it works on the example of a 100-element container. To delete the element with index 56, move elements 57..60 to positions 56..59, then add 1 to elements 6..9 of the corrections array. To find the 84-th element, look up element 8 of the corrections array (its value is 1), add it to the index (84 + 1 = 85), then take the 85-th element of the main array. After about half of the elements in the main array have been deleted, it is necessary to compact the whole container to restore contiguous storage; this adds only O(1) amortized cost. For real-time applications this compaction may be performed in several smaller steps.
This approach may be extended to a trie of depth M, taking O(M) time for indexing, O(M*N^(1/M)) time for deletion and O(1) time for insertion. Just allocate an N-element array to store the values, and arrays of N^((M-1)/M), N^((M-2)/M), ..., N^(1/M) elements to store index corrections. To delete element 2345, move 4 elements in the main array, increase 5 elements in the largest "corrections" array, increase 6 elements in the next one and 7 elements in the last one. To get element 5678 from this container, add to 5678 all the corrections in elements 5, 56, 567 and use the result to index the main array. By choosing different values for M, you can balance the complexity between indexing and deletion. For example, for N = 65000 you can choose M = 4; then indexing requires only 4 memory accesses and a deletion updates 4*16 = 64 memory locations.
I wanted to point out first that if k is really a random number, then it might be worth considering that the problem might be completely different: asking for the k-th smallest element, with k uniformly at random in the range of the available elements is basically... picking an element at random. And it can be done much differently.
Here I'm assuming you actually need to select for some specific, if arbitrary, k.
Given your strong pre-condition that your elements are inserted in order, there is a simple solution:
Since your elements are given in order, just add them one by one to an array; that is you have some (infinite) table T, and a cursor c, initially c := 1, when adding an element, do T[c] := x and c := c+1.
When you want to access the k-th smallest element, just look at T[k].
The problem, of course, is that as you delete elements, you create gaps in the table, such that element T[k] might not be the k-th smallest, but the j-th smallest with j <= k, because some cells before k are empty.
It is then enough to keep track of the elements which you have deleted, to know how many of those deleted are smaller than k. How do you do this in time at most O(log n)? By using a range tree or a similar type of data structure. A range tree is a structure that lets you add integers and then query for all integers between X and Y. So, whenever you delete an item, simply add it to the range tree; and when you are looking for the k-th smallest element, make a query for all integers between 0 and k that have been deleted; say that delta of them have been deleted, then the k-th element would be at T[k+delta].
There are two slight catches, which require some fixing:
The range tree returns the range in time O(log n), but to count the number of elements in the range, you must walk through each element in the range and so this adds a time O(D) where D is the number of deleted items in the range; to get rid of this, you must modify the range tree structure so as to keep track, at each node, of the number of distinct elements in the subtree. Maintaining this count will only cost O(log n) which doesn't impact the overall complexity, and it's a fairly trivial modification to do.
In truth, making just one query will not work. Indeed, if you get delta deleted elements in range 1 to k, then you need to make sure that there are no elements deleted in range k+1 to k+delta, and so on. The full algorithm would be something along the lines of what is below.
KthSmallest(T, k) := {
    a = 1; b = k; delta = 0
    do {
        delta = deletedInRange(a, b)
        a = b + 1
        b = b + delta
    } while (delta > 0)
    return T[b]
}
The exact complexity of this operation depends on how exactly you make your deletions, but if your elements are deleted uniformly at random, then the number of iterations should be fairly small.
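Here is a sketch of that bookkeeping in Python. It uses a Fenwick tree (BIT) over the insertion positions in place of the range tree, since all the loop needs is "how many deleted positions lie in [a, b]" in O(log n). It assumes an upper bound on the number of insertions is known, that elements are deleted by their original insertion position, that each is deleted at most once, and that k never exceeds the number of live elements; names are mine:

class Fenwick:
    # Binary indexed tree over positions 1..n, used here to count deletions.
    def __init__(self, n):
        self.n = n
        self.t = [0] * (n + 1)

    def add(self, i):                 # mark position i as deleted
        while i <= self.n:
            self.t[i] += 1
            i += i & -i

    def prefix(self, i):              # number of deleted positions in 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

    def range_count(self, a, b):      # number of deleted positions in [a, b]
        return self.prefix(b) - self.prefix(a - 1) if a <= b else 0


class OrderedContainer:
    # Values arrive in increasing order; supports deletion and k-th smallest.
    def __init__(self, capacity):
        self.T = []                   # the table T from the answer above
        self.deleted = Fenwick(capacity)

    def insert(self, x):              # x is larger than everything inserted so far
        self.T.append(x)

    def delete(self, pos):            # pos = 1-based original insertion position
        self.deleted.add(pos)

    def kth_smallest(self, k):
        # The KthSmallest(T, k) loop from the pseudocode above.
        a, b = 1, k
        while True:
            delta = self.deleted.range_count(a, b)
            if delta == 0:
                return self.T[b - 1]  # T is 0-based internally
            a, b = b + 1, b + delta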
There is a Treelist (implementation for Java, with source code), which is O(lg n) for all three ops (insert, delete, index).
Actually, the accepted name for this data structure seems to be "order statistic tree". (Apart from indexing, it's also defined to support indexof(element) in O(lg n).)
By the way, O(1) is not considered much of an advantage over O(lg n). Such differences tend to be overwhelmed by the constant factor in practice. (Are you going to have 1e18 items in the data structure? If we set that as an upper bound, that's just equivalent to a constant factor of 60 or so.)
Look into heaps. Insert and removal should be O(log n) and peeking of the smallest element is O(1). Peeking or retrieval of the K'th element, however, will be O(log n) again.
EDITED: as amit stated, retrieval is more expensive than just peeking
This is probably not possible.
However, you can make certain changes in balanced binary trees to get kth element in O(log n).
Read more about it here: Wikipedia.
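The core trick is to augment each node with the size of its subtree and select by rank. A sketch in Python follows; rebalancing (and the size fix-ups inside rotations) is omitted for brevity, so this is only O(log n) on balanced input, and the full structure is the order statistic tree mentioned above:

class Node:
    __slots__ = ("key", "left", "right", "size")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                 # number of nodes in this subtree


def size(node):
    return node.size if node else 0


def insert(root, key):
    # Plain BST insert that maintains the subtree sizes (no rebalancing shown).
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    root.size = 1 + size(root.left) + size(root.right)
    return root


def kth_smallest(root, k):
    # Return the k-th smallest key (1-based) in O(height) using the size fields.
    while root is not None:
        left = size(root.left)
        if k == left + 1:
            return root.key
        if k <= left:
            root = root.left
        else:
            k -= left + 1
            root = root.right
    raise IndexError("k is out of range")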
Indexable skip lists might be able to do (close to) what you want:
http://en.wikipedia.org/wiki/Skip_lists#Indexable_skiplist
However, there are a few caveats:
It's a probabilistic data structure. That means it's not necessarily going to be O(log N) for all operations
It's not going to be O(1) for indexing, just O(log N)
Depending on the speed of your RNG and also depending on how slow traversing pointers are, you'll likely get worse performance from this than just sticking with an array and dealing with the higher cost of removals.
Most likely, something along the lines of this is going to be the "best" you can do to achieve your goals.
