Fastest method for Queue Implementation in Java - algorithm

The task is to implement a queue in Java with the following methods:
enqueue // add an element to the queue
dequeue // remove an element from the queue
peekMedian // find the median
peekMinimum // find the minimum
peekMaximum // find the maximum
size // get the size
Assume that ALL METHODS ARE CALLED WITH EQUAL FREQUENCY; the task is to have the fastest implementation.
My Current Approach:
Maintain a sorted array in addition to the queue, so that enqueue and dequeue take O(log n) and peekMedian, peekMaximum, and peekMinimum all take O(1) time.
Please suggest a method that will be faster, assuming all methods are called in equal frequency.

Well, you are close - but there is still something missing: inserting into / deleting from a sorted array is O(n), not O(log n). With probability 1/2 the inserted element belongs in the first half of the array, and you then have to shift all the following elements one position to the right - at least n/2 of them - so this operation is O(n) on average and in the worst case.
However, if you switch your sorted data structure to a skip list or a balanced BST, you get O(log n) insertion/deletion and O(1) minimum/maximum/median/size (with caching).
EDIT:
You cannot get better than O(log N) insertion (unless you slow peekMedian() down to Omega(log N)), because that would let you sort faster than O(N log N):
First, note that the median moves one element to the right for every two "high" elements you insert (here, "high" means >= the current max).
So, by iteratively doing:
while peekMedian() != MAX:
    peekMedian()
    insert(MAX)
    insert(MAX)
you can find the "higher" half of the sorted array.
Using the same approach with insert(MIN) you can get the lowest half of the array.
Assuming you have o(log N) insertion (small o notation, i.e., strictly better than Theta(log N)) and O(1) peekMedian(), you have a sort that is faster than O(N log N) - but sorting is an Omega(N log N) problem.
=><= (contradiction)
Thus insert() cannot be better than O(log N) while the median remains O(1).
QED
EDIT2: Maintaining the median during insertions:
If the tree size before the insertion is 2n+1 (odd), the old median is at index n+1 and the new median stays at the same index (n+1). So if the element was added before the old median, the new median is the predecessor of the old median; if it was added after it, do nothing - the old median is the new one as well.
If the size is even (2n elements), the median index increases after the insertion (from n to n+1). So if the new element was added before the old median, do nothing; if it was added after the old median, the new median is the successor of the old median.
Note: here successor and predecessor follow key order, and "index" means the rank of the node (smallest is 1st, largest is last).
I only explained how to do it for insertion; the same ideas hold for deletion.
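Here is a minimal sketch of that insertion rule, assuming the sorted structure also threads its nodes into a doubly linked list in key order (a skip list does this at its bottom level; a BST can keep predecessor/successor pointers) and that ties are inserted after equal keys. The names (MedianTracker, Node, afterInsert) are illustrative, not from the post:

final class MedianTracker {
    static final class Node { int key; Node prev, next; }   // threaded in key order

    Node median;   // node at rank ceil(size / 2), 1-based
    int size;      // number of elements before the current insertion

    // Call after the new node has been linked into the sorted structure.
    void afterInsert(Node inserted) {
        if (size == 0) {
            median = inserted;
        } else if (size % 2 == 1) {                          // old size 2n+1: median rank stays at n+1
            if (inserted.key < median.key) median = median.prev;
        } else {                                             // old size 2n: median rank moves from n to n+1
            if (inserted.key >= median.key) median = median.next;
        }
        size++;
    }
}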

There is a simpler and perhaps better solution. (As has been discussed, the sorted array makes enqueue and dequeue both O(n), which is not so good.)
Maintain two sorted sets in addition to the queue. The Java library provides, e.g., TreeSet (a SortedSet implementation backed by a balanced search tree). The "low set" stores the smallest ceil(n/2) elements in sorted order. The second "high set" has the largest floor(n/2).
NB: If duplicates are allowed, you'll have to use something like Google's TreeMultiset instead of regular Java sorted sets.
To enqueue, just add to the queue and the correct set. If necessary, re-establish balance between the sets by moving one element: either the greatest element in the low set to the upper set or the least element in the high set to the low. Dequeuing needs the same re-balance operation.
Finding the median if n is odd is just looking up the max element in the low set. If n is even, find the max element in the low set and min in the high set and average them.
With the native Java sorted set implementation (balanced tree), this will be O(log n) for all operations. It will be very easy to code. About 60 lines.
If you implement your own sifting heaps for the low and high sets, then you'll have O(1) for the find median operation while all other ops will remain O(log n).
If you go on and implement your own Fibonacci heaps for the low and high sets, then you'll have O(1) insert as well.
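For reference, a minimal sketch of the plain TreeSet variant described above, assuming distinct elements (with duplicates you would swap in a multiset such as Guava's TreeMultiset, as noted). Class and method names are illustrative:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeSet;

final class MedianQueue {
    private final Deque<Integer> queue = new ArrayDeque<>(); // FIFO order
    private final TreeSet<Integer> low = new TreeSet<>();    // smallest ceil(n/2) elements
    private final TreeSet<Integer> high = new TreeSet<>();   // largest floor(n/2) elements

    public void enqueue(int x) {
        queue.addLast(x);
        if (low.isEmpty() || x <= low.last()) low.add(x); else high.add(x);
        rebalance();
    }

    public int dequeue() {
        int x = queue.removeFirst();
        if (!low.remove(x)) high.remove(x);
        rebalance();
        return x;
    }

    public double peekMedian() {
        return size() % 2 == 1 ? low.last() : (low.last() + high.first()) / 2.0;
    }

    public int peekMinimum() { return low.first(); }

    public int peekMaximum() { return high.isEmpty() ? low.last() : high.last(); }

    public int size() { return queue.size(); }

    // Keep the invariant: low.size() is either high.size() or high.size() + 1.
    private void rebalance() {
        if (low.size() > high.size() + 1) high.add(low.pollLast());
        else if (high.size() > low.size()) low.add(high.pollFirst());
    }
}

Every operation is a constant number of TreeSet operations plus O(1) deque work, so everything is O(log n), matching the analysis above.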

Related

Algorithmic complexity of group average clustering

I've been reading lately about various hierarchical clustering algorithms such as single-linkage clustering and group average clustering. In general, these algorithms don't tend to scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.
It is also claimed that group-average clustering can be implemented in O(N^2 logN) time. This is what my question is about.
I simply do not see how this is possible.
Explanation after explanation, such as:
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html
http://nlp.stanford.edu/IR-book/completelink.html#averagesection
https://en.wikipedia.org/wiki/UPGMA#Time_complexity
... are claiming that group average hierarchical clustering can be done in O(N^2 logN) time by using priority queues. But when I read the actual explanation or pseudo-code, it always appears to me that it is nothing better than O(N^3).
Essentially, the algorithm is as follows:
For an input sequence of size N:
    Create a distance matrix of NxN                       # O(N^2) time
    For each row in the distance matrix:
        Create a priority queue (binary heap) of all distances in the row
    Then:
    For i in 0 to N-1:
        Find the min element among all N priority queues  # O(N)
        Let k = the row index of the min element
        For each element e in the kth row:
            Merge the min element with its nearest neighbor
            Update the corresponding values in the distance matrix
            Update the corresponding value in priority_queue[e]
So it's that last step that, to me, would seem to make this an O(N^3) algorithm. There's no way to "update" an arbitrary value in the priority queue without scanning the queue in O(N) time - assuming the priority queue is a binary heap. (A binary heap gives you constant access to the min element and O(log N) insertion/deletion, but you can't find an element by value in better than O(N) time.) And since we'd scan the priority queue for each row element, for each row, we get O(N^3).
The priority queue is sorted by a distance value - but the algorithm in question calls for deleting the element in the priority queue which corresponds to k, the row index in the distance matrix of the min element. Again, there's no way to find this element in the queue without an O(N) scan.
So, I assume I'm probably wrong since everyone else is saying otherwise. Can someone explain how this algorithm is somehow not O(N^3), but in fact, O(N^2 logN) ?
I think you are saying that the problem is that in order to update an entry in a heap you have to find it, and finding it takes time O(N). What you can do to get round this is to maintain an index that gives, for each item i, its location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant you then need to modify two entries in heapPos[i] to keep the index correct, but this is just a constant factor on the work done in the heap.
If you store the positions in the heap (which adds another O(n) memory) you can update the heap without scanning, on the changed positions only. These updates are restricted to two paths on the heap (one removal, one update) and execute in O(log n). Alternatively, you could binary-search by the old priority, which will likely be in O(log n), too (but slower, above approach is O(1)).
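Here is a minimal sketch of such a position-indexed binary min-heap, where pos[item] records the slot of each item so an arbitrary entry can be re-prioritized or removed in O(log n) without scanning. It assumes items are small integer ids (in the clustering setting, column indices of a row's distance vector) and the priority is the distance; field and method names are illustrative:

final class IndexedMinHeap {
    private final int[] heap;      // heap[slot] = item id
    private final int[] pos;       // pos[item] = current slot of that item in heap[]
    private final double[] prio;   // prio[item] = current priority (e.g. a distance)
    private int size;

    IndexedMinHeap(int maxItems) {
        heap = new int[maxItems];
        pos = new int[maxItems];
        prio = new double[maxItems];
    }

    void insert(int item, double priority) {
        prio[item] = priority;
        heap[size] = item;
        pos[item] = size;
        siftUp(size);
        size++;
    }

    int peekMin() { return heap[0]; }

    // Re-prioritize an arbitrary item in O(log n) - no scanning needed.
    void update(int item, double priority) {
        double old = prio[item];
        prio[item] = priority;
        if (priority < old) siftUp(pos[item]); else siftDown(pos[item]);
    }

    // Remove an arbitrary item in O(log n): move the last slot's item into its place.
    void delete(int item) {
        int slot = pos[item];
        int moved = heap[--size];
        if (slot == size) return;          // the item was already in the last slot
        heap[slot] = moved;
        pos[moved] = slot;
        siftUp(slot);
        siftDown(pos[moved]);
    }

    private void siftUp(int slot) {
        while (slot > 0 && prio[heap[slot]] < prio[heap[(slot - 1) / 2]]) {
            swap(slot, (slot - 1) / 2);
            slot = (slot - 1) / 2;
        }
    }

    private void siftDown(int slot) {
        while (true) {
            int smallest = slot, l = 2 * slot + 1, r = 2 * slot + 2;
            if (l < size && prio[heap[l]] < prio[heap[smallest]]) smallest = l;
            if (r < size && prio[heap[r]] < prio[heap[smallest]]) smallest = r;
            if (smallest == slot) return;
            swap(slot, smallest);
            slot = smallest;
        }
    }

    private void swap(int a, int b) {
        int t = heap[a]; heap[a] = heap[b]; heap[b] = t;
        pos[heap[a]] = a;   // keep the index in sync on every swap -
        pos[heap[b]] = b;   // this is the constant-factor bookkeeping noted above
    }
}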
So IMHO you can indeed implement this in O(n^2 log n). But the implementation will still use a lot of memory - O(n^2) - and anything that needs O(n^2) memory does not scale: you usually run out of memory before you run out of time.
Implementing these data structures is quite tricky, and when not done well they may end up slower than a theoretically worse approach. Fibonacci heaps are a good example: they have nice properties on paper, but the constant costs are too high to pay off.
No, because the distance matrix is symmetrical.
If the first entry in row 0 is to column 5 with a distance of 1, and that is the lowest in the system, then the first entry in row 5 must be the complementary entry to column 0, also with a distance of 1.
In fact you only need a half matrix.

What is the time complexity of sorting half an array using a priority queue?

So I have developed a priority queue using a min-heap, and according to online tutorials it takes O(n log n) time to sort an entire array using a priority queue. This is because we extract n times, and every extraction requires a priority fix that takes log n time; hence n log n.
However, if I only want to sort half the array each time, would it still be O(n log n) time, or would it be just O(log n)? The reason I want to do this is to get the element with middle priority, and extracting half the elements seems like the only way to do it with a priority queue, unless there is a more intuitive way of getting the element with middle priority.
I think that the question is in two parts, so I will answer in two parts:
(a) If I understand you correctly, by sorting "half an array" you mean obtaining a sorted array of the n/2 smallest values of the given array. This still has to take O(n lg n) time. If there were a technique for doing it faster, then whenever we wanted to sort an array of n values whose maximum is known to be v (and we can find the maximum in O(n) time), we could construct an array of 2n elements whose first half is the original array and whose second half is filled with a value larger than v. Applying the hypothetical technique to that array would, in effect, sort the original array in time shorter than O(n lg n), which is known to be impossible for comparison sorting.
(b) But if I am correct in understanding "the element with middle priority" as the median element in an array, you may be interested in this question.

Get the k smallest elements of an array using quick sort

How would you find the k smallest elements from an unsorted array using quicksort (other than just sorting and taking the k smallest elements)? Would the worst case running time be the same O(n^2)?
You could optimize quicksort: simply don't recurse into the portions of the array other than the "first" part until your partition lands at position k. If you don't need your output sorted, you can stop there.
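A minimal sketch of that idea for the unsorted case: partition as usual, but recurse only into the side that can still affect positions 0..k-1, and stop once the partition index reaches k. Method names and the random-pivot choice are illustrative:

import java.util.concurrent.ThreadLocalRandom;

final class SmallestKQuicksort {
    // Rearranges a so that a[0..k-1] hold the k smallest elements (not sorted among themselves).
    static void smallestK(int[] a, int k) { smallestK(a, 0, a.length - 1, k); }

    private static void smallestK(int[] a, int lo, int hi, int k) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        if (p > k) smallestK(a, lo, p - 1, k);          // all k smallest are left of the pivot
        else if (p < k - 1) smallestK(a, p + 1, hi, k); // a[lo..p] are already among the k smallest
        // p == k - 1 or p == k: a[0..k-1] now hold the k smallest, so stop
    }

    // Lomuto partition with a random pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, ThreadLocalRandom.current().nextInt(lo, hi + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

With random pivots the subproblem sizes shrink geometrically on average, so the expected work is linear, like quickselect.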
Warning: non-rigorous analysis ahead.
However, I think the worst-case time complexity will still be O(n^2). That occurs when you always pick the biggest or smallest element to be your pivot, and you devolve into bubble sort (i.e. you aren't able to pick a pivot that divides and conquers).
Another solution (if the only purpose of this collection is to pick out the k min elements) is to use a min-heap limited to tree height ceil(log(k)) (i.e., to roughly k nodes). Each insert into the heap then costs at most O(log k), so processing all n elements takes O(n log k), and the same bound holds for the removals (versus O(n log n) for both in a full heapsort). This also hands the elements back in sorted order and, like mergesort, has a linearithmic worst case.
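One common way to realize that bounded-heap idea in Java is sketched below; note that the heap is ordered largest-first here, so the worst of the current k candidates can be evicted in O(log k) - whether this matches the exact layout the answer had in mind is an assumption, and the names are illustrative:

import java.util.Collections;
import java.util.PriorityQueue;

final class SmallestKHeap {
    // Returns the k smallest elements of a in ascending order, in O(n log k) time.
    static int[] smallestK(int[] a, int k) {
        if (k <= 0) return new int[0];
        PriorityQueue<Integer> worstFirst = new PriorityQueue<>(Collections.reverseOrder());
        for (int x : a) {
            if (worstFirst.size() < k) {
                worstFirst.offer(x);                 // still collecting the first k candidates
            } else if (x < worstFirst.peek()) {
                worstFirst.poll();                   // evict the worst current candidate
                worstFirst.offer(x);
            }
        }
        int[] out = new int[worstFirst.size()];
        for (int i = out.length - 1; i >= 0; i--) {  // drain largest-to-smallest, fill back to front
            out[i] = worstFirst.poll();
        }
        return out;
    }
}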

What is the most efficient way to get the top N items of the union of M sorted sets

Say you have 4 sorted sets with thousands and thousands of keys and scores. Since they are sorted sets, getting the top items can be done in logarithmic time complexity.
The easy way would be to take the union of the sets and then get the top items. But doing so is at least linear in the total number of items in all sets.
The best way I could think of is this:
Take the top N items from every set
Find the item with the lowest rank and the highest score for that rank.
Divide that score by the number of sets. (Any key with a score lower than this can never be in the top N.)
Take the union of those keys. (Ignoring scores)
Find the scores for all keys in all sets. (A key might have score 1 in one set and 10000 in another)
That is, find all keys that could possibly be in the top list and take the union over those keys only. There are probably more efficient ways to limit the number of items to consider.
[edit]
Keys occur in one or more sets, and their summed scores determines the final score.
So a key that is in all sets with a low score might have a higher score than a key with a high score that is in only one set.
The algorithm you propose seems quite awkward. Just take one of the following:
The simple way
for i = 1 to n
    loop through all sets and look at their smallest element,
    pick the smallest element and remove it from the sets
Complexity:
O(n * s) where n is the number of items you want and s is the number of sets.
Of course, if you are not allowed to remove elements from the sets, you can also maintain iterators into each set to get elements from them in sorted order without having to alter the sets.
A more efficient way
Maintain a priority queue over all the smallest elements of each set. Whenever removing the smallest element e from that priority queue, reinsert the next element from the set from which e came.
Complexity: Assume a simple priority queue with O(log n) insert and O(log n) "remove smallest element" complexity. There are better ones, like Fibonacci heaps, but this one will do just fine. Then we have:
s insertions to fill the priority queue at the start, so O(s log s).
n "delete smallest element" + insert a new one, so O(n log s) (since there are always s elements in the queue)
Thus, we achieve O(s log s + n log s) which is way better.
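A minimal Java sketch of that priority-queue merge, treating each set as an ascending sequence of values and taking the first n of the merged order (the direction of "top" is just a matter of the comparator, and the summed-score twist from the question's edit is left out). It assumes the sets are exposed as Iterables in sorted order, e.g. TreeSets; names are illustrative:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNMerge {
    // One cursor per set: the set's current smallest unconsumed element plus its iterator.
    private static final class Head {
        final int value;
        final Iterator<Integer> rest;
        Head(int value, Iterator<Integer> rest) { this.value = value; this.rest = rest; }
    }

    static List<Integer> firstN(List<? extends Iterable<Integer>> sortedSets, int n) {
        PriorityQueue<Head> heads =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        for (Iterable<Integer> set : sortedSets) {            // s insertions: O(s log s)
            Iterator<Integer> it = set.iterator();
            if (it.hasNext()) heads.offer(new Head(it.next(), it));
        }
        List<Integer> result = new ArrayList<>(n);
        while (result.size() < n && !heads.isEmpty()) {        // n rounds: O(n log s)
            Head smallest = heads.poll();                      // overall smallest remaining element
            result.add(smallest.value);
            if (smallest.rest.hasNext()) {                     // reinsert the next element from that set
                heads.offer(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return result;
    }
}

The queue never holds more than one entry per set, so filling it costs O(s log s) and each of the n rounds costs O(log s), matching the bound above.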
Comparison
As long as s is quite small, there shouldn't really be a big difference between the algorithms and you can also pick the simple one. If you have a lot of sets, then you should definitely go for the second approach.
Lookup Complexity
In my analysis, I omitted the logarithmic lookup factor for finding the smallest element of each set and assumed that the smallest element of each set can be retrieved in O(1), as in a sorted list. Varying the lookup cost from O(1) to O(log n) just introduces an additional factor that does not alter the algorithms. In addition, you usually only pay the O(log n) once, at the first lookup. Afterwards, you have an iterator to the smallest element, and accessing each further element through the iterator is only O(1).

What is the best way to implement a double-ended priority queue?

I would like to implement a double-ended priority queue with the following constraints:
needs to be implemented in a fixed-size array - say 100 elements; if a new element needs to be added once the array is full, the oldest element is removed
need maximum and minimum in O(1)
if possible insert in O(1)
if possible remove minimum in O(1)
clear to empty/init state in O(1) if possible
count of number of elements in array at the moment in O(1)
I would like O(1) for all of the above 5 operations, but it's not possible to have O(1) on all of them in the same implementation. At least O(1) on 3 operations and O(log(n)) on the other 2 operations should suffice.
Will appreciate if any pointers can be provided to such an implementation.
There are many specialized data structures for this. One simple data structure is the min-max heap, which is implemented as a binary heap where the layers alternate between "min layers" (each node is less than or equal to its descendants) and "max layers" (each node is greater than or equal to its descendants). The minimum and maximum can be found in time O(1), and, as in a standard binary heap, enqueues and dequeues can be done in O(log n) time each.
You can also use the interval heap data structure, which is another specialized priority queue for the task.
Alternatively, you can use two priority queues - one storing elements in ascending order and one in descending order. Whenever you insert a value, you can then insert elements into both priority queues and have each store a pointer to the other. Then, whenever you dequeue the min or max, you can remove the corresponding element from the other heap.
As yet another option, you could use a balanced binary search tree to store the elements. The minimum and maximum can then be found in time O(log n) (or O(1) if you cache the results) and insertions and deletions can be done in time O(log n). If you're using C++, you can just use std::map for this and then use begin() and rbegin() to get the minimum and maximum values, respectively.
Hope this helps!
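A minimal Java sketch in the spirit of the balanced-BST suggestion, extended with the question's fixed-capacity "evict the oldest" rule: two TreeSets index the same entries, one by value and one by insertion order, so insert, removeMinimum and the eviction are all O(log n), while size and clear are O(1) (min/max are O(log n) here, or O(1) if you cache them as mentioned above). Names are illustrative:

import java.util.Comparator;
import java.util.TreeSet;

final class BoundedDoubleEndedPQ {
    private static final class Entry {
        final int value;
        final long seq;          // unique, monotonically increasing insertion stamp
        Entry(int value, long seq) { this.value = value; this.seq = seq; }
    }

    private final int capacity;
    private long nextSeq = 0;
    private TreeSet<Entry> byValue =
            new TreeSet<>(Comparator.<Entry>comparingInt(e -> e.value).thenComparingLong(e -> e.seq));
    private TreeSet<Entry> byAge = new TreeSet<>(Comparator.comparingLong(e -> e.seq));

    BoundedDoubleEndedPQ(int capacity) { this.capacity = capacity; }

    void insert(int x) {
        if (byAge.size() == capacity) {           // full: drop the oldest entry
            Entry oldest = byAge.pollFirst();
            byValue.remove(oldest);
        }
        Entry e = new Entry(x, nextSeq++);
        byValue.add(e);
        byAge.add(e);
    }

    int minimum() { return byValue.first().value; }   // O(log n); cache these for O(1) if required
    int maximum() { return byValue.last().value; }

    int removeMinimum() {
        Entry min = byValue.pollFirst();
        byAge.remove(min);
        return min.value;
    }

    int size() { return byAge.size(); }

    void clear() {
        byValue = new TreeSet<>(byValue.comparator());
        byAge = new TreeSet<>(byAge.comparator());
        nextSeq = 0;
    }
}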
A binary heap will give you insert and remove minimum in O(log n) and the others in O(1).
The only tricky part is removing the oldest element once the array is full. For this, keep another array:
time[i] = position in the heap array of the element added at time i + 100 * k.
Every 100 iterations, you increment k.
Then, when the array fills up for the first time, you remove heap[ time[0] ], when it fills up for the second time you remove heap[ time[1] ], ..., when it fills up for the 100th time, you wrap around and remove heap[ time[0] ] again etc. When it fills up for the kth time, you remove heap[ time[k % 100] ] (100 is your array size).
Make sure to also update the time array when you insert and remove elements.
Removal of an arbitrary element can be done in O(log n) if you know its position: just swap it with the last element in your heap array, and sift down the element you have swapped in.
If you absolutely need max and min to be O(1) then what you can do is create a linked list, where you constantly keep track of min, max, and size, and then link all the nodes to some sort of tree structure, probably a heap. Min, max, and size would all be constant, and since finding any node would be in O(log n), insert and remove are log n each. Clearing would be trivial.
If your queue is a fixed size, then O-notation is meaningless. Any O(log n) or even O(n) operation is essentially O(1) because n is fixed, so what you really want is an algorithm that's fast for the given dataset. Probably two parallel traditional heap priority queues would be fine (one for high, one for low).
If you know more about what kind of data you have, you might be able to make something more special-purpose.
