Algorithmic complexity of group average clustering

I've been reading lately about various hierarchical clustering algorithms such as single-linkage clustering and group average clustering. In general, these algorithms don't tend to scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.
It is also claimed that group-average clustering can be implemented in O(N^2 logN) time. This is what my question is about.
I simply do not see how this is possible.
Explanation after explanation, such as:
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html
http://nlp.stanford.edu/IR-book/completelink.html#averagesection
https://en.wikipedia.org/wiki/UPGMA#Time_complexity
... are claiming that group average hierarchical clustering can be done in O(N^2 logN) time by using priority queues. But when I read the actual explanation or pseudo-code, it always appears to me that it is nothing better than O(N^3).
Essentially, the algorithm is as follows:
For an input sequence of size N:
    Create an N x N distance matrix                      # O(N^2) time
    For each row in the distance matrix:
        Create a priority queue (binary heap) of all distances in the row
Then:
    For i in 0 to N-1:
        Find the min element among all N priority queues    # O(N)
        Let k = the row index of the min element
        Merge the min element with its nearest neighbor
        For each element e in the kth row:
            Update the corresponding value in the distance matrix
            Update the corresponding value in priority_queue[e]
So it's that last step that, to me, would seem to make this an O(N^3) algorithm. There's no way to "update" an arbitrary value in the priority queue without scanning the queue in O(N) time, assuming the priority queue is a binary heap. (A binary heap gives you constant access to the min element and O(log N) insertion/deletion, but you can't find an element by value in better than O(N) time.) And since we'd scan a priority queue for each row element, for each row, we get O(N^3).
The priority queue is sorted by a distance value - but the algorithm in question calls for deleting the element in the priority queue which corresponds to k, the row index in the distance matrix of the min element. Again, there's no way to find this element in the queue without an O(N) scan.
So, I assume I'm probably wrong since everyone else is saying otherwise. Can someone explain how this algorithm is somehow not O(N^3), but in fact, O(N^2 logN) ?

I think you are saying that the problem is that in order to update an entry in a heap you have to find it, and finding it takes O(N) time. What you can do to get around this is to maintain an index that gives, for each item i, its location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant, you then need to modify two entries in heapPos to keep the index correct, but this is just a constant factor on the work done in the heap.
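For concreteness, here is a minimal sketch of that idea in Python (class and method names are mine, not from the linked texts): a binary min-heap that also maintains the heapPos index, so the priority of an arbitrary item can be changed, or the item removed, in O(log n) without scanning.

class IndexedMinHeap:
    """Binary min-heap over item ids that also maintains a position index
    (the heapPos array from the answer above), so an arbitrary item's
    priority can be changed, or the item removed, in O(log n) time."""

    def __init__(self):
        self.heap = []   # heap[slot] = item id stored at that slot
        self.prio = {}   # prio[item] = current priority (e.g. a distance)
        self.pos = {}    # pos[item]  = slot of item in self.heap ("heapPos")

    def _swap(self, i, j):
        # Each swap also fixes the two affected position entries: O(1) extra work.
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i]] = i
        self.pos[self.heap[j]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.prio[self.heap[i]] >= self.prio[self.heap[parent]]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.prio[self.heap[child]] < self.prio[self.heap[smallest]]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def push(self, item, priority):
        self.prio[item] = priority
        self.heap.append(item)
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def peek_min(self):
        return self.heap[0], self.prio[self.heap[0]]

    def update(self, item, new_priority):
        # O(log n): jump straight to the item's slot via pos, then restore the heap.
        i = self.pos[item]
        old = self.prio[item]
        self.prio[item] = new_priority
        if new_priority < old:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def remove(self, item):
        # O(log n) removal of an arbitrary item, again via the position index.
        i = self.pos[item]
        last = self.heap.pop()
        del self.pos[item]
        del self.prio[item]
        if i < len(self.heap):
            self.heap[i] = last
            self.pos[last] = i
            self._sift_up(i)
            self._sift_down(self.pos[last])

In the clustering pseudo-code above, priority_queue[e] would be one such heap keyed by the distances in row e; the "update the corresponding value" step then becomes a single update() (or remove() plus push()) call, which is where the O(log N) per update, and hence O(N^2 log N) overall, comes from.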

If you store the positions in the heap (which adds another O(n) memory), you can update the heap without scanning, touching only the changed positions. These updates are restricted to two paths in the heap (one for the removal, one for the update) and execute in O(log n). Alternatively, you could binary-search by the old priority, which will likely be O(log n) too (but slower; the position-lookup approach above is O(1)).
So IMHO you can indeed implement these in O(n^2 log n). But the implementation will still use O(n^2) memory, and anything that needs O(n^2) memory does not scale: you usually run out of memory before you run out of time.
Implementing these data structures is quite tricky, and when not done well they may end up slower than a theoretically worse approach. Fibonacci heaps, for example, have nice properties on paper but constant costs that are too high to pay off in practice.

No, because the distance matrix is symmetrical.
If the first entry in row 0 is to column 5 with a distance of 1, and that is the lowest in the system, then the first entry in row 5 must be the complementary entry to column 0, also with a distance of 1.
In fact you only need a half matrix.

Related

Inserting into data structure in O(1) time while maintaining index order

Is there a data structure with elements that can be indexed whose insertion runtime is O(1)? So for example, I could index the data structure like so: a[4], and yet when inserting an element at an arbitrary place in the data structure that the runtime is O(1)? Note that the data structure does not maintain sorted order, just the ability for each sequential element to have an index.
I don't think it's possible, since inserting somewhere that is not at the end or beginning of the ordered data structure would mean that all the indices after the insertion point must be updated to reflect that their index has increased by 1, which takes O(n) time in the worst case. If the answer is no, could someone prove it mathematically?
EDIT:
To clarify, I want to maintain the order of insertion of elements, so upon inserting, the item inserted remains sequentially between the two elements it was placed between.
The problem that you are looking to solve is called the list labeling problem.
There are lower bounds on the cost that depend on the relationship between the maximum number of labels you need (n) and the number of possible labels (m).
If n is in O(log m), i.e., if the number of possible labels is exponential in the number of labels you need at any one time, then O(1) cost per operation is achievable... but this is not the usual case.
If n is in O(m), i.e., if they are proportional, then O(log^2 n) per operation is the best you can do, and the algorithm is complicated.
If m is at least n^2, i.e., there are polynomially many labels, then you can do O(log n). Amortized O(log n) is simple, and O(log n) worst case is hard. Both algorithms are described in this paper by Dietz and Sleator. The hard way makes use of the O(log^2 n) algorithm mentioned above.
HOWEVER, maybe you don't really need labels. If you just need to be able to compare the order of two items in the collection, then you are solving a slightly different problem called "list order maintenance". This problem can actually be solved in constant time -- O(1) cost per operation and O(1) cost to compare the order of two items -- although again O(1) amortized cost is a lot easier to achieve.
When inserting into slot i, append the element that was previously at slot i to the end of the sequence.
If the sequence capacity must be grown, then this growing may not necessarily be O(1).
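For what it's worth, here is a tiny Python illustration of that trick; note that it achieves O(1) indexed insertion precisely by giving up the order-preservation that the EDIT asks for.

def insert_at(a, i, x):
    """Put x into slot i of list a in O(1) amortized time by displacing
    the previous occupant of slot i to the end of the list."""
    if i < len(a):
        a.append(a[i])   # old a[i] moves to the end
        a[i] = x
    else:
        a.append(x)

a = [10, 20, 30, 40]
insert_at(a, 1, 99)
print(a)   # [10, 99, 30, 40, 20] -- indexable, but relative order is not kept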

Finding k smallest elements in a min heap - worst-case complexity

I have a minimum heap with n elements and want to find k smallest numbers from this heap. What is the worst-case complexity?
Here is my approach: somewhere on Stack Overflow I read that the complexity of finding the i-th smallest number in a min heap is O(i). So if we would like to find the n-1 smallest numbers (n is pointless, since that would be the entire heap), the total complexity would look something like this:
O(n-1)+O(n-2)+O(n-3)+…+O(2)+O(1)=O((1+n-1)*(n/2))=O(n^2)
Is this correct?
No, the time is much better than that. O(k log(n)) very easily, and O(k) if you're smart.
Finding and removing the smallest element from the heap is O(log(n)). This leads to O(k log(n)) time very easily.
BUT the result that you are thinking about is https://ac.els-cdn.com/S0890540183710308/1-s2.0-S0890540183710308-main.pdf?_tid=382a1cac-e4f7-11e7-8ac9-00000aab0f02&acdnat=1513713791_08f4df78a8821855e8ec788da063ea2f which shows how to find the value of the kth smallest number in time O(k). Now you use the fact that a heap is a binary tree: start from the root and do a recursive search for every number you find that is smaller than that value. Then fill out the rest of your list with copies of the k'th smallest number.
In that search you will wind up looking at up to k-1 elements that are smaller than that value, and for some of them you will look at up to 2 children that are too large to bother with, for a maximum of 3k-3 nodes visited. This makes the whole algorithm O(k).
That link died due to bitrot. Hopefully https://www.sciencedirect.com/science/article/pii/S0890540183710308 lasts longer.
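To make that search step concrete, here is a small sketch in Python. It assumes the heap is stored as a flat array (children of node i at 2i+1 and 2i+2) and that v, the value of the k-th smallest element, has already been found by something like Frederickson's O(k) selection algorithm; the function name is mine, not from the paper.

def k_smallest_given_threshold(heap, k, v):
    """Return the k smallest values of a flat-array min-heap, given v,
    the value of the k-th smallest element. Only nodes with value < v
    are expanded (at most k-1 of them), plus at most 2 pruned children
    each, so the work is O(k). Results are not sorted."""
    out = []
    stack = [0] if heap else []
    while stack and len(out) < k:
        i = stack.pop()
        if i < len(heap) and heap[i] < v:
            out.append(heap[i])
            stack.append(2 * i + 1)
            stack.append(2 * i + 2)
    out += [v] * (k - len(out))   # pad with copies of the k-th smallest value
    return out

# Example: the list below is a valid min-heap; its 3rd smallest value is 3.
h = [1, 3, 2, 7, 4, 9, 5]
print(k_smallest_given_threshold(h, 3, 3))   # [1, 2, 3] (order not guaranteed)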
I am doubtful that it is possible to identify the kth smallest element in time O(k). The best I have seen before is an O(k log k) algorithm, which also conveniently solves your problem by identifying the k smallest elements. You can read the details in another answer on StackOverflow or on Quora.
The basic idea is to manipulate a secondary heap. Initially, this secondary heap contains only the root of the original heap. At each step, the algorithm deletes the min of the secondary heap and inserts its two original children (that is, its children from the original heap) into the secondary heap.
This algorithm has a nice property that on step i, the element it deletes from the secondary heap is the ith smallest element overall. So after k steps, the set of items which have been deleted from the secondary heap are exactly the k smallest elements. This algorithm is O(k log k) because there are O(k) deletions/insertions into a secondary heap which is upper bounded in size by O(k).
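Here is a minimal Python version of this secondary-heap algorithm, using the standard heapq module and assuming the original min-heap is stored as a flat list (children of node i at 2i+1 and 2i+2).

import heapq

def k_smallest(heap, k):
    """Return the k smallest values of a flat-array binary min-heap in
    O(k log k) time, without modifying the original heap."""
    if not heap or k <= 0:
        return []
    result = []
    # Secondary heap of (value, index-in-original-heap), seeded with the root.
    candidates = [(heap[0], 0)]
    while candidates and len(result) < k:
        value, i = heapq.heappop(candidates)   # i-th deletion = i-th smallest overall
        result.append(value)
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(candidates, (heap[child], child))
    return result

h = [1, 3, 2, 7, 4, 9, 5]    # already a valid min-heap
print(k_smallest(h, 4))      # [1, 2, 3, 4]

Each pop and push touches a secondary heap of size O(k), so the whole loop is O(k log k), matching the analysis above.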
EDIT: I stand corrected! btilly's answer provides a solution in O(k) using a result from this paper.
There is a recent (2019) algorithm that finds the k smallest elements of a binary min-heap in time O(k) that uses the soft heap data structure. This is a dramatically simpler algorithm than Frederickson's original O(k)-time heap selection algorithm. See “Selection from Heaps, Row-Sorted Matrices, and X+Y Using Soft Heaps” by Kaplan et al.

What is the time complexity of sorting half an array using a priority queue?

So I have developed a Priority Queue using a Min Heap, and according to online tutorials it takes O(n log n) time to sort an entire array using a Priority Queue. This is because we extract n times, and for every extraction we have to perform a priority fix which takes O(log n) time. Hence it is O(n log n).
However, if I only want to sort half the array every single time, would it still be O(n log n) time, or would it be just O(log n)? The reason I want to do this is that I want the element with middle priority, and extracting half the elements seems like the only way to get it using a priority queue, unless there is a more intuitive way of getting the element with middle priority.
I think that the question is in two parts, so I will answer in two parts:
(a) If I understand you correctly, by sorting "half an array" you mean obtaining a sorted array of the (n/2) smallest values of the given array. This will have to take O(n lg n) time. If there were a technique for doing this in less than O(n lg n) time, then whenever we wanted to sort an array of n values whose maximum value is known to be v (and we can obtain the maximum value in O(n) time), we could construct an array of 2n elements, where the first half is the original array and the second half is filled with a value larger than v. Then, applying the hypothetical technique, we could in effect sort the original array in less than O(n lg n) time, which is known to be impossible for comparison-based sorting.
(b) But if I am correct in understanding "the element with middle priority" as the median element in an array, you may be interested in this question.
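For reference, here is a short Python sketch of the "extract half the queue" approach described in the question, i.e. the O(n log n) route from part (a); the selection-based idea hinted at in part (b) would instead find the median in O(n) expected time.

import heapq

def middle_by_popping(values):
    """O(n log n): pop elements off a min-heap until the middle one surfaces.
    This mirrors the 'extract half the elements' approach from the question."""
    heap = list(values)
    heapq.heapify(heap)                  # O(n)
    for _ in range(len(heap) // 2):      # ~n/2 extractions, O(log n) each
        heapq.heappop(heap)
    return heap[0]                       # element with "middle priority"

print(middle_by_popping([7, 1, 5, 3, 9]))   # 5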

Fastest method for Queue Implementation in Java

The task is to implement a queue in java with the following methods:
enqueue //add an element to queue
dequeue //remove element from queue
peekMedian //find median
peekMinimum //find minimum
peekMaximum //find maximum
size // get size
Assume that ALL METHODS ARE CALLED IN EQUAL FREQUENCY; the task is to have the fastest implementation.
My Current Approach:
Maintain a sorted array, in addition to the queue, so enqueue and dequeue take O(log n), and peekMedian, peekMaximum, peekMinimum all take O(1) time.
Please suggest a method that will be faster, assuming all methods are called in equal frequency.
Well, you are close - but there is still something missing, since inserting into/deleting from a sorted array is O(n) (with probability 1/2 the inserted element falls in the first half of the array, and you then have to shift all the following elements to the right; there are at least n/2 of these, so the total cost of this operation is O(n) both on average and in the worst case).
However, if you switch your sorted data structure to a skip list / balanced BST, you get O(log n) insertion/deletion and O(1) minimum/maximum/median/size (with caching).
EDIT:
You cannot get better than O(log N) for insertion (unless you let peekMedian() degrade to Omega(log N)), because that would enable you to sort faster than O(N log N):
First, note that the median moves one element to the right for every two "high" elements you insert (here, "high" means >= the current max).
So, by iteratively doing:
while peekMedian() != MAX:
    peekMedian()    # read off the current median; these reads enumerate the upper half
    insert(MAX)
    insert(MAX)     # two "high" inserts advance the median by one position
you can find the "higher" half of the sorted array.
Using the same approach with insert(MIN) you can get the lowest half of the array.
Assuming you have o(log N) insertion (little-o notation, i.e., strictly better than Theta(log N)) and O(1) peekMedian(), you got yourself a sort faster than O(N log N); but sorting is an Omega(N log N) problem.
=><=
Thus insert() cannot be better than O(log N) while the median is still O(1).
QED
EDIT2: Modifying the median in insertions:
If the tree size before insertion is 2n+1 (odd), then the old median is at index n+1, and the new median is at the same index (n+1). So if the element was added before the old median, you need to take the node preceding the old median, and that's the new median. If it was added after it, do nothing; the old median is the new one as well.
If the list size is even (2n elements), then after the insertion the median index increases from n to n+1. So if the new element was added before the old median, do nothing; if it was added after the old median, set the new median to the node following the old median.
Note: here, "next" and "preceding" nodes are those that follow or precede according to the key, and "index" means the rank of the node (smallest is 1st and biggest is last).
I only explained how to do it for insertion; the same ideas hold for deletion.
There is a simpler and perhaps better solution. (As has been discussed, the sorted array makes enqueue and dequeue both O(n), which is not so good.)
Maintain two sorted sets in addition to the queue. The Java library provides e.g. TreeSet, a SortedSet implementation backed by a balanced search tree. The "low set" stores the smallest ceil(n/2) elements in sorted order. The second "high set" has the largest floor(n/2).
NB: If duplicates are allowed, you'll have to use something like Google's TreeMultiset instead of regular Java sorted sets.
To enqueue, just add to the queue and the correct set. If necessary, re-establish balance between the sets by moving one element: either the greatest element in the low set to the upper set or the least element in the high set to the low. Dequeuing needs the same re-balance operation.
Finding the median if n is odd is just looking up the max element in the low set. If n is even, find the max element in the low set and min in the high set and average them.
With the native Java sorted set implementation (balanced tree), this will be O(log n) for all operations. It will be very easy to code. About 60 lines.
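Here is a compact sketch of this scheme in Python, using the third-party sortedcontainers.SortedList as a stand-in for Java's TreeSet / Guava's TreeMultiset mentioned above (it tolerates duplicates); the names and details are illustrative, not a drop-in solution for the Java task.

from collections import deque
from sortedcontainers import SortedList   # third-party multiset-like sorted container

class MedianQueue:
    """FIFO queue that also answers min/median/max, following the
    two-sorted-sets scheme described above (with real balanced trees,
    every operation is O(log n))."""

    def __init__(self):
        self.fifo = deque()       # arrival order, for dequeue
        self.low = SortedList()   # smallest ceil(n/2) elements
        self.high = SortedList()  # largest floor(n/2) elements

    def _rebalance(self):
        if len(self.low) > len(self.high) + 1:
            self.high.add(self.low.pop(-1))   # move greatest of low up
        elif len(self.high) > len(self.low):
            self.low.add(self.high.pop(0))    # move least of high down

    def enqueue(self, x):
        self.fifo.append(x)
        if self.low and x > self.low[-1]:
            self.high.add(x)
        else:
            self.low.add(x)
        self._rebalance()

    def dequeue(self):
        x = self.fifo.popleft()
        if x in self.low:
            self.low.remove(x)
        else:
            self.high.remove(x)
        self._rebalance()
        return x

    def peek_minimum(self):
        return self.low[0]

    def peek_maximum(self):
        return self.high[-1] if self.high else self.low[-1]

    def peek_median(self):
        if len(self.low) > len(self.high):          # odd size
            return self.low[-1]
        return (self.low[-1] + self.high[0]) / 2    # even size: average the two middles

    def size(self):
        return len(self.fifo)

q = MedianQueue()
for x in [5, 1, 9, 3]:
    q.enqueue(x)
print(q.peek_minimum(), q.peek_median(), q.peek_maximum())   # 1 4.0 9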
If you implement your own sifting heaps for the low and high sets, then you'll have O(1) for the find median operation while all other ops will remain O(log n).
If you go on and implement your own Fibonacci heaps for the low and high sets, then you'll have O(1) insert as well.

Intuitive explanation for why QuickSort is n log n?

Is anybody able to give a 'plain English' intuitive, yet formal, explanation of what makes QuickSort n log n? From my understanding it has to make a pass over n items, and it does this log n times... I'm not sure how to put into words why it does this log n times.
Complexity
A Quicksort starts by partitioning the input into two chunks: it chooses a "pivot" value, and partitions the input into those less than the pivot value and those larger than the pivot value (and any equal to the pivot value go into one or the other; for a basic description it doesn't matter much which those end up in).
Since the input (by definition) isn't sorted, to partition it like that, it has to look at every item in the input, so that's an O(N) operation. After it's partitioned the input the first time, it recursively sorts each of those "chunks". Each of those recursive calls looks at every one of its inputs, so between the two calls it ends up visiting every input value (again). So, at the first "level" of partitioning, we have one call that looks at every input item. At the second level, we have two partitioning steps, but between the two, they (again) look at every input item. Each successive level has more individual partitioning steps, but in total the calls at each level look at all the input items.
It continues partitioning the input into smaller and smaller pieces until it reaches some lower limit on the size of a partition. The smallest that could possibly be would be a single item in each partition.
Ideal Case
In the ideal case we hope each partitioning step breaks the input in half. The "halves" probably won't be precisely equal, but if we choose the pivot well, they should be pretty close. To keep the math simple, let's assume perfect partitioning, so we get exact halves every time.
In this case, the number of times we can break it in half will be the base-2 logarithm of the number of inputs. For example, given 128 inputs, we get partition sizes of 64, 32, 16, 8, 4, 2, and 1. That's 7 levels of partitioning (and yes log2(128) = 7).
So, we have log(N) partitioning "levels", and each level has to visit all N inputs. So, log(N) levels times N operations per level gives us O(N log N) overall complexity.
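To tie the pieces together, here is a deliberately bare-bones quicksort in Python with the O(N) partitioning pass and the two recursive calls called out; it is not in-place or otherwise optimized, just an illustration of where the N and the log N come from.

def quicksort(a):
    """Minimal quicksort for illustration (not in-place, not optimized)."""
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    # Partitioning: one O(N) pass over the current chunk.
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    # Two recursive calls; together they look at (almost) every element again,
    # so each "level" of the recursion does O(N) total work.
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]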
Worst Case
Now let's revisit that assumption that each partitioning level will "break" the input precisely in half. Depending on how good a choice of partitioning element we make, we might not get precisely equal halves. So what's the worst that could happen? The worst case is a pivot that's actually the smallest or largest element in the input. In this case, we do an O(N) partitioning level, but instead of getting two halves of equal size, we've ended up with one partition of one element, and one partition of N-1 elements. If that happens for every level of partitioning, we obviously end up doing O(N) partitioning levels before every partition is down to one element.
This gives the technically correct big-O complexity for Quicksort (big-O officially refers to the upper bound on complexity). Since we have O(N) levels of partitioning, and each level requires O(N) steps, we end up with O(N * N) (i.e., O(N^2)) complexity.
Practical implementations
As a practical matter, a real implementation will typically stop partitioning before it actually reaches partitions of a single element. In a typical case, when a partition contains, say, 10 elements or fewer, you'll stop partitioning and use something like an insertion sort (since it's typically faster for a small number of elements).
Modified Algorithms
More recently, other modifications to Quicksort have been invented (e.g., Introsort, PDQ Sort) which prevent that O(N^2) worst case. Introsort does so by keeping track of the current partitioning "level", and when/if it goes too deep, it'll switch to a heap sort, which is slower than Quicksort for typical inputs, but guarantees O(N log N) complexity for any inputs.
PDQ sort adds another twist to that: since heap sort is slower, it tries to avoid switching to heap sort if possible. To do that, if it looks like it's getting poor pivot values, it'll randomly shuffle some of the inputs before choosing a pivot. Then, if (and only if) that fails to produce sufficiently better pivot values, it'll switch to using a heap sort instead.
Each partitioning operation takes O(n) operations (one pass on the array).
On average, each partitioning divides the array into two parts, which gives about log n levels of partitioning. In total we have O(n * log n) operations.
That is, on average there are log n levels of partitioning, and each level takes O(n) operations.
There's a key intuition behind logarithms:
The number of times you can divide a number n by a constant before reaching 1 is O(log n).
In other words, if you see a runtime that has an O(log n) term, there's a good chance that you'll find something that repeatedly shrinks by a constant factor.
In quicksort, what's shrinking by a constant factor is the size of the largest recursive call at each level. Quicksort works by picking a pivot, splitting the array into two subarrays of elements smaller than the pivot and elements bigger than the pivot, then recursively sorting each subarray.
If you pick the pivot randomly, then there's a 50% chance that the chosen pivot will be in the middle 50% of the elements, which means that there's a 50% chance that the larger of the two subarrays will be at most 75% the size of the original. (Do you see why?)
Therefore, a good intuition for why quicksort runs in time O(n log n) is the following: each layer in the recursion tree does O(n) work, and since each recursive call has a good chance of reducing the size of the array by at least 25%, we'd expect there to be O(log n) layers before you run out of elements to throw away out of the array.
This assumes, of course, that you're choosing pivots randomly. Many implementations of quicksort use heuristics to try to get a nice pivot without too much work, and those implementations can, unfortunately, lead to poor overall runtimes in the worst case. @Jerry Coffin's excellent answer to this question talks about some variations on quicksort that guarantee O(n log n) worst-case behavior by switching which sorting algorithms are used, and that's a great place to look for more information about this.
Well, it's not always O(n log n). That is the running time when the chosen pivot is approximately in the middle. In the worst case, if you choose the smallest or the largest element as the pivot, the time will be O(n^2).
To visualize 'n log n', you can assume the pivot to be element closest to the average of all the elements in the array to be sorted.
This would partition the array into 2 parts of roughly same length.
On both of these you apply the quicksort procedure.
Since each step halves the length of the array, you will do this log n (base 2) times until you reach length 1, i.e. a sorted array of 1 element.
Break the sorting algorithm into two parts: the partitioning and the recursive calls. The complexity of partitioning is O(N), and the depth of recursion in the ideal case is O(log N). For example, if you have 4 inputs then there will be 2 (= log2 4) levels of recursive calls. Multiplying the two gives O(N log N). It is a very basic explanation.
In fact, you need to find the correct position of all N elements (each becomes a pivot at some point), but the maximum number of comparisons is log N per element (the first pivot is compared against N elements, the second against N/2, the third against N/4, ... assuming the pivot is always the median element).
In the ideal scenario, the first-level call places 1 element in its proper position. There are 2 calls at the second level taking O(n) time combined, and they put 2 more elements in their proper positions. In the same way, there are 4 calls at the 3rd level taking O(n) combined time, placing 4 elements in their proper positions. So the depth of the recursion tree is log(n), and at each depth O(n) time is needed across all recursive calls; hence the time complexity is O(n log n).
