find best k-means from a list of candidates - algorithm

I have an array A of n points and a candidate array S of size O(k), larger than k. I want to find k points in S such that the sum of squared distances from the points of A to their closest point among those k is minimized. One way to do it would be to compute the cost of every possible choice of k points from S and take the minimum, but that would take O(k^k*n) time. Is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use it for a coreset construction (coreset = data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume each cluster in the optimal clustering has omega(n/k) points, we can quickly build a list of O(k) candidates that contains a 3-approximation for the k-means. So I was wondering whether we can find those k points, or a constant approximation of their cost, faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
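For reference, the exhaustive baseline described in the question could look roughly like the sketch below. It assumes points are tuples of numbers; the names `clustering_cost` and `best_k_subset` are made up for illustration. It only pins down the objective and is not meant as an efficient method.

```python
from itertools import combinations

def clustering_cost(A, centers):
    """Sum over A of the squared distance to the closest center."""
    return sum(
        min(sum((a - c) ** 2 for a, c in zip(p, q)) for q in centers)
        for p in A
    )

def best_k_subset(A, S, k):
    """Try every k-subset of the candidates S and keep the cheapest one."""
    return min(combinations(S, k), key=lambda centers: clustering_cost(A, centers))

A = [(0, 0), (0, 1), (10, 10), (10, 11)]
S = [(0, 0), (5, 5), (10, 10)]
print(best_k_subset(A, S, 2))
```

Each subset costs O(nk) distance evaluations, so the total work grows with the number of k-subsets of S.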

I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher dimensionality, it is often useful to also include the dimensionality D in the complexity of the algorithm.
The algorithm mentioned is then indeed O(NDk^2) when considering K-NN instead. That can be improved to O(NDk) by using the Quickselect algorithm on the distances, which allows checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
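As a rough illustration of the selection step, here is a minimal quickselect-style sketch that returns the k items with the smallest key in expected linear time, without sorting the whole list. The function name `k_smallest` and the random pivot choice are assumptions made for this example.

```python
import random

def k_smallest(values, k, key=lambda v: v):
    """Return the k items with the smallest key, in arbitrary order."""
    items = list(values)
    if k <= 0:
        return []
    if k >= len(items):
        return items
    pivot = key(random.choice(items))
    lower = [v for v in items if key(v) < pivot]
    equal = [v for v in items if key(v) == pivot]
    if k <= len(lower):
        # All k answers are strictly below the pivot.
        return k_smallest(lower, k, key)
    if k <= len(lower) + len(equal):
        # Everything below the pivot, topped up with pivot-valued items.
        return lower + equal[: k - len(lower)]
    higher = [v for v in items if key(v) > pivot]
    return lower + equal + k_smallest(higher, k - len(lower) - len(equal), key)

# Two points closest to the origin, by squared distance.
points = [(3, 4), (1, 0), (0, 2), (5, 5)]
print(k_smallest(points, 2, key=lambda p: p[0] ** 2 + p[1] ** 2))
```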
Edit:
There seems to be some confusion about quickselect and whether it can be used. Here is an O(DkNlogN) solution that uses a standard sort, O(NlogN), instead of quickselect, O(N). This might even be faster in practice, and as you can see it is pretty easy to implement in most languages.
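def k_nearest(F, S, k):
    # Points are assumed to be tuples so they can be used as dict keys
    results = {}
    for y in F:
        def distance_squared(x):
            # Custom distance for each y (squared Euclidean assumed here)
            return sum((a - b) ** 2 for a, b in zip(x, y))
        # First k points of S sorted by distance_squared
        results[y] = sorted(S, key=distance_squared)[:k]
    return results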
Update for new visual
# Build up distance sums O(A*N*D)
def best_k_by_distance_sum(F, S, k):
    # Points are assumed to be tuples so they can be used as dict keys
    results = {}
    for y in F:
        def distance_squared(x):
            # Custom distance for each y (squared Euclidean assumed here)
            return sum((a - b) ** 2 for a, b in zip(x, y))
        # Sum of distance squared from y for all points in S
        results[y] = sum(map(distance_squared, S))
    # First k keys sorted by their distance sum O(D*AlogA)
    return sorted(results, key=results.get)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S if they are close enough together. This could reduce S to a much smaller size; as long as S remains about F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weight of the merged points to handle that better, i.e. the squared distance of a point that represents 10 points is multiplied by 10 to account for it acting as 10 points instead of just 1.
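The merging idea can be sketched as follows. This is only one possible way to do it: it assumes "close enough" means "falls into the same grid cell of side `cell`", replaces each cell by its centroid, and keeps the number of merged points as the weight. The names `merge_close_points` and `weighted_cost` are hypothetical.

```python
from collections import defaultdict

def merge_close_points(points, cell):
    """Group points by grid cell; return (centroid, weight) pairs."""
    buckets = defaultdict(list)
    for p in points:
        buckets[tuple(int(c // cell) for c in p)].append(p)
    merged = []
    for group in buckets.values():
        centroid = tuple(sum(coords) / len(group) for coords in zip(*group))
        merged.append((centroid, len(group)))
    return merged

def weighted_cost(merged, center):
    """Weighted sum of squared distances from the merged points to center."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(p, center))
        for p, w in merged
    )

merged = merge_close_points([(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)], cell=1.0)
print(merged)
```

The grid size controls the trade-off: larger cells shrink the set further but distort the distances more.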

Related

Trouble understanding Closest-Pair divide and conquer algorithm

I'm new to coding and today I completed the trivial solution for the Closest-Pair problem in a 2-D space (2 for loops).
However I gave up finding any solution which could do it in O(n log n). Even after researching it, I still don't understand how this can be faster than the trivial method.
What I understand:
-> At first we split the array in 2 halves and sort everything only considering the X coordinates. This can be done in n log n.
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how is this done exactly below O(n^2)?
In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
There is a solution in 1-D which absolutely makes sense to me. After sorting we know, that the distance between two non-adjacent points can't be lower than the distance of at least 2 adjacent ones. However this is not true for 2-D space, since we have an additional Y coordinate which could lead to the lowest distance between two points which are not adjacent on the X axis.
First of all, heed the advice of user #Evg - this answer cannot substitute the comprehensive description and mathematically rigorous analysis of the algorithm.
However, here are some ideas to get the intuition started:
(Recursion structure)
The question states:
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how is this done exactly below O(n^2)? In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
The recursion, however, does not stop at level 1 - assume for the sake of the argument that some O(n log n) algorithm works. Finding closest pairs among N/2 points applying that very algorithm takes O((N/2) log(N/2)) - not O((N/2)^2).
(Consequences of finding a closest pair in one half)
If you have found a closest pair (p, q) in the 'left' half of the point set, this pair's distance sets an upper bound to the width of a corridor around the halving line from which a closer pair (r, s) with r from the left, s from the right half can be drawn. If the closest distance found so far is 'small', it significantly reduces the size of the candidate set. As the points have been ordered by their x coordinate, the algorithm can exploit the information efficiently.
Said corridor may still cover up to the whole set of N points, but if it does, it provides information of the geometry of the point set: the points of each half will basically be aligned along a vertical line. This information can be exploited algorithmically - the most naive way would be to execute the algorithm once again but sorting along y coordinates and halving the point set by a horizontal line. Note that executing any algorithm a constant number of times does not change asymptotic run time expressed by the O(.) notation.
(Finding a close pair with one point from each half)
Consider checking a pair of points (r, s), one point from each half. It is known that the difference in their x and y coordinates, resp., mustn't exceed the minimal distance d found so far. It is known from the recursion that there can be no points r', s' (r' from the left, s' from the right half) closer to r, s, resp., than d. So given some r there cannot be 'many' candidates from the other half.
Imagine a circle of radius d drawn around r. Any point s from the other half that is closer than d must be located within that circle. Let there be a few of them; the distance between each pair of them must still be at least d. The maximum number of points that can be distributed within a circle of radius d such that the distance between each pair of them is at least d is 7 - think of a regular hexagon with side length d and its center coinciding with the circle's center.
So after the recursion, each r from the left half needs to be checked against at most a constant number of points from the other half, which makes the part of the algorithm after the recursion run in O(N).
Note that finding the pairing candidates for a given r is an efficient operation - the points from both halves have been sorted by the same criterion.
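To make the ideas above concrete, here is a compact sketch of the divide-and-conquer scheme. It is a simplified variant, not the textbook-optimal one: the strip is re-sorted by y at every level, which gives O(n log^2 n) rather than O(n log n). The function names are made up for this example, and at least two points are assumed.

```python
import math

def _brute_force(pts):
    """Check all pairs; only used for tiny inputs."""
    best = (math.inf, None, None)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            d = math.dist(pts[i], pts[j])
            if d < best[0]:
                best = (d, pts[i], pts[j])
    return best

def _closest(pts):
    # pts is sorted by x coordinate
    if len(pts) <= 3:
        return _brute_force(pts)
    mid = len(pts) // 2
    mid_x = pts[mid][0]
    # Recursion: best pair entirely inside the left or the right half
    best = min(_closest(pts[:mid]), _closest(pts[mid:]), key=lambda t: t[0])
    # Corridor around the halving line, ordered by y
    strip = sorted((p for p in pts if abs(p[0] - mid_x) < best[0]),
                   key=lambda p: p[1])
    for i in range(len(strip)):
        j = i + 1
        # Only a constant number of strip points lie within best[0] in y
        while j < len(strip) and strip[j][1] - strip[i][1] < best[0]:
            d = math.dist(strip[i], strip[j])
            if d < best[0]:
                best = (d, strip[i], strip[j])
            j += 1
    return best

def closest_pair(points):
    """Return (distance, p, q) for the closest pair of points."""
    return _closest(sorted(points))

print(closest_pair([(0, 0), (5, 5), (1, 0), (9, 9), (5, 6.5), (3, 3)]))
```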

Find a region with maximum sum of top-K points

My problem is: we have N points in a 2D space, each point has a positive weight. Given a query consisting of two real numbers a,b and one integer k, find the position of a rectangle of size a x b, with edges are parallel to axes, so that the sum of weights of top-k points, i.e. k points with highest weights, covered by the rectangle is maximized?
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight sum. Complexity: O(N log N).
top-K query for orthogonal ranges: find top-k points in a given rectangle. Complexity: O(log(N)^2+k).
You can reduce this problem into finding two points in the rectangle: rightmost and topmost. So effectively you can select every pair of points and calculate the top-k weight (which according to you is O(log(N)^2+k)). Complexity: O(N^2*(log(N)^2+k)).
Now, given two points, they might not form a valid pair: they might be too far or one point may be right and top of the other point. So, in reality, this will be much faster.
My guess is the optimal solution will be a variation of maximum region sum problem. Could you point to a link describing that algorithm?
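A naive sketch of the pair-based idea could look like this. It assumes points are given as (x, y, weight) tuples, takes the right edge from one point's x and the top edge from another point's y, and replaces the O(log(N)^2+k) top-k range query with a plain linear scan plus `heapq.nlargest`, so it is slower than the bound quoted above. The name `best_rectangle` is hypothetical.

```python
import heapq

def best_rectangle(points, a, b, k):
    """Return (best_sum, (x_left, y_bottom)) for an a x b rectangle."""
    best = (0, None)
    for right, _, _ in points:          # candidate right edge
        for _, top, _ in points:        # candidate top edge
            inside = [
                w for x, y, w in points
                if right - a <= x <= right and top - b <= y <= top
            ]
            total = sum(heapq.nlargest(k, inside))
            if total > best[0]:
                best = (total, (right - a, top - b))
    return best

pts = [(0, 0, 5), (1, 1, 4), (10, 10, 7), (1.5, 0.5, 1)]
print(best_rectangle(pts, a=2, b=2, k=2))
```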
A non-optimal answer is the following:
1. Generate all the possible k-plets of points (there are N × (N-1) × … × (N-k+1) of them, so this is O(N^k) and can be done via recursion).
2. Filter this list down by eliminating all k-plets which are not enclosed in an a×b rectangle: this is O(k N^k) at worst.
3. Find the k-plet which has the maximum weight: this is O(k N^(k-1)) at worst.
Thus, this algorithm is O(k N^k).
Improving the algorithm
Step 2 can be integrated in step 1 by stopping the branch recursion when a set of points is already too large. This does not change the need to scan the elements at least once, but it can reduce the number significantly: think of cases where there are no solutions because all points are separated by more than the size of the rectangle, which can be found in O(N^2).
Also, the permutation generator in step 1 can be made to return the points in order by x or y coordinate, by pre-sorting the point array correspondingly. This is useful because it lets us discard many more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, supposing we are discarding a branch because it contains a point whose y coordinate is outside the max rectangle, we can also discard all the next sibling branches because their y coordinate will be greater than or equal to the current one, which is already out of bounds.
This adds O(n log n) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate should be chosen corresponding to the minimum rectangle side, divided by the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate of all points.
Finally, if all the points lie within an a×b rectangle, then the algorithm performs as O(k N^k) anyway. If this is a concrete possibility, it should be checked with an easy O(N) loop, and if so it is enough to return the points with the top k weights, which is also O(N).

Sorting multidimensional vectors

If I have a set of k vectors of n dimensions, how can I sort these such that the distance between each consecutive pair of vectors is the minimal possible? The distance can be calculated by using the Euclidean distance, but how is the "sorting" then implemented in an effective manner?
I'm thinking one approach would be to select a vector at random, calculate the distance to all other vectors, pick the vector that minimizes the distance as the next vector and repeat until all vectors have been "sorted". However, this greedy search would probably render different results depending on which vector I start with.
Any ideas on how to do this?
If you really want just 'that the distance between each consecutive pair of vectors is the minimal possible' without randomness, you can first find the 2 closest points (by an O(n log n) algorithm like this), say p and q. Then search for the closest point to p (say r) and to q (say s), and compare the distances (p,r) and (q,s): if the first is smaller, start with q,p,r and continue with your greedy algorithm (otherwise, obviously, start with p,q,s).
However, if your goal is actually to arrange points so that the sum of all paired distances is smallest, you should choose any approximate solution for Travelling salesman problem. Note this trick in order to reduce your task to TSP.
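The greedy ordering discussed in the question can be sketched as below. This is the plain nearest-neighbour heuristic, not an optimal TSP solution; the function name `greedy_order` and the `start` parameter are assumptions for illustration. As the question notes, the result depends on the starting vector.

```python
import math

def greedy_order(vectors, start=0):
    """Repeatedly append the unused vector closest to the last one chosen."""
    remaining = list(vectors)
    order = [remaining.pop(start)]
    while remaining:
        last = order[-1]
        nearest = min(range(len(remaining)),
                      key=lambda j: math.dist(last, remaining[j]))
        order.append(remaining.pop(nearest))
    return order

print(greedy_order([(0, 0), (10, 0), (1, 0), (5, 0)]))
```

Each step scans all remaining vectors, so the whole ordering takes O(k^2) distance computations for k vectors.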

algorithm to find a point among n points in plane to minimize the sum of distances

I have an algorithm problem here. It is different from the normal Fermat Point problem.
Given a set of n points in the plane, I need to find which one can minimize the sum of distances to the rest of n-1 points.
Is there any algorithm you know of run less than O(n^2)?
Thank you.
One solution is to assume median is close to the mean and for a subset of points close to the mean exhaustively calculate sum of distances. You can choose klog(n) points closest to the mean, where k is an arbitrarily chosen constant (complexity nlog(n)).
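A minimal sketch of this heuristic, under the stated assumption that the best point lies near the mean: compute the mean, keep roughly c*log(n) points closest to it, and evaluate the exact sum of distances only for those. The name `approx_best_point` and the constant `c` (playing the role of k above) are made up here, and the result is not guaranteed to be the true optimum.

```python
import math

def approx_best_point(points, c=3):
    """Heuristic: exact sum-of-distances check on points near the mean."""
    n = len(points)
    dim = len(points[0])
    mean = tuple(sum(p[i] for p in points) / n for i in range(dim))
    # Number of candidates, about c * log(n), at least 1 and at most n
    m = min(n, max(1, math.ceil(c * math.log(n + 1))))
    candidates = sorted(points, key=lambda p: math.dist(p, mean))[:m]
    return min(candidates,
               key=lambda p: sum(math.dist(p, q) for q in points))

print(approx_best_point([(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]))
```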
Another possible solution is Delaunay triangulation. This triangulation is possible in O(nlogn) time. The triangulation results in a graph with one vertex for each point and edges that satisfy the Delaunay condition.
Once you have the triangulation, you can start at any point and compare sum-of-distances of that point to its neighbors and keep moving iteratively. You can stop when the current point has the minimum sum-of-distance compared to its neighbors. Intuitively, this will halt at the global optimal point.
I think the underlying assumption here is that you have a dataset of points which you can easily bound, as many algorithms which would be "good enough" in practice may not be rigorous enough for theory and/or may not scale well for arbitrarily large solutions.
A very simple solution which is probably "good enough" is to sort the coordinates on the Y ordinate, then do a stable sort on the X ordinate.
Take the rectangle defined by the min(X,Y) and max(X,Y) values, complexity O(1) as the values will be at known locations in the sorted dataset.
Now, working from the center of your sorted dataset, find coordinate values as close as possible to {Xctr = Xmin + (Xmax - Xmin) / 2, Yctr = Ymin + (Ymax - Ymin) / 2} -- complexity O(N) bounded by your minimization criteria, distance being the familiar radius from {Xctr,Yctr}.
The worst case complexity would be comparing your centroid to every other point, but once you get away from the middle points you will not be improving the global optimal and should terminate the search.

Finding the farthest point in one set from another set

My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points (in N-space. 3-space for the example case of RGB colorspace, while a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is the farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be 8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smallest set. (Add one element from the smallest to larger one and run the algorithm to find its nearest neighbor.)
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of smaller set
M = size of the larger set
N * O(log M + 1) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N*log M).
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
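A small kd-tree sketch of this approach is shown below. It is a simplified illustration: the tree is built by re-sorting at each level (slower than the best possible build), nodes are plain tuples, and only the running maximum is kept instead of a heap. All function names are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Return a node (point, axis, left, right) or None for no points."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=(math.inf, None)):
    """Return (distance, point) of the tree point closest to target."""
    if node is None:
        return best
    point, axis, left, right = node
    d = math.dist(point, target)
    if d < best[0]:
        best = (d, point)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Visit the other side only if the splitting plane is close enough
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

def farthest_from_nearest(A, B):
    """Point of A whose nearest neighbour in B is farthest away."""
    tree = build_kdtree(list(B))
    return max(((nearest(tree, a)[0], a) for a in A), key=lambda t: t[0])

A = [(2,), (4,), (6,), (8,)]
B = [(1,), (3,), (5,)]
print(farthest_from_nearest(A, B))
```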
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's nlog(n) to build the tree, and log(n) for one search so the whole thing should run in nlog(n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge than to the nearest neighbour in the group.
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.
In the 1-space case you could do something like this (pseudocode).
Use a structure like this:
Struct Item {
    int value
    int setid
}
(1) Max Distance = 0
(2) Read all the sets into Item structures
(3) Create an Array of pointers to all the Items
(4) Sort the array of pointers by Item->value field of the structure
(5) Walk the array from beginning to end, checking if the Item->setid is different from the previous Item->setid
    if (SetIDs are different)
        check if this distance is greater than Max Distance; if so, set Max Distance to this distance
Return the max distance.
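For the 1-space case, a related sketch that follows the question's definition directly (the member of A farthest from its nearest neighbour in B) is shown below. Note that it swaps the adjacent-item walk above for a binary search into the sorted set B; the name `farthest_1d` is made up, and B is assumed to be non-empty.

```python
from bisect import bisect_left

def farthest_1d(A, B):
    """Return the value in A that is farthest from its nearest value in B."""
    b = sorted(B)

    def nearest_distance(a):
        i = bisect_left(b, a)
        # The nearest neighbour is just before or at the insertion point
        neighbours = b[max(i - 1, 0): i + 1]
        return min(abs(a - c) for c in neighbours)

    return max(A, key=nearest_distance)

print(farthest_1d([2, 4, 6, 8], [1, 3, 5]))
```

Sorting B takes O(M log M) and each lookup takes O(log M), so the whole thing runs in O((N + M) log M) for N values in A and M values in B.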
