If I have a set of k vectors of n dimensions, how can I sort these such that the distance between each consecutive pair of vectors is the minimal possible? The distance can be calculated using the Euclidean distance, but how can the "sorting" then be implemented efficiently?
I'm thinking one approach would be to select a vector at random, calculate the distance to all other vectors, pick the vector that minimizes the distance as the next vector, and repeat until all vectors have been "sorted". However, this greedy search would probably yield different results depending on which vector I start with.
Any ideas on how to do this?
If you really want just 'that the distance between each consecutive pair of vectors is the minimal possible' without randomness, you can first find the two closest points, say p and q (with an O(n log n) closest-pair algorithm like this). Then find the closest remaining point to p (say r) and the closest remaining point to q (say s), and compare the distances (p,r) and (q,s). If the first is smaller, start with q,p,r and continue with your greedy algorithm; otherwise start with p,q,s.
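A minimal sketch of that deterministic start followed by the greedy step, assuming the vectors are plain tuples of floats. The function name `greedy_order` is made up for illustration, and the closest pair is found by brute force in O(k^2) rather than with a faster closest-pair routine:

```python
import math
from itertools import combinations

def greedy_order(vectors):
    """Return indices ordered by the closest-pair start plus greedy nearest neighbour."""
    k = len(vectors)
    if k < 3:
        return list(range(k))

    def d(i, j):
        return math.dist(vectors[i], vectors[j])

    # Brute-force closest pair (p, q); a faster routine could be swapped in here.
    p, q = min(combinations(range(k), 2), key=lambda ij: d(*ij))
    others = [i for i in range(k) if i not in (p, q)]
    r = min(others, key=lambda i: d(p, i))  # nearest remaining point to p
    s = min(others, key=lambda i: d(q, i))  # nearest remaining point to q
    # Extend the chain from whichever end has the closer neighbour.
    order = [q, p, r] if d(p, r) < d(q, s) else [p, q, s]
    remaining = set(range(k)) - set(order)
    while remaining:
        nxt = min(remaining, key=lambda i: d(order[-1], i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(greedy_order([(0, 0), (1, 0), (3, 0), (10, 0)]))
```

This only removes the dependence on a random start; it is still a greedy heuristic and does not guarantee the shortest overall chain.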
However, if your goal is actually to arrange the points so that the sum of the distances between consecutive points is smallest, you should use any approximate solution to the travelling salesman problem (TSP). Note this trick in order to reduce your task to TSP.
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized. One way to do it would be to check the cost of every possible choice of k points in S and take the minimum, but that would take O(k^k*n) time. Is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use it for a coreset construction (coreset = data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that in the optimal clustering each cluster has omega(n/k) points, we can quickly create a list of O(k) candidates that contains a 3-approximation for k-means. So I was wondering whether we can find those k points, or a constant approximation for their cost, in time faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
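To make the objective concrete, here is a small sketch of the exhaustive baseline mentioned in the question, assuming points are tuples of numbers. The helper names (`sq_dist`, `cost`, `best_k_subset`) are made up for illustration, and this is the slow reference method, not the faster algorithm being asked for:

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(A, centers):
    """Sum over A of the squared distance to the nearest chosen center."""
    return sum(min(sq_dist(a, c) for c in centers) for a in A)

def best_k_subset(A, S, k):
    """Try every k-subset of S and keep the one with the lowest cost."""
    return min(combinations(S, k), key=lambda centers: cost(A, centers))

A = [(0, 0), (0, 1), (10, 0), (10, 1)]
S = [(0, 0.5), (5, 0), (10, 0.5)]
print(best_k_subset(A, S, 2))
```

It evaluates C(|S|, k) subsets at O(nk) each, so it is only useful as a correctness check on small inputs.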
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher-dimensional data, it is often useful to include the dimensionality, D, in the complexity as well.
Treated as K-NN, the algorithm mentioned is indeed O(NDk^2). That can be improved to O(NDk) by using the Quickselect algorithm on the distances, which allows checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
Edit:
There seems to be some confusion about quickselect and whether it can be used. Here is an O(DkNlogN) solution that uses a standard sort, O(NlogN), instead of quickselect, O(N). It might even be faster in practice and, as you can see, it's pretty easy to implement in most languages.
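def k_nearest(F, S, k, distance):
    """For each y in F, return the k points of S closest to y (points must be hashable, e.g. tuples)."""
    results = {}
    for y in F:
        def distanceSquared(x):
            return distance(x, y)  # Custom distance for each y
        # First k points of S sorted by distanceSquared
        results[y] = sorted(S, key=distanceSquared)[:k]
    return results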
Update for new visual
# Assumes F, S, k and distance(x, y) are defined as in the snippet above
# Build up distance sums O(A*N*D)
results = {}
for y in F:
    def distanceSquared(x):
        return distance(x, y)  # Custom distance for each y
    # Sum of distance squared from y for all points in S
    results[y] = sum(map(distanceSquared, S))

def results_key_value(key):
    return results[key]

# First k points of F sorted by their distance sum O(D*AlogA)
best = sorted(results, key=results_key_value)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S if they are close enough together. This could reduce S to a much smaller size; as long as S remains about F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weight of the points to handle that better, i.e. the squared distance of a point that represents 10 points is multiplied by 10 to account for it acting as 10 points instead of just 1.
My problem is: given N points in a plane and a number R, list/enumerate all subsets of points such that the points in each subset are enclosed by a circle of radius R. Any two subsets should be different, and neither should cover the other.
Efficiency may not be important, but the algorithm should not be too slow.
As a special case, can we find the K subsets with the most points? An approximation algorithm is acceptable.
Thanks,
Edit: It seems that the statement is not clear to understand. My bad!
So I restate my question as follows: given N points and a circle with fixed radius R, use the circle to scan the whole space. At any time, the circle will cover a subset of the points. The goal is to list all the possible subsets of points that can be covered by such an R-radius circle. No listed subset may be contained in another listed subset.
I am not sure I get what you mean by 'not covered'. If you drop this, what you are looking for is exactly a Čech complex, whose complexity is high: you won't have an efficient algorithm unless you have a condition on the sampling (the sampling should be sparse enough and R not too big, otherwise you could have 2^n subsets, with n your number of points). You have to enumerate all subsets and check whether their minimal enclosing ball radius is lower than R. You can reduce the search to all subsets whose diameter is lower than R (i.e. all pairwise distances lower than R), which may be sufficient in your case.
If 'not covered' for two subsets means that one is not included in the other, you can have many different decompositions. One of interest is the alpha-complex, as it can be computed efficiently in O(nlogn) in dimension 2-3 (I would suggest using CGAL to compute it; you can also see what it means with pictures). If your points are high-dimensional, then you will probably end up computing a Čech complex.
Without loss of generality, we can assume that the enclosing circles considered pass through at least two points (ignoring the trivial cases of no points or one point and assuming that your motivation is maximizing density, so that you don't care if non-maximal subsets are omitted). Build a proximity structure (kd-tree, cover tree, etc.) on the input points. For each input point p, use the structure to find all points q such that d(p, q) ≤ 2R. For each point q, there are one or two circles that contain p and q on their boundary. Find their centers by solving some quadratic equations and then look among the other choices of q to determine the subset.
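A rough sketch of that construction in the plane, assuming points are (x, y) tuples. The name `maximal_disk_subsets` and the tolerance `eps` are made up for illustration. For brevity it scans all pairs and all points by brute force (O(n^3)) instead of querying a kd-tree or cover tree, and it adds the single-point subsets explicitly so isolated points are not lost:

```python
import math
from itertools import combinations

def maximal_disk_subsets(points, R, eps=1e-9):
    """Maximal index sets that fit inside one disk of radius R."""
    candidates = {frozenset([i]) for i in range(len(points))}  # single-point case
    for i, j in combinations(range(len(points)), 2):
        (px, py), (qx, qy) = points[i], points[j]
        d = math.hypot(qx - px, qy - py)
        if d == 0 or d > 2 * R:
            continue  # no radius-R circle has both points on its boundary
        mx, my = (px + qx) / 2, (py + qy) / 2            # midpoint of pq
        h = math.sqrt(max(R * R - (d / 2) ** 2, 0.0))    # midpoint-to-center offset
        ux, uy = -(qy - py) / d, (qx - px) / d           # unit normal to pq
        for sign in (1, -1):                             # the one or two centers
            cx, cy = mx + sign * h * ux, my + sign * h * uy
            inside = frozenset(
                k for k, (x, y) in enumerate(points)
                if math.hypot(x - cx, y - cy) <= R + eps
            )
            candidates.add(inside)
    # Drop every subset that is strictly contained in another candidate.
    return [s for s in candidates if not any(s < t for t in candidates)]

print(maximal_disk_subsets([(0, 0), (1, 0), (5, 0)], 1.0))
```

The small `eps` keeps the two defining points inside their own circle despite floating-point rounding; how large it should be depends on the scale of the data.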
Let X be a collection of n points in some moderate-dimensional space - for now, say R^5. Let S be the convex hull of X, let p be a point in S, and let v be any direction. Finally, let L = {p + lambda v : lambda a real number} be the line passing through p in direction v.
I am interested in finding a reasonably efficient algorithm for computing the intersection of S with L. I'd also be interested in hearing if it is known that no such algorithm exists! Note that this intersection can be represented by the (extreme) two points of intersection of L with the boundary of S. I'm particularly interested in finding an algorithm that behaves well when n is large.
I should say that it is easy to do this very efficiently in two dimensions. In that case, one can order the points of X in 'clockwise order' as seen from p, and then do binary search. So, the initial ordering takes O(n log(n)) steps and then further lookups take O(log(n)) steps. I don't see what the analogous algorithm should be in higher dimensions. Part of the problem is that a convex body in two dimensions has n vertices and n faces, while a convex body in 3 or higher dimensions can have n vertices but many, many more than n faces.
You can write a simple linear program for this. You want to minimise/maximise lambda subject to the constraint that p + lambda v lies in the convex hull of your input points. "Lies in the convex hull of" is coordinatewise equality between two points, one of which is a nonnegative weighted average of your input points such that the weights sum to 1.
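Written out, the linear program looks like this. The names x_1, ..., x_n for the input points and w_1, ..., w_n for the weights are my own notation, not from the question:

```latex
\begin{aligned}
\text{minimise (resp. maximise)} \quad & \lambda \\
\text{subject to} \quad & \sum_{i=1}^{n} w_i x_i = p + \lambda v, \\
& \sum_{i=1}^{n} w_i = 1, \\
& w_i \ge 0, \quad i = 1, \dots, n.
\end{aligned}
```

In d dimensions this has n + 1 variables and d + 1 equality constraints; solving it once for the minimum and once for the maximum gives the two extreme points of the intersection.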
As a practical matter, it may be useful to start with a handful of randomly chosen points, get a convex combination or a certificate of infeasibility, then interpret the certificate of infeasibility as a linear inequality and find the input point that most violates it. If you're using a practical solver, this means you want to formulate the dual, switch a bunch of presolve things off, and run the above essentially as a cutting plane method using certificates of unboundedness instead. It is likely that, unless you have pathological data, you will only need to tell the LP solver about a small handful of your input points.
To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the coordinate in which the differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinates (in this dimension) are within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding the n nearest neighbors of point x is as follows:
Identify the bin kx in which point x lies (its projection, to be precise).
Compute the distances between x and all the points in bin kx.
Sort the computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
This algorithm does not work in all cases. When can the algorithm fail to compute the nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
Where KNN fails:
If the data is a jumble of all the different classes, then KNN will fail because it will try to find the k nearest neighbours, but all the points are random.
Outlier points:
Let's say you have two clusters of different classes. If you then have an outlier point as the query, KNN will assign one of the classes even though the query point is far away from both clusters.
This is failing because (any of) the k nearest neighbors of x could be in a different bin than x.
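One possible modification, sketched under my own assumptions (equal-width bins, points as tuples of floats, n >= 1; the function names are made up): keep widening the range of searched bins until the n-th best distance found so far is no larger than the distance from x to the nearest unsearched bin boundary. Any point in an unsearched bin is at least that far from x along the binned coordinate, so it cannot displace the current result.

```python
import heapq
import math

def build_bins(points, dim, num_bins):
    """Split points into equal-width bins along coordinate `dim`."""
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all values coincide
    bins = [[] for _ in range(num_bins)]
    for p in points:
        idx = min(int((p[dim] - lo) / width), num_bins - 1)
        bins[idx].append(p)
    return bins, lo, width

def nearest_neighbors(x, n, bins, lo, width, dim):
    """n nearest neighbours of x, widening the searched bin range as needed."""
    num_bins = len(bins)
    home = min(max(int((x[dim] - lo) / width), 0), num_bins - 1)
    left = right = home
    candidates = list(bins[home])
    while True:
        best = heapq.nsmallest(n, candidates, key=lambda p: math.dist(p, x))
        radius = math.dist(best[-1], x) if len(best) == n else math.inf
        # Distance along `dim` from x to the unsearched region on each side.
        gap_left = x[dim] - (lo + left * width) if left > 0 else math.inf
        gap_right = lo + (right + 1) * width - x[dim] if right < num_bins - 1 else math.inf
        if radius <= min(gap_left, gap_right):
            return best  # nothing unsearched can be closer
        if gap_left <= gap_right:
            left -= 1
            candidates.extend(bins[left])
        else:
            right += 1
            candidates.extend(bins[right])

pts = [(0.0, 0.0), (0.9, 5.0), (1.1, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
bins, lo, width = build_bins(pts, dim=0, num_bins=4)
print(nearest_neighbors((0.95, 0.0), 1, bins, lo, width, dim=0))
```

In this example the query sits near the right edge of its bin, so a single-bin search would return (0.0, 0.0) even though (1.1, 0.0) in the next bin is closer. In the worst case the loop ends up visiting every bin, so this trades speed for exactness.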
What do you mean by "not working"? You do understand that what you are doing is only an approximate method.
Try normalising the data and then choosing the dimension, else scatter makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but any combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), to identify a discriminative dimension.
I have an algorithm problem here. It is different from the normal Fermat Point problem.
Given a set of n points in the plane, I need to find which one can minimize the sum of distances to the rest of n-1 points.
Do you know of any algorithm that runs in less than O(n^2) time?
Thank you.
One solution is to assume that the median is close to the mean and, for a subset of points close to the mean, exhaustively calculate the sum of distances. You can choose the k log(n) points closest to the mean, where k is an arbitrarily chosen constant (complexity O(n log(n))).
Another possible solution is Delaunay triangulation. This triangulation can be computed in O(n log n) time. It results in a graph with one vertex for each point and edges that satisfy the Delaunay condition.
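A short sketch of the first idea, assuming points are tuples of numbers. The function name and the default constant are made up for illustration, and this is a heuristic that can miss the true minimiser when the mean is pulled far from the median:

```python
import math

def approx_best_point(points, k=4):
    """Check only the ~k*log(n) points nearest the mean; return the best of them."""
    n = len(points)
    dims = len(points[0])
    mean = tuple(sum(p[i] for p in points) / n for i in range(dims))
    m = min(n, max(1, math.ceil(k * math.log(n + 1))))  # number of candidates
    candidates = sorted(points, key=lambda p: math.dist(p, mean))[:m]

    def total(p):
        return sum(math.dist(p, q) for q in points)  # O(n) per candidate

    return min(candidates, key=total)

pts = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]
print(approx_best_point(pts))
```

Sorting by distance to the mean costs O(n log n), and evaluating about k log(n) candidates at O(n) each keeps the total at O(n log n).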
Once you have the triangulation, you can start at any point and compare sum-of-distances of that point to its neighbors and keep moving iteratively. You can stop when the current point has the minimum sum-of-distance compared to its neighbors. Intuitively, this will halt at the global optimal point.
I think the underlying assumption here is that you have a dataset of points which you can easily bound, as many algorithms which would be "good enough" in practice may not be rigorous enough for theory and/or may not scale well for arbitrarily large solutions.
A very simple solution which is probably "good enough" is to sort the coordinates on the Y ordinate, then do a stable sort on the X ordinate.
Take the rectangle defined by the min(X,Y) and max(X,Y) values, complexity O(1) as the values will be at known locations in the sorted dataset.
Now, working from the center of your sorted dataset, find coordinate values as close as possible to {Xctr = Xmin + (Xmax - Xmin) / 2, Yctr = Ymin + (Ymax - Ymin) / 2} -- complexity O(N) bounded by your minimization criteria, distance being the familiar radius from {Xctr,Yctr}.
The worst-case complexity would be comparing your centroid to every other point, but once you get away from the middle points you will no longer be improving on the global optimum and should terminate the search.