Calculate the maximum distance between vectors in an array - algorithm

Assume we have an array that holds n vectors. We want to calculate the maximum Euclidean distance between those vectors.
The easiest (naive?) approach would be to iterate over the array and, for each vector, calculate its distance to all subsequent vectors, then take the maximum.
This algorithm, however, would grow as (n-1)! with respect to the size of the array.
Is there any other more efficient approach to this problem?
Thanks.

Your computation of the naive algorithm's complexity is wonky: it should be O(n(n-1)/2), which reduces to O(n^2). Computing the distance between two vectors is O(k), where k is the number of elements in each vector; this still gives a complexity well below O(n!).
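For concreteness, a minimal sketch of the naive pairwise scan (not from the original posts; names are illustrative):

import math

def max_pairwise_distance(vectors):
    # Brute force: check all n(n-1)/2 pairs, O(n^2 * k) time.
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = math.dist(vectors[i], vectors[j])  # Euclidean distance, O(k)
            best = max(best, d)
    return best

print(max_pairwise_distance([(0, 0), (3, 4), (1, 1)]))  # 5.0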

Complexity is O(N^2 * K) for the brute-force algorithm (K is the number of elements in each vector). But we can do better by using the fact that in Euclidean space, for points A, B and C:
|AB| + |AC| >= |BC|
The algorithm should be something like this: if the maximum distance found so far is MAX, and for a pair A, B there is a point C such that the distances |AC| and |CB| have already been computed and MAX > |AC| + |CB|, then we can skip the calculation of |AB|, because by the triangle inequality |AB| <= |AC| + |CB| < MAX.
It is difficult to tell the complexity of this algorithm, but my gut feeling tells me it is not far from O(N*log(N)*K).
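The answer leaves the bookkeeping open; here is a hedged sketch of one way it could look in Python, caching every computed distance and scanning for a witness C before each new pair (the cache layout and the witness scan are my own choices, not the answerer's):

import math

def max_distance_pruned(points):
    # Skip a pair (a, b) when some c with both d(a,c) and d(c,b) cached
    # satisfies d(a,c) + d(c,b) <= best: by the triangle inequality
    # d(a,b) <= d(a,c) + d(c,b) <= best, so (a, b) cannot improve on best.
    n = len(points)
    cache = {}  # (i, j) with i < j -> computed distance

    def cached(i, j):
        return cache.get((min(i, j), max(i, j)))

    best = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            if any(cached(a, c) is not None and cached(c, b) is not None
                   and cached(a, c) + cached(c, b) <= best for c in range(n)):
                continue  # pruned without computing d(a, b)
            d = math.dist(points[a], points[b])
            cache[(a, b)] = d
            best = max(best, d)
    return best

Note that the witness scan itself costs up to O(N) per pair, so this only pays off when K is large enough that distance evaluations dominate, which is consistent with the answer's uncertainty about the overall complexity.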

This question has been here before; see How to find two most distant points?
And the answer is: it can be done in less than O(n^2) in Euclidean space. See also http://mukeshiiitm.wordpress.com/2008/05/27/find-the-farthest-pair-of-points/

So suppose you have a pair of points A and B. Consider the hypersphere that has A and B at the north and south pole respectively. Could any point C contained in the hypersphere be farther from A than B is? (No: any interior point sees the diameter AB under an angle of at least 90 degrees, so |AC| <= |AB|.)
Further suppose we partition the point set into sqrt(N) hyperboxes with sqrt(N) points each. For any pair of hyperboxes we can compute, in O(k) time, an upper bound on the distance between any two points contained within them, simply by calculating the distance between their farthest corners. If we already have a candidate better than this bound, we can discard all pairs of points from those two hyperboxes.
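A sketch of that pruning scheme, assuming points are tuples and that grouping them in sorted order along the first coordinate is an acceptable way to form the sqrt(N) hyperboxes (both choices are mine):

import math

def max_distance_boxes(points):
    # Partition into ~sqrt(n) boxes, bound each box pair by the distance
    # between their farthest corners, and only compare actual points from
    # pairs whose bound still beats the best distance found so far.
    n = len(points)
    m = max(1, math.isqrt(n))
    pts = sorted(points)  # crude spatial grouping along the first coordinate
    boxes = [pts[i:i + m] for i in range(0, n, m)]

    def corners(box):  # axis-aligned bounding box of a group of points
        return [min(c) for c in zip(*box)], [max(c) for c in zip(*box)]

    def upper(b1, b2):  # max possible distance between the two boxes
        (lo1, hi1), (lo2, hi2) = corners(b1), corners(b2)
        return math.sqrt(sum(max(h1 - l2, h2 - l1) ** 2
                             for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2)))

    pairs = [(b1, b2) for i, b1 in enumerate(boxes) for b2 in boxes[i:]]
    pairs.sort(key=lambda p: upper(*p), reverse=True)  # most promising first

    best = 0.0
    for b1, b2 in pairs:
        if upper(b1, b2) <= best:
            break  # no remaining pair of boxes can contain a better pair
        for p in b1:
            for q in b2:
                best = max(best, math.dist(p, q))
    return best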

Related

Find a subset of k points most distant from each other

I have a set of N points (specifically, binary strings) and for each pair of them I have a discrete metric (the Hamming distance), so that given two points i and j, Dij is the distance between the i-th and the j-th point.
I want to find a subset of k elements (with k < N, of course) such that the distances between these k points are as large as possible.
In other words, I want to find a sort of set of "border points" that covers the maximum area in the space of the points.
If k = 2 the answer is trivial: I can search the distance matrix for the two most distant elements, and those are the two points. But how can I generalise this when k > 2?
Any suggestions? Is it an NP-hard problem?
Thanks in advance for any answers.
One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".
Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.
This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.
To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.
I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4, so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the complement of the other (0s where the other has 1s and vice versa).
Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.
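A sketch of that construction, assuming a 0/1 adjacency matrix adj; the array layout, and using exact copies/complements for the owning pair, are my own simplifications:

import numpy as np

def embed_adjacency(adj, k):
    # One k-bit field per vertex pair. In a pair's own field the two owners
    # are complements (field distance k) if connected, copies (distance 0)
    # if not; all other vertices get random bits (expected distance k/2).
    # So connected pairs land near (n(n-1)/2 - 1)k/2 + k apart and
    # unconnected pairs near (n(n-1)/2 - 1)k/2, matching the text above.
    n = len(adj)
    rng = np.random.default_rng(0)
    fields = []
    for p in range(n):
        for q in range(p + 1, n):
            field = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
            field[q] = (1 - field[p]) if adj[p][q] else field[p]
            fields.append(field)
    return np.concatenate(fields, axis=1)  # n bit-vectors of length k*n(n-1)/2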
@mcdowella, I think I probably didn't explain my problem very well.
In my problem I have binary strings, and for each pair of them I can compute the Hamming distance.
In this way I get a distance matrix D that has a finite value in each element D(i,j).
I can view this distance matrix as a graph: each row is a vertex, and entry (i,j) holds the weight of the edge connecting vertex Vi to vertex Vj.
For the reason I explained, this graph is complete, i.e. it is a clique over all the points.
Consequently, if I pick k vertices at random from the original graph, I obtain a subgraph that is also complete.
From all the possible subgraphs of order k I want to choose the best one.
What is the best one? A graph in which the distances between the vertices are not only as large as possible but also as uniform as possible.
Suppose that I have two vertices v1 and v2 in my subgraph and that their distance is 25, and I have three other vertices v3, v4, v5 such that
d(v1, v3) = 24, d(v1, v4) = 7, d(v2, v3) = 5, d(v2, v4) = 22, d(v1, v5) = 14, d(v2, v5) = 14
With these distances, v3 is suitably far from v1 but much too near to v2, and v4 is the opposite: far from v2 but too near to v1.
I would instead prefer to add vertex v5 to my subgraph, because it is distant from the other two in a more uniform way.
I hope my problem is clear now.
Do you think your formulation is still the right one?
I have claimed that the problem of finding k points such that the minimum distance between them, or the sum of the distances between them, is as large as possible is NP-complete, so (unless P = NP) there is no polynomial time exact algorithm. This suggests that we should look for some sort of heuristic solution, so here is one, based on an idea for clustering. I will describe it for maximising the total distance. I think it can be made to work for maximising the minimum distance as well, and perhaps for other goals.
Pick k arbitrary points and note down, for each point, the sum of the distances to the other points. For each other point in the data, look at the sum of the distances to the k chosen points and see if replacing any of the chosen points with that point would increase the sum. If so, replace whichever point increases the sum most and continue. Keep trying until none of the points can be used to increase the sum. This is only a local optimum, so repeat with another set of k arbitrary/random points in the hope of finding a better one until you get fed up.
This inherits from its clustering forebear the following property, which might at least be useful for testing: if the points can be divided into k classes such that the distance between any two points in the same class is always less than the distance between any two points in different classes then, when you have found k points where no local improvement is possible, these k points should all be from different classes (because if not, swapping out one of a pair of points from the same class would increase the sum of distances between them).
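A sketch of this local search for the max-sum objective (the random restarts are left to the caller; all names are mine):

import math
import random

def max_sum_subset(points, k, dist=math.dist):
    # Start from k random points; repeatedly make the swap between a chosen
    # and a non-chosen point that most increases the sum of pairwise
    # distances among the chosen set, until no swap improves it.
    n = len(points)
    chosen = set(random.sample(range(n), k))

    def gain(out, inn):  # change in the total pairwise distance for a swap
        return sum(dist(points[inn], points[c]) - dist(points[out], points[c])
                   for c in chosen if c != out)

    while True:
        best_gain, best_swap = 0.0, None
        for inn in set(range(n)) - chosen:
            for out in chosen:
                g = gain(out, inn)
                if g > best_gain:
                    best_gain, best_swap = g, (out, inn)
        if best_swap is None:
            return sorted(chosen)  # local optimum; restart to look for better
        out, inn = best_swap
        chosen.remove(out)
        chosen.add(inn)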
This problem is known as the MaxMin Diversity Problem (MMDP). It is known to be NP-hard. However, there are algorithms for giving good approximate solutions in reasonable time, such as this one.
I'm answering this question years after it was asked because I was looking for algorithms to solve the same problem, and had trouble even finding out what to call it.

Is it possible to find the closest point to all points in subquadratic time?

An algorithmic question.
Input:
A list of data points X
A metric function for data points dist(x,y) that takes O(1) time to evaluate and obeys the triangle inequality
Is there an algorithm that can return a vector of data points Y such that Y[i] is the closest point in X to X[i] in subquadratic time?
Obviously this is possible in O(n^2), because you could just directly check every pair. I'm wondering if it might be possible to leverage the triangle inequality to improve on this. I would also be interested in approximate algorithms with provable bounds (e.g. Y[i] is no more than (1 + log(n)) times as far from X[i] as the true nearest point).
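For reference, a sketch of that quadratic baseline (not a proposed answer; names are mine):

def all_nearest_neighbors(X, dist):
    # O(n^2) oracle calls: Y[i] is the point of X (other than X[i] itself)
    # that is closest to X[i].
    Y = []
    for i, x in enumerate(X):
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(x, X[j]))
        Y.append(X[best])
    return Y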
There's no such algorithm. Consider a metric where all pairs of points are at distance 1 except for one pair, which is closer. That pair cannot be found without consulting its particular entry of the distance oracle, which requires Omega(n^2) queries in the worst case.
Cover trees can be used to solve the exact nearest-neighbour problem. The time bound depends on the so-called doubling dimension of the metric.

How to find the smallest N dimensional simplex from a set of points that contains a given point?

I've looked all over Google and Stack Overflow but haven't found an answer to this problem yet. I keep finding results relating to the simplex method, or to finding the smallest arbitrary simplex (i.e. one whose vertices are not constrained). Nor can I think of an analytical solution.
Given a set of N-dimensional points, M, and an arbitrary N-dimensional point, q, how do I find the smallest N-dimensional simplex, S, that contains q as an interior point if the vertices of S must be in M? I'm sure I could solve it with an optimization, but I'd like an analytical solution if possible. A deterministic algorithm would be ok, as well.
I was originally using a K nearest neighbors approach, but then I realized it's possible that the N+1 nearest neighbors to q won't necessarily create a simplex that contains q.
Thanks in advance for any assistance provided.
I think you can do this in O(N^2) using an iterative process very similar to K-NN, but perhaps there is a more efficient way. This should return the minimal simplex in terms of the number of vertices.
For each coordinate i in q, we can iterate through all elements of M, comparing q_i with m_i. We select the two points in M that give the minimum positive difference and the minimum negative difference. If we repeat this process for every coordinate, we should have our minimal set S.
Am I understanding the problem correctly?

Algorithm for maximal hypervolume simplex

Given a set of points in D-dimensional space, what is the optimal algorithm for finding the maximal possible D-simplex all of whose vertices are in the set? Algebraically, it means that we have to find a subset of D + 1 points such that the determinant of the D x D matrix, whose rows are the coordinate differences between each of the first D points and the (D+1)-st point, has the greatest possible absolute value over the set.
I am sure that all D + 1 required points are vertices of the convex hull of the given point set, but I need an algorithm that does not use any convex hull algorithm, because such a simplex is itself what convex hull algorithms require as a starting polytope.
If it is not possible to obtain the simplex in less than exponential time, is there an algorithm that offers an adjustable trade-off between run time and the precision of an approximate solution?
I can't think of an exact solution, but you could probably get a reasonable approximation with an iterative approach. Note that I'm assuming that N is larger than D+1 here; if not then I have misunderstood the problem.
First, use a greedy algorithm to construct an initial simplex; choose the first two vertices to be the two most distant points, the next one to maximise your size measure in two dimensions, the next to maximise it in three, and so on. This has polynomial complexity in N and D.
Once you have the initial simplex you can switch to iterative improvement. For example, for a given vertex in the simplex you can iterate through the points not in it, measuring the change in the size measure that would result if you swapped them in. At the end you swap it with the one, if any, that gave the greatest increase. Doing this once for each vertex in the simplex is again polynomial in N and D.
To trade off between run-time cost and how large the resulting simplex is, simply choose how many of these improvement passes you're willing to do.
Now this is a relatively crude local optimisation algorithm so cannot guarantee that it will find the maximal simplex. However, such approaches have been found to result in reasonably good approximations to the solution of problems like the travelling salesman problem, in the sense that whilst they're not optimal, they result in a distance that isn't too much greater than that of the actual solution in most cases.
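A rough sketch of this greedy-plus-swap scheme, using sqrt(det(A A^T)) of the edge-vector matrix A as the size measure and doing a single improvement pass (both of those choices are mine):

import itertools
import math
import numpy as np

def simplex_measure(vertices):
    # Unnormalised j-dimensional volume of the simplex on `vertices`:
    # sqrt(det(A A^T)), where the rows of A are edge vectors from vertices[0].
    A = np.asarray(vertices[1:], dtype=float) - np.asarray(vertices[0], dtype=float)
    return math.sqrt(max(np.linalg.det(A @ A.T), 0.0))

def greedy_simplex(points, D):
    # Greedily grow a (D+1)-vertex simplex, then try one swap per vertex.
    pts = [np.asarray(p, dtype=float) for p in points]
    chosen = list(max(itertools.combinations(range(len(pts)), 2),
                      key=lambda ij: np.linalg.norm(pts[ij[0]] - pts[ij[1]])))
    while len(chosen) < D + 1:  # add the point maximising the measure
        rest = [i for i in range(len(pts)) if i not in chosen]
        chosen.append(max(rest, key=lambda i:
                          simplex_measure([pts[j] for j in chosen] + [pts[i]])))
    for v in range(len(chosen)):  # one local-improvement pass
        others = [pts[chosen[u]] for u in range(len(chosen)) if u != v]
        best = max(range(len(pts)),
                   key=lambda i: simplex_measure(others + [pts[i]]))
        if best not in chosen:
            chosen[v] = best
    return chosen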
Quickhull does not require a maximal simplex; that would be overkill (too hard a problem, and it would not guarantee that the next steps will be quicker).
I suggest selecting D+1 independent directions and taking the farthest point in each direction. This will give you a good starting simplex in O(N.D²) time. (The D² factor is because there are D+1 directions and evaluating the projection onto one direction takes D operations.)
Beware that the result can be degenerate (several vertices may coincide).
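A sketch of this seeding step; the particular D+1 directions (the D coordinate axes plus the negative diagonal) are my own choice:

import numpy as np

def starting_simplex(points):
    # Take the farthest point along each of D+1 independent directions.
    # Projecting all N points onto all D+1 directions costs O(N * D^2).
    P = np.asarray(points, dtype=float)         # shape (N, D)
    D = P.shape[1]
    dirs = np.vstack([np.eye(D), -np.ones(D)])  # D+1 independent directions
    return np.argmax(P @ dirs.T, axis=0)        # one point index per direction

As the answer warns, the returned indices can coincide (a degenerate simplex), so a caller should check for duplicates.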
My own approximation of the solution is to take one point, find the point farthest from it, and discard the first point (N = 1 point selected so far); then to select D - 1 more points, one at a time, each chosen so that the non-oriented (N-1)-dimensional hypervolume (the formula for S) of the current N-point selection is maximal. Finally I find the (D+1)-st point in such a way that the oriented D-dimensional hypervolume (the formula for V) of the resulting simplex is maximal in absolute value. The total complexity, to my mind, is about O(D * N * D^3) (D + 1 vertices of the simplex, up to N remaining candidate points, and D^3 as an upper estimate of the D x M, M in {1,2,...,D}, matrix multiplication complexity). The approach also lets us find the right number of linearly independent points, or else the dimension of the subspace together with a non-normalised, non-orthogonal basis for it. For large numbers of points and large dimensionalities, the complexity of the proposed algorithm does not predominate over the complexity of, say, the quickhull algorithm.
The implementation's repository.

Finding maximum weight sequence of points in positive quadrant

Given a sequence of weighted points in the positive quadrant, we have to find the maximum-weight subsequence of points such that each successive point is contained in the rectangle formed by the previous point and the origin.
I am interested in a DP algorithm for this problem.
This problem is really asking for the longest increasing subsequence. An O(N log N) algorithm for solving it is described on the Wikipedia page.
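For reference, a sketch of the standard O(N log N) algorithm being alluded to, which maximises the length of the subsequence; note that maximising total weight instead would need an adaptation (for example a Fenwick tree over compressed coordinates) or the O(N²) DP below:

import bisect

def lis_length(seq):
    # tails[l] is the smallest possible tail value of an increasing
    # subsequence of length l+1 found so far.
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # 4, e.g. 1, 4, 5, 6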
Easier O(N²) algorithm
I am assuming you have integer points. If you don't, you can use coordinate compression to place your points in an N x N grid.
So you have a two-dimensional number array W where each number is the weight assigned to that coordinate. You now have the recurrence:
// T(x,y) = "Maximum weight of the point sequence in sub-grid (x,y)"
T(0,0) = W(0,0)
T(0,y) = W(0,y)+T(0,y-1)
T(x,0) = W(x,0)+T(x-1,0)
T(x,y) = W(x,y)+max(T(x-1,y),T(x,y-1))
You can either memoize the recurrence T (O(N²) space) or compute it one row at a time (O(N) space). Both algorithms will use O(N²) time.
You can try computing this recurrence using pen and paper to see how it works.
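A direct translation of the one-row-at-a-time evaluation into Python (names are mine):

def max_weight_sequence(W):
    # W[x][y] is the weight at grid cell (x, y), 0 where there is no point.
    # Evaluates the recurrence in O(N^2) time and O(N) extra space.
    prev = None                                      # row x-1 of T
    for x in range(len(W)):
        row = []
        for y in range(len(W[x])):
            left = row[y - 1] if y > 0 else 0        # T(x, y-1)
            up = prev[y] if prev is not None else 0  # T(x-1, y)
            row.append(W[x][y] + max(left, up))
        prev = row
    return prev[-1]

print(max_weight_sequence([[1, 0, 2],
                           [0, 3, 0],
                           [4, 0, 5]]))  # 10: cells (0,0), (2,0), (2,2)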
