Find the closest set of points to another set - algorithm

I have two sets, A and B, with N and M points in R^n respectively. I know that N < M always.
The distance between two points, P and Q, is denoted by d( P,Q ). As the problem is generic, this distance could be any function (e.g. Euclidean distance).
I want to find the closest subset of B to A. Mathematically I would say, I want to find the subset C of B with size N with the minimal global distance to A. The global distance between A and C is given by
D(A,C) = min([sum(d(P_i,Q_i),i=1,N) with P_i in A and Q_i in C* for C* in Permutations of C])
I've been thinking about this problem and I did an algorithm that get a local optimum, but not necessarily the optimal:
Step 1) Find the nearest point of each point of A in B. If there are no points repeated, I found the optimal subset and finish the algorithm. However, if there are points repeated, go to step 2.
Step 2) Compare their distances (of course I compare the distance between points with the same closest point). The point with the minimal distance keeps the point previously found and the others change their desired point for the "next" closest point that have not been selected for another point yet.
Step 3) Check if all the points are different. If they are, finish. If not, go back to step 2.
Any idea? Trying all the combinations is not a good one (I should calculate M!/(M-N)! global distances)

If M = N, this problem could be formulated as minimum-weight perfect matching in a bipartite graph or, in other words, an assignment problem. A well-known method for solving an assignment problem is the Hungarian algorithm.
To make Hungarian algorithm applicable in the case N < M, you could extend set A with (M-N) additional elements (each having zero distance to all elements of B).

Related

Find a subset of k most distant point each other

I have a set of N points (in particular this point are binary string) and for each of them I have a discrete metric (the Hamming distance) such that given two points, i and j, Dij is the distance between the i-th and the j-th point.
I want to find a subset of k elements (with k < N of course) such that the distance between this k points is the maximum as possibile.
In other words what I want is to find a sort of "border points" that cover the maximum area in the space of the points.
If k = 2 the answer is trivial because I can try to search the two most distant element in the matrix of distances and these are the two points, but how I can generalize this question when k>2?
Any suggest? It's a NP-hard problem?
Thanks for the answer
One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".
Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.
This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.
To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.
I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4 so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the not of the other (0s in one where the other has 1s and vice versa).
Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.
#mcdowella, I think that probably I don't explain very well my problem.
In my problem I have binary string and for each of them I can compute the distance to the other using the Hamming distance
In this way I have a distance matrix D that has a finite value in each element D(i,j).
I can see this distance matrix like a graph: infact, each row is a vertex in the graph and in the column I have the weight of the arc that connect the vertex Vi to the vertex Vj.
This graph, for the reason that I explain, is complete and it's a clique of itself.
For this reason, if i pick at random k vertex from the original graph I obtain a subgraph that is also complete.
From all the possible subgraph with order k I want to choose the best one.
What is the best one? Is a graph such that the distance between the vertex as much large but also much uniform as possible.
Suppose that I have two vertex v1 and v2 in my subgraph and that their distance is 25, and I have three other vertex v3, v4, v5, such that
d(v1, v3) = 24, d(v1, v4) = 7, d(v2, v3) = 5, d(v2, v4) = 22, d(v1, v5) = 14, d(v1, v5) = 14
With these distance I have that v3 is too far from v1 but is very near to v2, and the opposite situation for v4 that is too far from v2 but is near to v1.
Instead I prefer to add the vertex v5 to my subgraph because it is distant to the other two in a more uniform way.
I hope that now my problem is clear.
You think that your formulation is already correct?
I have claimed that the problem of finding k points such that the minimum distance between these points, or the sum of the distances between these points, is as large as possible is NP-complete, so there is no polynomial time exact answer. This suggests that we should look for some sort of heuristic solution, so here is one, based on an idea for clustering. I will describe it for maximising the total distance. I think it can be made to work for maximising the minimum distance as well, and perhaps for other goals.
Pick k arbitrary points and note down, for each point, the sum of the distances to the other points. For each other point in the data, look at the sum of the distances to the k chosen points and see if replacing any of the chosen points with that point would increase the sum. If so, replace whichever point increases the sum most and continue. Keep trying until none of the points can be used to increase the sum. This is only a local optimum, so repeat with another set of k arbitrary/random points in the hope of finding a better one until you get fed up.
This inherits from its clustering forebear the following property, which might at least be useful for testing: if the points can be divided into k classes such that the distance between any two points in the same class is always less than the distance between any two points in different classes then, when you have found k points where no local improvement is possible, these k points should all be from different classes (because if not, swapping out one of a pair of points from the same class would increase the sum of distances between them).
This problem is known as the MaxMin Diversity Problem (MMDP). It is known to be NP-hard. However, there are algorithms for giving good approximate solutions in reasonable time, such as this one.
I'm answering this question years after it was asked because I was looking for algorithms to solve the same problem, and had trouble even finding out what to call it.

Algorithm that finds the connectivity distance of a graph on uniform points on the unit square

Situation
Suppose we are given n points on the unit square [0, 1]x[0, 1] and a positive real number r. We define the graph G(point 1, point 2, ..., point n, r) as the graph on vertices {1, 2, ..., n} such that there is an edge connecting two given vertices if and only if the distance between the corresponding points is less than or equal to r. (You can think of the points as transmitters, which can communicate with each other as long as they are within range r.)
Given n points on the unit square [0, 1]x[0, 1], we define the connectivity distance as the smallest possible r for which G(point 1, point 2, ..., point n, r) is connected.
Problem 1) find an algorithm that determines if G(point 1, point 2, ..., point n, r) is connected
Problem 2) find an algorithm that finds the connectivity distance for any n given points
My partial solution
I have an algorithm (Algorithm 1) in mind for problem 1. I haven't implemented it yet, but I'm convinced it works. (Roughly, the idea is to start from vertex 1, and try to reach all other vertices through the edges. I think it would be somewhat similar to this.)
All that remains is problem 2. I also have an algorithm in mind for this one. However, I think it is not efficient time wise. I'll try to explain how it works:
You must first convince yourself that the connectivity distance rmin is necessarily the distance between two of the given points, say p and q. Hence, there are at most *n**(n-1)/2 possible values for rmin.
So, first, my algorithm would measure all *n**(n-1)/2 distances and store them (in an array in C, for instance) in increasing order. Then it would use Algorithm 1 to test each stored value (in increasing order) to see if the graph is connected with such range. The first value that does the job is the answer, rmin.
My question is: is there a better (time wise) algorithm for problem 2?
Remarks: the points will be randomly generated (something like 10000 of them), so that's the type of thing the algorithm is supposed to solve "quickly". Furthermore, I'll implement this in C. (If that makes any difference.)
Here is an algorithm which requires O(n2) time and O(n) space.
It's based on the observation that if you partition the points into two sets, then the connectivity distance cannot be less than the distance of the closest pair of points one from each set in the partition. In other words, if we build up the connected graph by always adding the closest point, then the largest distance we add will be the connectivity distance.
Create two sets, A and B. Put a random point into A and all the remaining points into B.
Initialize r (the connectivity distance) to 0.
Initialize a map M with the distance to every point in B of the point in A.
While there are still points in B:
Select the point b in B whose distance M[b] is the smallest.
If M[b] is greater than r, set r to M[b]
Remove b from B and add it to A.
For each point p in M:
If p is b, remove it from M.
Otherwise, if the distance from b to p is less than M[p], set M[p] to that distance.
When all the points are in A, r will be the connectivity distance.
Each iteration of the while loop takes O(|B|) time, first to find the minimum value in M (whose size is equal to the size of B); second, to update the values in M. Since a point is moved from B to A in each iteration, there will be exactly n iterations, and thus the total execution time is O(n2).
The algorithm presented above is an improvement to a previous algorithm, which used an (unspecified) solution to the bichromatic closest pair (BCP) problem to recompute the closest neighbour to A in every cycle. Since there is an O(n log n) solution to BCP, this implied a solution to the original problem in O(n2 log n). However, maintaining and updating the list of closest points is actually much simpler, and only requires O(n). Thanks to #LajosArpad for a question which triggered this line of thought.
I think your ideas are reasonably good, however, I have an improvement for you.
In fact you build up an array based on measurement and you sort your array. Very nice. At least with not too many points.
The number of n(n-1)/2 is a logical consequence of your pairing requirement. So, for 10000 elements, you will have 49995000 elements. You will need to increase significantly the speed! Also, this number of elements would eat a lot of your memory storage.
How can you achieve greater speed?
First of all, don't build arrays. You already have an array. Secondly, you can easily solve your problem by traversing. Let's suppose you have a function, which determines whether a given distance is enough to connect all the nodes, lets call this function "valid". It is not enough, because you need to find the minimal possible value. So, if you don't have more information about the nodes prior the execution of the algorithm, then my suggestion is this solution:
lowerBound <- 0
upperBound <- infinite
i <- 0
while i < numberOfElements do
j <- i + 1
while j < numberOfElements do
distance <- d(elements[i], elements[j])
if distance < upperBound and distance > lowerBound then
if valid(distance) then
upperBound <- distance
else
lowerBound <- distance
end if
end if
j <- j + 1
end while
i <- i + 1
end while
After traversing all the elements the value of upperBound will hold the smallest distance which still connects the network. You didn't store all the distances, as they were far too many and you have solved your problem in a single cycle. I hope you find my answer helpful.
If some distance makes graph connected, any larger distance would make it connected too. To find minimal connecting distance just sort all distances and use binary search.
Time complexity is O(n^2*log n), space complexity is O(n^2).
You can start with some small distance d then check for connectivity. If the Graph is connected, you're done, if not, increment d by a small distance then check again for connectivity.
You also need a clever algorithm to avoid O(N^2) in case N is big.

Find sub-array of objects with maximum distance between elements

Let be an array of objects [a, b, c, d, ...] and a function distance(x, y) that gives a numeric value showing how 'different' are objects x and y.
I'm looking for an algorithm to find the subset of the array of length n that maximizes the minimum difference between that subset element.
Of course, I can't simply sort the array by the minimum of the differences with other elements and take the n highest entries, since removing an element can very well change the distances. For instance, if a=b, then removing a means the minimum distance of b with another element will change dramatically.
So far, the only solution I could find was wether to iteratively remove the element with the lowest minimum distance and re-calculate the distance at each iteration, until only n elements are left, or, vice-versa, to iteratively pick new elements, recalculate the distances, add the new pick or replace an existing one based on the distance minimums.
Does anybody know how I could get the same results without those iterations?
PS: here is an example, the matrix shows the 'distance' between each element...
a b c d
a - 1 3 2
b 1 - 4 2
c 3 4 - 5
d 2 2 5 -
If we'd keep only 2 elements, that would be c and d; if we'd keep 3, that would be a or b, c and d.
This problem is NP-hard, so no-one knows an efficient (polynomial time) algorithm for solving it.
Here's a quick sketch that it is NP-hard, by reduction from CLIQUE.
Suppose we have an instance of CLIQUE in the form of a graph G and a number n and we want to know whether there is a clique of size n in G. Construct a distance matrix d such that d(i, j) = 1 if vertices i and j are connected in G, or 0 if they are not. Then find a subset of the vertices of G of size n that maximizes the minimum distance between elements (your problem). If the minimum distance between vertices in this subset is 1, then G has a clique of size n; otherwise it does not.
As Gareth said this is an NP-hard problem, however there has been a lot of research into solving these kind of problems and as such better methods than brute force have been found. Unfortunately this is such a large area that you could spend forever looking at the possible implementations of a solutions.
However if you are interested in a heuristic way of solving this I would suggest looking into Ant Colony Optimization (ACO) which has proven fairly effective at finding optimum paths within graphs.

Finding the distance between two sets in Manhattan distance

I'm learning for the test from algorithms and I spotted problem that I cannot deal with for a few days. So I'm writing here for help.
For a given two disjoint sets on plane:
G={(x_1^G, y_1^G), (x_2^G, y_2^G), ..., (x_n^G, y_n^G)}
D={(x_1^D, y_1^D), (x_2^D, y_2^D), ..., (x_n^D, y_n^D)}
Where for every 1 <= i, j <= n we have y_i^D < y_j^G, so G is above D.
Find an effective algorithm that
counts the distance between them
defined as:
d(G,D) = min{ d(a,b): a \in G and b\in D },
where d(a,b) = |x_a - x_b| + |y_a - y_b|
O(n^2) is trivial, so it is not the answer.
I hope the solution isn't too hard since it is from materials to review before the test. Can anybody help?
I think it will appear that this is a special case of some common problem. But if it is a special case, maybe the solution can be easier?
There are a few different ways to do this in O(n log n) time.
One: Compute the manhattan distance Voronoi diagram of the G points and build a point location data structure based on that. This takes O(n log n) time. For each D point, find the closest G point using the point location data structure. This takes O(log n) time per D point. Take the min of the distances between the pairs you just found and that's your answer.
Two: You can adapt Fortune's algorithm to this problem; just keep separate binary trees for D and G points. Kind of annoying to describe.
The next idea computes the distance of the closest pair for the infinity-norm, which is max(|x1-x2|, |y1-y2|). You can tilt your problem 45 degrees (substituting u = x-y, v = x+y) to get it into the appropriate form.
Three (variant of two): Sort all of the points by y coordinate. Maintain d, the distance between the closest pair seen so far. We'll sweep a line from top to bottom, maintaining two binary search trees, one of G points and one of D points. When a point is d or farther above the sweep line, we remove it from its binary search tree. When a point is first encountered by the sweep line, say a D point, we (1) check the G binary search tree to see if it has any elements whose x-coordinate is within d of the new point's, updating d as necessary, and (2) insert the new point into D's binary search tree. Each point only causes a constant number of binary search tree operations plus a constant amount of additional work, so the sweep is O(n log n). The sort is too, unsurprisingly, so our overall time complexity is as desired.
You can probably make a divide-and-conquer strategy work too based on similar ideas to three.

closest pair algorithm

I am trying to understand the closest pair algorithm. I understand about dividing the set in half. But I am having trouble understanding how to recursively compute the closest pair. I understand recursion, but do not understand how to compute the closest pair by recursion. If you have (1,2)(1,11)(7,8) how would recursion work on these?
the basic idea of the algorithm is this.
You have a set of points P and you want to find the two points in P that have the shortest distance between them.
A simple brute-force approach would go through every pair in P, calculate the distance, and then take the one pair that has the shortest distance. This is an O(n²) algorithm.
However it is possible to better by the algorithm you are talking about. The idea is first to order all the points according to one of the coordinates, e.g. the x-coordinate. Now your set P is actually a sorted list of points, sorted by their x-coordinates. The algorithm takes now as its input not a set of points, but a sorted list of points. Let's call the algorithm ClosestPair(L), where L is the list of points given as the argument.
ClosestPair(L) is now implemented recursively as follows:
Split the list L at its middle, obtaining Lleft and Lright.
Recursively solve ClosestPair(Lleft) and ClosestPair(Lright).
Let the corresponding shortest distances obtained by δleft and δright.
Now we know that the shortest distance in the original set (represented by L) is either one of the two δs, or then it is a distance between a point in Lleft and a point in Lright.
Se we need still to check if there is a shorter distance between two points from the left and right subdivision. The trick is that because we know the distance must be smaller than δleft and δright, it is enough to consider from both subdivisions points that are not farther than min(δleft, δright) from the dividing line (the x-coordinate you used to split the original list L). This optimization makes the procedure faster than the brute-force approach, in practice O(n log n).
If you mean this algorithm you do the following:
Sort points: (1,2) (1,11) (7,8)
Build two subsets: (1,2) (1,11) and (7,8)
Run the algorithm on (1,2) (1,11) and on (7,8) separately <= this is where the recursion comes. The result is dLmin = 9 and dRmin = infinity (there is no second point on the right)
dLRmin = sqrt(45)
result = min(dLmin, dRmin, dLRmin) = sqrt(45)
The recursion consists of the same steps as above. E.g. the call with (1,2) and (1,11) does:
Sort points: (1,2) (1,11)
Build two subsets: (1,2) and (1,11)
Run the algorithm on (1,2) and on (1,11) separately <= again recursion calls. The result is dLmin = infinity and dRmin = infinity
dLRmin = 9
result = min(dLmin, dRmin, dLRmin) = 9
I think I know what algorithm you're talking about. I could recount it here myself, but the description given in Introduction to Algorithms is by far superior to what I can produce. And that chapter is also available on google books: enjoy. (Everybody else can find problem description there too)
Maybe Linear-time Randomized Closest Pair algorithm will help. There you can find the pair in expected time O(n).

Resources