I have a set of N points (in particular this point are binary string) and for each of them I have a discrete metric (the Hamming distance) such that given two points, i and j, Dij is the distance between the i-th and the j-th point.
I want to find a subset of k elements (with k < N of course) such that the distance between this k points is the maximum as possibile.
In other words what I want is to find a sort of "border points" that cover the maximum area in the space of the points.
If k = 2 the answer is trivial because I can try to search the two most distant element in the matrix of distances and these are the two points, but how I can generalize this question when k>2?
Any suggest? It's a NP-hard problem?
Thanks for the answer
One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".
Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.
This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.
To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.
I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4 so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the not of the other (0s in one where the other has 1s and vice versa).
Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.
#mcdowella, I think that probably I don't explain very well my problem.
In my problem I have binary string and for each of them I can compute the distance to the other using the Hamming distance
In this way I have a distance matrix D that has a finite value in each element D(i,j).
I can see this distance matrix like a graph: infact, each row is a vertex in the graph and in the column I have the weight of the arc that connect the vertex Vi to the vertex Vj.
This graph, for the reason that I explain, is complete and it's a clique of itself.
For this reason, if i pick at random k vertex from the original graph I obtain a subgraph that is also complete.
From all the possible subgraph with order k I want to choose the best one.
What is the best one? Is a graph such that the distance between the vertex as much large but also much uniform as possible.
Suppose that I have two vertex v1 and v2 in my subgraph and that their distance is 25, and I have three other vertex v3, v4, v5, such that
d(v1, v3) = 24, d(v1, v4) = 7, d(v2, v3) = 5, d(v2, v4) = 22, d(v1, v5) = 14, d(v1, v5) = 14
With these distance I have that v3 is too far from v1 but is very near to v2, and the opposite situation for v4 that is too far from v2 but is near to v1.
Instead I prefer to add the vertex v5 to my subgraph because it is distant to the other two in a more uniform way.
I hope that now my problem is clear.
You think that your formulation is already correct?
I have claimed that the problem of finding k points such that the minimum distance between these points, or the sum of the distances between these points, is as large as possible is NP-complete, so there is no polynomial time exact answer. This suggests that we should look for some sort of heuristic solution, so here is one, based on an idea for clustering. I will describe it for maximising the total distance. I think it can be made to work for maximising the minimum distance as well, and perhaps for other goals.
Pick k arbitrary points and note down, for each point, the sum of the distances to the other points. For each other point in the data, look at the sum of the distances to the k chosen points and see if replacing any of the chosen points with that point would increase the sum. If so, replace whichever point increases the sum most and continue. Keep trying until none of the points can be used to increase the sum. This is only a local optimum, so repeat with another set of k arbitrary/random points in the hope of finding a better one until you get fed up.
This inherits from its clustering forebear the following property, which might at least be useful for testing: if the points can be divided into k classes such that the distance between any two points in the same class is always less than the distance between any two points in different classes then, when you have found k points where no local improvement is possible, these k points should all be from different classes (because if not, swapping out one of a pair of points from the same class would increase the sum of distances between them).
This problem is known as the MaxMin Diversity Problem (MMDP). It is known to be NP-hard. However, there are algorithms for giving good approximate solutions in reasonable time, such as this one.
I'm answering this question years after it was asked because I was looking for algorithms to solve the same problem, and had trouble even finding out what to call it.
Related
Supposing I have a fully connected graph of N nodes, and I know the weight between any two pairs of nodes. How do I select k nodes such that I maximize the minimum distance between any pair of nodes?
I mapped this problem as a more general case of the one I actually want to solve, which I've dubbed the cheating students problem (I don't know if it has an actual name).
Cheating Students problem:
Given an N.M matrix, how to select k cells with maximum distance between any pair of cells? You could assume the matrix is a classroom where k cheating students are giving a test. No pair of students should be close to each other, and thus we want to maximize the minimum distance between any pair.
Your generalized graph problem appears to be very closely related to the maximum independent set problem described in https://en.wikipedia.org/wiki/Independent_set_%28graph_theory%29, which is NP-complete. I can find a maximum independent set by running a binary chop to find the largest k for which an algorithm solving your graph problem returns a minimum distance greater than 1. Since finding a maximum independent set is hard, I think your generalized problem is hard.
I don't see an easy way to solve the matrix problem, either, but the related problem of packing circles as efficiently as possible on a 2-d surface of infinite size has been solved, and the answer is what is called a hexagonal packing (https://en.wikipedia.org/wiki/Circle_packing) which confusingly is based on a triangular tiling (https://en.wikipedia.org/wiki/Triangular_tiling - "The vertices of the triangular tiling are the centers of the densest possible circle packing").
So for finite matrices and numbers of students it is possible that arranging the students in widely separated rows, with the rows staggered so that each student is centered between the pair of students nearest them in the row in front of them and behind them, is not too far from optimal - or at least a good place from which to start some sort of hill-climbing attempt.
I've looked all over google and stack but haven't found an answer to this problem yet. I keep finding results relating to the simplex method or results for finding the smallest arbitrary simplex (i.e. the vertices are not constrained). Neither can I think of an analytical solution.
Given a set of N-dimensional points, M, and an arbitrary N-dimensional point, q, how do I find the smallest N-dimensional simplex, S, that contains q as an interior point if the vertices of S must be in M? I'm sure I could solve it with an optimization, but I'd like an analytical solution if possible. A deterministic algorithm would be ok, as well.
I was originally using a K nearest neighbors approach, but then I realized it's possible that the N+1 nearest neighbors to q won't necessarily create a simplex that contains q.
Thanks in advance for any assistance provided.
I think you can do this is O(N^2) using an iterative process very similar to K-NN, but perhaps there is a more efficient way. This should return the minimum simplex in terms of the number of vertices.
For each coordinate i in q, we can iterate through all elements of M, comparing the q_i and m_i. We will select the two points in M which give us the min positive difference and min negative difference. If we repeat this process for every coordinate, then we should have our min set S.
Am I understanding the problem correctly?
Problem :
Form a network, that is, all the bases should be reachable from every base.
One base is reachable from other base if there is a path of tunnels connecting bases.
Bases are suppose based on a 2-D plane having integer coordinates.
Cost of building tunnels between two bases are coordinates (x1,y1) and (x2,y2) is min{ |x1-x2|, |y1-y2| }.
What is the minimum cost such that a network is formed.
1 ≤ N ≤ 100000 // Number of bases
-10^9 ≤ xi,yi ≤ 10^9
Typical Kruskal's minimum spanning tree implementation.But u cannot store (10^5)^2 edges.
So how i should make my cost matrix , how to make a graph so i can apply Kruskal algorithm?
You should not store the whole graph as you don't actually need it. In fact in this case I think Prim's algorithm is more suitable in this case. You will not need all the edges at any single time, instead on each iteration you will update a min dist array of size N. Of course complexity will still be in the order of N**2 but at least memory will not be an issue. Also you can further use the specific way distance is computed to improve the complexity(using some ordered structure to store the points).
I believe the only edges that will ever be used (due to your cost function) will be from each base to at most 4 neighbours. The neighbours to use are the closest point with greater (or equal) x value, the closest point with smaller (or equal) x value, the closest point with greater (or equal) y value, the closest point with smaller (or equal) y value.
You can compute these neighbours efficiently by sorting the points according to each axis and then linking each point with the point ahead and behind it in sorted order.
It does not matter if there is more than one point at a particular value of coordinate.
There will therefore be only O(4n) edges for you to consider with Kruskal's algorithm.
I have a list of GPS points...but what am I asking for could also work for any X,Y coordinates.
In my use-case, I want to assign points in sets. Each point can belong in only one set and each set has a condition that distance between any of two points in the set is not greater than some constant...that means, all points of the set fit a circle of a specific diameter.
For a list of points, I want to find best (or at least some) arrangement in which there is minimal number of sets.
There will be sets with just a single point because other points around are already in different sets or simply because there are no points around (distance between them is greater than in the condition of the set)...what I want to avoid is inefficient set assignment where e.g. instead of finding ideal 2 sets, each having 30 points, I find 5 sets, one with 1 point, second with 40 points, etc...
All I'am capable of is a brute-force solution, compute all distances, build all posible set arrangements, sort them by number of sets and pick one with the least number of sets.
Is there a better approach?
The Problem here is NP-complete. What you try to solve is the max-clique problem combined with a set cover problem.
Your problem can be represented as a Graph G=(V,E), where the vertices are your coordinates and the edges the connections in distances. This graph can be made in O(n^2) time. Then you filter out all edges with a distance greater then your constant giving the graph G'.
With the the remaining graph G' you want to find all cliques (effectively solving max-clique). A clique is a fully connected set of vertices. Name this list of cliques S.
Now finding a minimal set of elements of S that cover all vertices V is the set cover problem.
Both the set cover problem and the max clique are NP complete. And therefore finding an optimal solution would take exponential time. You could look at approximation algorithms for these two problems.
I'm currently interested in generating random geometric graphs. For my particular problem, we randomly place node v in the unit square, and add an edge from v to node u if they have Euclidean distance <= D, where D=D(u,n) varies with u and the number of nodes n in the graph.
Important points:
It is costly to compute D, so I'd like to minimize the number of calls to this function.
The vast majority of the time, when v is added, edges uv will be added to only a small number of nodes u (usually 0 or 1).
Question: What is an efficient method for checking which vertices u are "close enough" to v?
The brute force algorithm is to compute and compare dist(v,u) and D(u,n) for all extant nodes u. This requires O(n2) calls to D.
I feel we should be able to do much better than this. Perhaps some kind of binning would work. We could divide the space up into bins, then for each vertex u, we store a list of bins where a newly placed vertex v could result in the edge uv. If v ends up placed outside of u's list of bins (which should happen most of the time), then it's too far away, and we don't need to compute D. This is somewhat of a off-the-top-of-my-head suggestion, and I don't know if it'd work well (e.g., there would be overhead in computing sufficiently close bins, which might be too costly), so I'm after feedback.
Based on your description of the problem, I would choose an R-tree as your data structure.
It allows for very fast searching by narrowing the set of vertices you need to run D against drastically. However, in the worst-case insertion, O(n) time is required. Thankfully, you're quite unlikely to hit the worst-case insertion with a typical data set.
I would probably just use a binning approach.
Say we cut the unit square in m x m subsquares (each having side length 1/m of course). Since you place your vertices uniformly at random (or so I assumed), every square will contain n / m^2 vertices on average.
Depending on A1, A2, m and n, you can probably determine the maximum radius you need to check. Say that's less than m. Then, after inserting v, you would need to check the square in which it landed, plus all adjacent squares. Anyway, this is a constant number of squares, so for every insertion you'll need to check O(n / m^2) other vertices on average.
I don't know the best value for m (as said, that depends on A1 and A2), but say it would be sqrt(n), then your entire algorithm could run in O(n) expected time.
EDIT
A small addition: you could keep track of vertices with many neighbors (so with high radius, which extends over multiple squares) and check them for every inserted vertex. There should only be few, so that's no problem.