I want to partition a connected graph into 2 sets of vertices, such that the difference between the sums of edge weights within the two sets is minimized.
For example, if a graph consists of vertices 1,2,3,4,5, consider this partition:
Set A - {1,2,3}
Set B - {4,5}
Sum A = w(1,2) + w(2,3) + w(1,3)
Sum B = w(4,5)
Diff = abs(Sum A - Sum B) ... (This is one possible partition difference.)
So, how do I find a partition such that the difference is minimized?
This problem is NP-hard because it is at least as hard as the partition problem.
Sketch of proof
Consider a partition problem where we have the numbers {1,2,3,4,5} that we wish to partition into two sets with as small a difference as possible.
Construct the graph shown below:
If someone comes up with an algorithm to solve your problem, you can use that algorithm to partition this graph into two sets such that the difference between the sums of weights within each set is minimized.
In the optimal solution the blue and green nodes must be placed into different sets (because we have an edge with weight infinity connecting them). The remaining nodes will be connected to either the blue or green nodes. Call the ones connected to blue set1, and the ones connected to green set2. This partition will give the optimal answer to the partition problem.
Greedy algorithm
However, depending on the structure of your graph and values of the weights you may well be able to do a reasonable job.
For example, you could try:
Choose a random permutation of vertices
Loop through each vertex and assign to set 1 or 2 according to whichever minimises the objective function (which is just evaluated over the vertices assigned so far)
Repeat this algorithm a few times and keep track of the best score.
When you get down to just a few vertices left to be assigned, you could also try a brute force evaluation of all possible partitions of the remaining vertices to search for a good solution.
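A minimal Python sketch of that greedy pass; the representation of the graph as a dict mapping frozenset({u, v}) to an edge weight and the restarts parameter are my own choices:

    import random

    def greedy_partition(vertices, weight, restarts=10):
        """Greedy randomized partition sketch.

        weight maps frozenset({u, v}) -> edge weight (missing keys = no edge).
        Returns (best_diff, set1, set2)."""
        def internal_weight(s):
            # Sum of weights of edges whose endpoints both lie inside s.
            return sum(w for e, w in weight.items() if e <= s)

        best = (float("inf"), None, None)
        for _ in range(restarts):
            order = list(vertices)
            random.shuffle(order)                  # random permutation of vertices
            set1, set2 = set(), set()
            for v in order:
                # Evaluate the objective over the vertices assigned so far
                # for both possible placements of v, and keep the better one.
                d1 = abs(internal_weight(set1 | {v}) - internal_weight(set2))
                d2 = abs(internal_weight(set1) - internal_weight(set2 | {v}))
                (set1 if d1 <= d2 else set2).add(v)
            diff = abs(internal_weight(set1) - internal_weight(set2))
            if diff < best[0]:
                best = (diff, set(set1), set(set2))
        return best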
The following algorithmic sketch is based on Iterated Local Search. The idea is to greedily optimize the current solution until a local optimal solution is found. Then disturb this solution to overcome the local optimal solution. Always keep track of the best solution found so far.
Randomly divide the set of vertices into V1 and V2
Iterate
Calculate the costs (edge-weight-difference) of your current division
Select two random vertices v1 from V1 and v2 from V2
Check whether swapping these vertices (move v1 to V2 and v2 to V1) would lead to lower costs (edge-weight-difference). If so, swap vertices v1 and v2, else keep the sets.
Disturb a converged solution by swapping half of the vertices in V1 with half of the vertices in V2. Goto 2.
Iterated Local Search is a surprisingly effective and practical heuristic -- even for NP-complete problems.
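A rough Python sketch of that loop, assuming the same frozenset-keyed edge-weight dict as in the greedy sketch above; the stall counter used to decide that a solution has converged is an assumption, and the perturbation simply swaps half of each set as described:

    import random

    def ils_partition(vertices, weight, iterations=10000, stall_limit=200):
        """Iterated Local Search sketch for the edge-weight-difference problem."""
        def internal_weight(s):
            return sum(w for e, w in weight.items() if e <= s)

        def cost(a, b):
            return abs(internal_weight(a) - internal_weight(b))

        verts = list(vertices)
        random.shuffle(verts)
        v1, v2 = set(verts[:len(verts) // 2]), set(verts[len(verts) // 2:])
        best = (cost(v1, v2), set(v1), set(v2))
        stalled = 0
        for _ in range(iterations):
            a, b = random.choice(list(v1)), random.choice(list(v2))
            new1, new2 = (v1 - {a}) | {b}, (v2 - {b}) | {a}
            if cost(new1, new2) < cost(v1, v2):      # swap only if the costs drop
                v1, v2, stalled = new1, new2, 0
            else:
                stalled += 1
            if cost(v1, v2) < best[0]:
                best = (cost(v1, v2), set(v1), set(v2))
            if stalled >= stall_limit:               # treat as converged: perturb
                k = max(1, min(len(v1), len(v2)) // 2)
                move1 = set(random.sample(list(v1), k))
                move2 = set(random.sample(list(v2), k))
                v1, v2 = (v1 - move1) | move2, (v2 - move2) | move1
                stalled = 0
        return best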
Related
I have a set of N points (in particular, these points are binary strings) and for each of them I have a discrete metric (the Hamming distance), such that given two points i and j, D_ij is the distance between the i-th and the j-th point.
I want to find a subset of k elements (with k < N of course) such that the distances between these k points are as large as possible.
In other words, I want to find a sort of set of "border points" that covers as much of the point space as possible.
If k = 2 the answer is trivial: I can search the distance matrix for the two most distant elements, and those are the two points. But how can I generalize this when k > 2?
Any suggestions? Is it an NP-hard problem?
Thanks for any answer.
One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".
Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.
This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.
To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.
I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4, so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the complement of the other (0s where the other has 1s and vice versa).
Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.
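As a purely illustrative sanity check of the copy/complement trick in the previous two paragraphs (not part of the argument; the vector length and count are arbitrary):

    import random

    k, n = 10000, 5
    base = [[random.randint(0, 1) for _ in range(k)] for _ in range(n)]
    near = base[0][:]                        # a copy: distance 0 from base[0]
    far = [1 - b for b in base[0]]           # the complement: distance k from base[0]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    print(hamming(base[0], base[1]))         # typically close to k/2 = 5000
    print(hamming(base[0], near))            # 0
    print(hamming(base[0], far))             # k = 10000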
@mcdowella, I think I probably didn't explain my problem very well.
In my problem I have binary strings, and for each of them I can compute the distance to the others using the Hamming distance.
In this way I have a distance matrix D that has a finite value in each element D(i,j).
I can view this distance matrix as a graph: in fact, each row is a vertex in the graph, and column j holds the weight of the edge that connects vertex Vi to vertex Vj.
This graph, for the reason I explained, is complete, and so it is itself a clique.
For this reason, if I pick k vertices at random from the original graph, I obtain a subgraph that is also complete.
From all the possible subgraphs of order k, I want to choose the best one.
What is the best one? It is a subgraph in which the distances between the vertices are as large, but also as uniform, as possible.
Suppose I have two vertices v1 and v2 in my subgraph whose distance is 25, and three other vertices v3, v4, v5 such that
d(v1, v3) = 24, d(v1, v4) = 7, d(v2, v3) = 5, d(v2, v4) = 22, d(v1, v5) = 14, d(v2, v5) = 14
With these distances, v3 is far from v1 but very near to v2, and v4 is in the opposite situation: far from v2 but near to v1.
Instead, I would prefer to add vertex v5 to my subgraph because it is distant from the other two in a more uniform way.
I hope that now my problem is clear.
Do you think your formulation already captures this?
I have claimed that the problem of finding k points such that the minimum distance between these points, or the sum of the distances between these points, is as large as possible is NP-hard, so there is presumably no polynomial-time exact algorithm. This suggests that we should look for some sort of heuristic solution, so here is one, based on an idea for clustering. I will describe it for maximising the total distance. I think it can be made to work for maximising the minimum distance as well, and perhaps for other goals.
Pick k arbitrary points and note down, for each point, the sum of the distances to the other points. For each other point in the data, look at the sum of the distances to the k chosen points and see if replacing any of the chosen points with that point would increase the sum. If so, replace whichever point increases the sum most and continue. Keep trying until none of the points can be used to increase the sum. This is only a local optimum, so repeat with another set of k arbitrary/random points in the hope of finding a better one until you get fed up.
This inherits from its clustering forebear the following property, which might at least be useful for testing: if the points can be divided into k classes such that the distance between any two points in the same class is always less than the distance between any two points in different classes then, when you have found k points where no local improvement is possible, these k points should all be from different classes (because if not, swapping out one of a pair of points from the same class would increase the sum of distances between them).
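A compact Python sketch of that local search; dist is assumed to be an N×N distance matrix, and for brevity it accepts the first improving swap rather than the single best one:

    import random

    def max_sum_subset(dist, k, restarts=20):
        """Local-search sketch: choose k indices maximising the sum of pairwise distances."""
        n = len(dist)
        best_score, best_set = -1, None
        for _ in range(restarts):
            chosen = set(random.sample(range(n), k))
            improved = True
            while improved:
                improved = False
                for p in set(range(n)) - chosen:
                    for q in list(chosen):
                        others = chosen - {q}
                        # Change in total distance if q is swapped out for p.
                        delta = sum(dist[p][o] for o in others) - sum(dist[q][o] for o in others)
                        if delta > 0:
                            chosen.remove(q)
                            chosen.add(p)
                            improved = True
                            break
                    if improved:
                        break
            score = sum(dist[a][b] for a in chosen for b in chosen if a < b)
            if score > best_score:
                best_score, best_set = score, set(chosen)
        return best_set, best_score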
This problem is known as the MaxMin Diversity Problem (MMDP). It is known to be NP-hard. However, there are algorithms for giving good approximate solutions in reasonable time, such as this one.
I'm answering this question years after it was asked because I was looking for algorithms to solve the same problem, and had trouble even finding out what to call it.
I have the following complete weighted graph, where each weight represents the probability that a vertex belongs to the same category as the adjacent vertex. I know a priori the category to which some of the vertices belong; how can I classify every other vertex?
In more detail, the problem is as follows: from all the vertices N and clusters C, we have a set of nodes for which we know the specific cluster with certainty: P(v_n | C_n) = 1. From the given graph we also know, for each node, the probability of every other node belonging to the same cluster as it: P(v_n1 ∩ C_n2). From this, how can we estimate the cluster for every other node?
Let w_i be a vector where w_i[j] is the probability of node j being in the cluster, at iteration i.
We define w_i:
w_0[j] = 1 if j is a given node in the class, 0 otherwise
w_{i}[j] = P(j | w_{i-1})
Where: P(j | w_{i-1}) is the probability j being in the cluster, assuming we know the probabilities for each other node k to be in it, as w_{i-1}[k].
We can calculate the above probability:
P(j | w_{i-1}) = 1 - (1 - w_{i-1}[0]*c(0,j)) * (1 - w_{i-1}[1]*c(1,j)) * ... * (1 - w_{i-1}[n-1]*c(n-1,j))
where:
w_{i-1} is the output of last iteration.
c(x,y) is the weight of edge (x,y)
c(x,x) = 1
Repeat until convergence, and in the converged vector (let it be w), the probability of j being in the cluster is w[j]
Explanation for the probability function:
In order for a node NOT to be in the set, all the other nodes must "decide" not to share it.
So, the probability for that happening is:
(1 - w_{i-1}[0]*c(0,j)) * (1 - w_{i-1}[1]*c(1,j)) * ... * (1 - w_{i-1}[n-1]*c(n-1,j))
where the first factor is the probability that node 0 doesn't share j, the second that node 1 doesn't share j, and so on up to node n-1.
In order to be in the class, at least one node needs to "share", so the probability of that happening is the complementary one, which is the formula we derived for P(j | w_{i-1}).
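A direct Python rendering of that update rule; the convergence tolerance, the iteration cap, and the matrix format for c are my own choices:

    def propagate(c, seeds, tol=1e-9, max_iter=1000):
        """Iterate w_i[j] = 1 - prod_k (1 - w_{i-1}[k] * c(k, j)) until convergence.

        c     -- n x n matrix of edge weights, with c[x][x] = 1
        seeds -- indices of the nodes known to be in the cluster (w_0[j] = 1 there)."""
        n = len(c)
        w = [1.0 if j in seeds else 0.0 for j in range(n)]
        for _ in range(max_iter):
            new_w = []
            for j in range(n):
                p_not_in = 1.0
                for k in range(n):
                    p_not_in *= 1.0 - w[k] * c[k][j]   # node k does not "share" node j
                new_w.append(1.0 - p_not_in)
            if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
                return new_w
            w = new_w
        return w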
You should start from the definition of the result. How should you show the probabilities of belonging?
The result, IMHO, should be a set of categories and a table: rows for vertices and columns for categories, with the cells holding the probabilities that the vertex belongs to that category.
Your graph can set some probabilities of belonging only if you already have some known starting probabilities. I.e., that table would already be partly filled.
While filling the table according to the starting values and the edge weights, we will surely reach situations where we get different probabilities for the same cell, arriving at it by different paths. One more point should be settled: can we change the starting values in the table, or are they fixed? The same question applies to the weights of the edges.
As it stands, the task is only partly defined, and the defined part is very, very small. You don't even know the number of categories!
After you set all these rules and numbers, the rest is quite straightforward: use the Gaussian method of least squares. As for the iterative way, be careful: you don't know beforehand whether the solution exists or is stable. If it isn't, the iteration won't converge, and the whole piece of code you wrote for it is for nothing. With the least-squares method you get a system of linear equations, and the standard algorithms solve it in all cases. At the end you have not only the solution, but also the possible error for every final value.
I'd like to solve a harder version of the minimum spanning tree problem.
There are N vertices. Also there are 2M edges numbered by 1, 2, .., 2M. The graph is connected, undirected, and weighted. I'd like to choose some edges to make the graph still connected and make the total cost as small as possible. There is one restriction: an edge numbered by 2k and an edge numbered by 2k-1 are tied, so both should be chosen or both should not be chosen. So, if I want to choose edge 3, I must choose edge 4 too.
So, what is the minimum total cost to make the graph connected?
My thoughts:
Let's call two tied edges 2k-1 and 2k an edge set.
Let's call an edge valid if it merges two different components.
Let's call an edge set good if both of the edges are valid.
First add exactly m good edge sets in increasing order of cost. Then iterate over all the edge sets in increasing order of cost, and add a set if at least one of its edges is valid. Iterate m from 0 to M.
Run Kruskal's algorithm with a variation: the cost of an edge e varies.
If an edge set which contains e is good, the cost is: (the cost of the edge set) / 2.
Otherwise, the cost is: (the cost of the edge set).
I cannot prove whether Kruskal's algorithm is still correct when the costs change like this.
Sorry for the poor English, but I'd like to solve this problem. Is it NP-hard or something, or is there a good solution? :D Thanks in advance!
As I speculated earlier, this problem is NP-hard. I'm not sure about inapproximability; there's a very simple 2-approximation (split each pair in half, retaining the whole cost for both halves, and run your favorite vanilla MST algorithm).
Given an algorithm for this problem, we can solve the NP-hard Hamilton cycle problem as follows.
Let G = (V, E) be the instance of Hamilton cycle. Clone each vertex, denoting the clone of vi by vi'. We duplicate each edge e = {vi, vj} (making a multigraph; we can do this reduction with simple graphs at the cost of clarity), and, letting v0 be an arbitrary original vertex, we pair one copy with {v0, vi'} and the other with {v0, vj'}.
No MST can use fewer than n pairs, one to connect each cloned vertex to v0. The interesting thing is that the other halves of the pairs of a candidate with n pairs like this can be interpreted as an oriented subgraph of G where each vertex has out-degree 1 (use the index in the cloned bit as the tail). This graph connects the original vertices if and only if it's a Hamilton cycle on them.
There are various ways to apply integer programming. Here's a simple one and a more complicated one. First we formulate a binary variable x_i for each i that is 1 if edge pair 2i-1, 2i is chosen. The problem template looks like
minimize sum_i w_i x_i (drop the w_i if the problem is unweighted)
subject to
<connectivity>
for all i, x_i in {0, 1}.
Of course I have left out the interesting constraints :). One way to enforce connectivity is to solve this formulation with no constraints at first, then examine the solution. If it's connected, then great -- we're done. Otherwise, find a set of vertices S such that there are no edges between S and its complement, and add a constraint
sum_{i such that edge pair i connects S with its complement} x_i >= 1
and repeat.
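A sketch of that cut-generation loop using the PuLP modelling library; the input format, the component check, and the handling of an infeasible model are assumptions, and a real implementation would need more care around degenerate cases:

    import pulp

    def solve_tied_edges(n, pairs):
        """Cut-generation sketch for the tied-edge connectivity problem.

        n     -- number of vertices, labelled 0 .. n-1
        pairs -- list of (weight, [(u1, v1), (u2, v2)]); weight is the total cost
                 of the two tied edges, which are chosen or dropped as a unit."""
        prob = pulp.LpProblem("tied_edges", pulp.LpMinimize)
        x = [pulp.LpVariable("x_%d" % i, cat="Binary") for i in range(len(pairs))]
        prob += pulp.lpSum(w * x[i] for i, (w, _) in enumerate(pairs))

        while True:
            prob.solve(pulp.PULP_CBC_CMD(msg=False))
            if pulp.LpStatus[prob.status] != "Optimal":
                return None                          # no connected selection exists
            chosen = [i for i in range(len(pairs)) if x[i].value() > 0.5]

            # Connected component of vertex 0 under the chosen edge pairs.
            adj = {v: set() for v in range(n)}
            for i in chosen:
                for u, v in pairs[i][1]:
                    adj[u].add(v)
                    adj[v].add(u)
            comp, stack = {0}, [0]
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if v not in comp:
                        comp.add(v)
                        stack.append(v)
            if len(comp) == n:
                return chosen                        # connected: done

            # Otherwise add a cut: at least one pair crossing comp must be chosen.
            crossing = [i for i, (_, es) in enumerate(pairs)
                        if any((u in comp) != (v in comp) for u, v in es)]
            prob += pulp.lpSum(x[i] for i in crossing) >= 1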
Another way is to generate constraints like this inside of the solver working on the linear relaxation of the integer program. Usually MIP libraries have a feature that allows this. The fractional problem has fractional connectivity, however, which means finding min cuts to check feasibility. I would expect this approach to be faster, but I must apologize as I don't have the energy to describe it in detail.
I'm not sure if it's the best solution, but my first approach would be a search using backtracking:
Of all edge pairs, mark those that could be removed without disconnecting the graph.
Remove one of these pairs and find the optimal solution for the remaining graph.
Put the pair back and remove the next one instead, and find the best solution for that.
This works, but it is slow and inelegant. It might be possible to rescue this approach, though, with a few adjustments that avoid unnecessary branches.
Firstly, the edge pairs that could still be removed form a set that only shrinks as you go deeper. So, in the next recursion, you only need to check the pairs from the previous set of possibly removable edge pairs. Also, since the order in which you remove the edge pairs doesn't matter, you shouldn't reconsider any edge pairs that were already considered before.
Then, checking whether two nodes are connected is expensive. If you cache the alternative route for an edge, you can check relatively quickly whether that route still exists. If it doesn't, you have to run the expensive check, because even though that one route ceased to exist, there might still be others.
Then, some more pruning of the tree: your set of removable edge pairs gives a lower bound on the weight of the optimal solution. Further, any existing solution gives an upper bound on the optimal solution. If a set of removable edges doesn't even have a chance of beating the best solution you have found so far, you can stop there and backtrack.
Lastly, be greedy. Using a regular greedy algorithm will not give you an optimal solution, but it will quickly raise the bar for any solution, making pruning more effective. Therefore, attempt to remove the edge pairs in decreasing order of weight. A simplified backtracking sketch along these lines follows below.
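This sketch applies the greedy ordering and the pruning bound, but omits the cached-alternative-route optimisation; the input format is an assumption:

    def connected(n, edges):
        """BFS connectivity check over an edge list of (u, v) tuples, vertices 0 .. n-1."""
        adj = {v: [] for v in range(n)}
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    def max_removable_weight(n, pairs):
        """pairs is a list of (weight, [(u1, v1), (u2, v2)]). Returns the largest total
        weight of pairs that can be dropped while the remaining graph stays connected."""
        pairs = sorted(pairs, key=lambda p: -p[0])       # greedy ordering: heaviest first
        best = 0

        def edges_without(removed):
            return [e for i, (_, es) in enumerate(pairs) if i not in removed for e in es]

        def recurse(idx, removed, removed_weight):
            nonlocal best
            best = max(best, removed_weight)
            if removed_weight + sum(w for w, _ in pairs[idx:]) <= best:
                return                                   # cannot beat the best solution
            for i in range(idx, len(pairs)):
                trial = removed | {i}
                if connected(n, edges_without(trial)):
                    recurse(i + 1, trial, removed_weight + pairs[i][0])

        recurse(0, frozenset(), 0)
        return best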
We have a directed weighted graph where an edge between two nodes can have more than one possible cost value (more precisely, at most 2 costs). I need to use a time-dependent variant of Dijkstra's algorithm that can handle two possible ways of getting from one node to another, the cost between the nodes (edge cost) being dependent on the time at which we arrive at the source node and the type of edge we are about to use. When traversing from one node to the other, only one of these edges is picked and its cost is added to the same total cost.
I currently model the two possible costs for an edge as two separate edges between the same nodes.
There is a similar problem I found here and it was suggested to augment the graph by duplicating the nodes. However, this does not allow returning to the original graph and implies the overhead of, well, duplicating all the nodes and possibly edges between them and original nodes.
Do you have any suggestions as to how to tackle this problem with as little overhead as possible? (The original graph is expected to be huge)
Thanks
Edit:
I provided more details about the problem in the first paragraph
You can safely ignore the larger of the two costs for the algorithm's purposes.
Suppose there is a shortest path that uses the larger cost between two vertices; you could change it to use the smaller cost and the path would cost less, which contradicts the assumption.
I think you can hack step 3 of Dijkstra's algorithm:
For the current node, consider all of its unvisited neighbors and calculate their tentative distances. Compare the newly calculated tentative distance to the current assigned value and assign the smaller one. For example, if the current node A is marked with a distance of 6, and the edge connecting it with a neighbor B has length 2, then the distance to B (through A) will be 6 + 2 = 8. If B was previously marked with a distance greater than 8 then change it to 8. Otherwise, keep the current value.
In your setup, you have two distances from A to B, depending on how late it is. You use the second one if your current distance to A is above your time threshold.
This step becomes:

    if current distance to A is above the threshold:
        current distance to B = min(current distance to B, current distance to A + d2(A, B))
    else:
        current distance to B = min(current distance to B, current distance to A + d1(A, B))
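A runnable sketch of Dijkstra's algorithm with that modified relaxation, using a single global time threshold as described; note that whether this still yields optimal paths depends on the cost structure, since the standard optimality argument assumes static edge costs:

    import heapq

    def dijkstra_two_costs(graph, source, threshold):
        """graph[u] is a list of (v, d1, d2): d1 is the "early" cost of edge (u, v),
        d2 the "late" cost used once the distance to u exceeds `threshold`."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d_u, u = heapq.heappop(heap)
            if d_u > dist.get(u, float("inf")):
                continue                              # stale heap entry
            for v, d1, d2 in graph[u]:
                edge = d2 if d_u > threshold else d1  # pick the cost by arrival time at u
                alt = d_u + edge
                if alt < dist.get(v, float("inf")):
                    dist[v] = alt
                    heapq.heappush(heap, (alt, v))
        return dist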
I have many points (latitudes and longitudes) on a plane (a city) and I want to find two clusters. Cluster 1 consists of points clustered close together and Cluster 2 is everything else.
I know the definition of the problem is not exact. The only thing defined is that I need exactly 2 clusters. Out of N points, how many end up in cluster 1 or cluster 2 is not defined.
The main aim is to identify points which are very close to each other and separate them from the rest (which are more evenly spread out).
The best I can think of is the following algorithm:
1. For each point, calculate the sum of the squared distances to all other points.
2. Run k-means with k=2 on these squared distances.
The squaring (or maybe even higher order) of the distance should help by raising the dimensionality. However this algorithm will be biased towards points near the center of the city. It will struggle to find clusters at the edges of the city.
Any suggestions on how to avoid this problem? And any other suggestions to improve this algorithm?
I'd suggest something along the following lines:
key concept
count number of neighbouring points at distance less than a given value.
semiformal description
for each point P, count the number nc(P) of neighbouring points at distance less than a given value d_cutoff.
cluster all points P_i with nc(P_i) greater than a threshold thres_count into cluster #1.
for each P_i in cluster #1 add its close neighbours, i.e. points Q with d(Q, P_i) < d_cutoff, to the very same cluster #1.
set cluster #2 to the complement of cluster #1.
algorithmic angle
build an undirected graph G=(V, E) with your points being the vertex set V and an edge between every pair of points at a distance less than d_cutoff from each other.
delete all edges e=(v,w) from the graph where deg(v) < thres_count and deg(w) < thres_count.
G's isolated vertices form cluster #2, the complement is cluster #1.
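A Python sketch of that graph formulation with brute-force O(n^2) neighbour counting; the planar Euclidean distance on (x, y) coordinates and the parameter names follow the description above:

    from math import hypot

    def two_clusters(points, d_cutoff, thres_count):
        """points: list of (x, y). Returns (cluster1, cluster2) as sets of point indices."""
        n = len(points)

        def dist(i, j):
            return hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])

        # neighbours[i]: points closer than d_cutoff to point i.
        neighbours = {i: {j for j in range(n) if j != i and dist(i, j) < d_cutoff}
                      for i in range(n)}
        dense = {i for i in range(n) if len(neighbours[i]) > thres_count}

        # Cluster #1: dense points plus their close neighbours; cluster #2: the rest.
        cluster1 = set(dense)
        for i in dense:
            cluster1 |= neighbours[i]
        return cluster1, set(range(n)) - cluster1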
heuristic on how to choose d_cutoff
build a minimum spanning tree (mst) of your point set. the frequency distribution of edge lengths should hint at suitable cutoff values. short pairwise distances will be incorporated into the mst first. thus there should be at least one pronounced gap in the ordered sequence of edge lengths for point sets with a natural clustering. so partition the set of mst edge lengths into a small number of adjacent intervals, ordering these intervals in the natural way. count how many actual distance values fall into each interval. consider the map between an interval's ordinal number and its count of distance values. large deltas between function values for successive arguments suggest taking the upper bound of distances in the lower interval as d_cutoff.
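One possible reading of that heuristic in Python, using SciPy's minimum spanning tree and a simple histogram of edge lengths; the number of bins and the "largest drop" criterion are my own choices:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def suggest_cutoff(points, n_bins=20):
        """Histogram the MST edge lengths and return the upper edge of the bin just
        before the largest drop in counts, as a candidate d_cutoff."""
        dmat = squareform(pdist(np.asarray(points)))
        mst = minimum_spanning_tree(dmat)
        lengths = np.sort(mst.data)                   # the n-1 MST edge lengths
        counts, bin_edges = np.histogram(lengths, bins=n_bins)
        drops = counts[:-1] - counts[1:]              # delta between successive intervals
        i = int(np.argmax(drops))                     # largest drop
        return bin_edges[i + 1]                       # upper bound of the lower interval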
Since points in cluster 1 are close to each other, I think a density-based clustering algorithm may help. You may try the OPTICS algorithm, which is similar to DBSCAN but is aware of varying density and the number of clusters can be specified by the user.
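For example, with scikit-learn's OPTICS implementation (the min_samples value, the placeholder data, and the use of plain Euclidean distance on latitude/longitude are simplifications):

    import numpy as np
    from sklearn.cluster import OPTICS

    points = np.random.rand(200, 2)            # placeholder (lat, lon) data

    # min_samples controls how dense a neighbourhood must be to seed a cluster;
    # points labelled -1 are treated as noise, i.e. cluster 2.
    labels = OPTICS(min_samples=10).fit_predict(points)

    cluster1 = points[labels != -1]             # points in some dense cluster
    cluster2 = points[labels == -1]             # everything else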