Removing cycles in weighted directed graph - algorithm

This is a follow-up question on my other posts.
Algorithm for clustering with size constraints
I'm working on a clustering algorithm,
After some reclustering, now I have this set of points that none of them are in their optimal cluster but could not be reassigned individually, since it'll violate the constraint.
I'm trying to use a graph structure to solve the problem but came across a few issues in implementing.
I'm a beginner, please let me know if I'm wrong.
Per #Kittsil's answer
build a directed graph with clusters as nodes, such that an edge (A,B) exists if the global solution would be minimized by some point in A moving to B. Looking for cycles in this graph will allow you to find potential moves (where a move consists of moving every vertex in the cycle).
I revised the graph adding weight as the sum of number of points moving from A to B.
Here are some scenarios where I'm not sure how to decide on which point to reassign.
Scenario 1. For a cycle as below. There are two points can be moved from A to B, and three from B to C. Under this situation, which point should I select to reassign?
Scenario 2. For a cycle as below. Let all edge weights be 1. For the point in cluster A, it could be reassign to either cluster B or D. In this case. How should I do the reassignment?
Scenario 3 Similar to Scenario 2. Let all edge weights be 1. There are two small cycles in the bigger cycle. The point in cluster A can be reassigned to both B and E, how do I decide on the reassignment in this case?
Ideas about either scenario is welcomed!
Also any suggestions on implementing the algorithm considering the cases above? Even better with pseudocode. Thanks!

If the edge weights are proportional to the benefit gained by reclustering the points, then a decent heuristic is to pick the cycle with the highest total weight. I think this addresses all three of your cases: whenever you have a choice, pick the one that will do the most good.
Discussion:
The algorithm described in Algorithm for clustering with size constraints is a greedy algorithm to find a local minimum. Therefore, there is no guarantee that you will find the best solution no matter how you choose in these scenarios, and any choice will drive you closer to the local minimum.
Due to the relation to similar problems of sorting with constraints, it is my intuition that the minimum problem is going to be NP-hard. If that is not the case, then there exists a much better algorithm than the one I previously described. However, this greedy algorithm seems like a fast solution for the minimal problem.

Related

Create N-Clusters out of Min spanning tree?

Let say I created a Minimum Spanning Tree out of Graph with M nodes. Is there an algorithm to create N number of clusters.
I'm looking to cut some of the links such as that I end up with N clusters and label them i.e. given a node X I can query in which cluster it belongs.
What I think is once I have the MST, I cut the top/max M-N edges of the MST and I will get N clusters ?
Is my logic correct ?
That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
A tree is a (very sparse) subset of edges of a graph, if you cut based on them you are not taking into consideration a (possible) vast majority of edges in your graph.
Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy with a graph with very similar weight edges will produce likely terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in coments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate sparse matrix/graph representation while implementing something like spectral clustering which is probably going to give much more robust results than single-linkage clustering
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.

Rate node connectedness in a routing graph

I have a directed, weighted routing graph (ca. 10^5 edges, 4 edges per node, lots of circles).
Each edge has a cost associated with it. How can I rate the "connectedness" of each node? It should be a measure of how cheap it is to reach other nodes from this one.
How does everything change if every node gets a reliability factor (the probability that the chosen path containing that node will fail and a new one must be found)?
Thanks for your help
I believe that the problem you've put forth in many terms matches the use case of the PageRank algorithm.
I won't discuss how the algorithm works in general since there are a lot many blogs/videos available online which already explains it in great detail. One of my personal favourite short video on the same is this.
Now lets see how does the algorithm fits for your use case. Let's define connectedness of a node x as C(x). We can rephrase your given statement "how cheap it is to reach other nodes from this node" to "how likely are we to end up on the given node in a random walk across the graph such that we are biased to take edges whose costs are less".
The statement to a large extent relates to the ideology behind the PageRank algorithm. We just need to consider how to include the edge cost for our working.
The original PageRank algorithm uniformly divides the page rank of a given node to all it's adjacent node (denoted as PR(y) / OUT(y) in formula). We on the other hand needs to be more biased towards edges with lower cost for which I'll recommend modifying the formula to,
(SUM-EDGES-COST(y) - EDGE-COST(x, y)) * (C(y) / SUM-EDGES-COST(y))
instead of the traditional C(x) / OUT(x). We take the difference (SUM-EDGES-COST(y) - EDGE-COST(x, y)) since in our scenario lower edge cost means more connectedness. Another possibility is to apply softmax function to the edge cost for each node as a normalisation strategy.
As to answer the part about having a reliability factor, given by R(x) for a node x, we can just multiply it directly to C(x) in the formula.
To wrap things up,
should match your given scenario.
What I've presented here is just one possibility which I can think from the top of my mind and it's highly likely that it just might not work out. All I can hope is that it helps you out in some or the other way. Cheers! :)

Minimal spanning tree with K extra node

Assume we're given a graph on a 2D-plane with n nodes and edge between each pair of nodes, having a weight equal to a euclidean distance. The initial problem is to find MST of this graph and it's quite clear how to solve that using Prim's or Kruskal's algorithm.
Now let's say we have k extra nodes, which we can place in any integer point on our 2D-plane. The problem is to find locations for these nodes so as new graph has the smallest possible MST, if it is not necessary to use all of these extra nodes.
It is obviously impossible to find the exact solution (in poly-time), but the goal is to find the best approximate one (which can be found within 1 sec). Maybe you can come up with some hints of the most efficient way of going throw possible solutions, or provide with some articles, where the similar problem is covered.
It is very interesting problem which you are working on. You have many options to attack this problem. The best known heuristics in such situation are - Genetic Algorithms, Particle Swarm Optimization, Differential Evolution and many others of this kind.
What is nice for such kind of heuristics is that you can limit their execution to a certain amount of time (let say 1 second). If it was my task to do I would try first Genetic Algorithms.
You could try with a greedy algorithm, try the longest edges in the MST, potentially these could give the largest savings.
Select the longest edge, now get the potential edge from each vertex that are closed in angle to the chosen one, from each side.
from these select the best Steiner point.
Fix the MST ...
repeat until 1 sec is gone.
The challenge is what to do if one of the vertexes is itself a Steiner point.

Travelling by bus

If you have the full bus schedule for a country, how can you find the
furthest anyone can travel in one day without visiting the same stop twice?
I assume a bus schedule gives you the full list of leaving and arriving times for every bus stop.
A slow and naive method would be as follows.
You can of course make a graph from the bus schedule with multiple directed edges between bus stops. You could then do a depth first search remembering the arrival time of the edge you took to get to each node and only taking edges from that stop that leave after the one that you took to get there. If you go to a node you have been to before you would only carry on from there if the current time in your traversal is before the earliest time you had ever visited that node before. You could record the furthest you can get from each node and then you could check each node to find the furthest you can travel overall.
This seems very inefficient however and it really isn't a normal graph problem. The problem is that in a normal directed graph if you can get from A to B and from B to C then you can get from A to C. This isn't true here.
What is the fastest you can solve this problem?
I think your original algorithm is pretty good.
You can think of your approach as being a version of Dijkstra's algorithm, in attempting to find the shortest path to each node.
Note that it is best at this stage to weight edges in the graph in terms of time. The idea is to use your Dijkstra-like algorithm to compute all nodes reachable within 1 days worth of time, and then pick whichever of these nodes is furthest in space from the start point.
Implementations of Dijkstra can use a heap to retrieve the next node to explore in O(logn), and I think this would be a good enhancement to your approach as well. If you always choose the node that you can reach earliest, you never need to repeat the calculation for that node.
Overall the approach is:
For each starting point
Use a modified Dijkstra to compute all nodes reachable in 1 day
Find the furthest in space of all these nodes.
So for n starting points and e bus routes, the complexity is about O(n(n+e)log(n)) to get the optimal answer.
You should be able to get improved performance by using an appropriate heuristic in an A* search. The heuristic needs to underestimate the max distance possible from a point, so you could use the maximum speed of a bus multiplied by the remaining time.
Instead of making multiple edges for each departure from a location, you can make multiple nodes per location / time.
Create one node per location per departure time.
Create one node per location per arrival time.
Create edges to connect departures to arrivals.
Create edges to connect a given node to the node belonging to the same location at the nearest future time.
By doing this, any path you can traverse through the graph is "valid" (meaning a traveler would be able to achieve this by a combination of bus trips or choosing to sit at a location and wait for a future bus).
Sorry to say, but as described this problem has a pretty high complexity. Misread the problem originally and thought it was np-hard, but it is not. It does however have a pretty high complexity that I personally would not want to deal with. This algorithm is a pretty good approximation that give a considerable complexity savings that I personally think it worth it.
However, if all you want is an answer that is "pretty good" there are are lot of fairly efficient algorithms out there that will get close very quickly.
Personally I would suggest using a simple greedy algorithm here.
I've done this on a few (granted, small and contrived) examples and it's worked pretty well and has an nlog(n) efficiency.
Associate a velocity with each node, velocity being the fastest you can move away from a given node. In my examples this velocity was distance_travelled/(wait_time + travel_time). I used the maximum velocity of all trips leaving a node as the velocity score for that node.
From your node/time calculate the velocities of all neighboring nodes and travel to the "fastest" node.
This algorithm is pretty good for the complexity as it basically transforms the problem into a static search, but there are a couple potential pitfalls that could be adjusted for depending on your data set.
The biggest issue with this algorithm is the possibility of a really fast bus going into the middle of nowhere. You could get around that by adding a "popularity" term to the velocity calculation (make more popular stops effectively faster) but depending on your data set that could easily make things either better or worse.
The simplistic graph representation will not work. I. e. each city is a node and the edges represent time. That's because the "edge" is not always active -- it is only active at certain times of the day.
The second thing that comes to mind is Edward Tufte's Paris Train Schedule which is a different kind of graph. But that does not quite fit the problem either. With the train schedule, the stations have a sequential relationship between stations, but that's not the case in general with cities and bus schedules.
But Tufte motivates the following way to model it as a graph. You could write code only to construct the graph and use a standard graph library that includes the shortest path algorithm.
Each bus trip is an edge with weight = distance covered
Each (city, departure) and (city, arrival) is a node
All nodes for a given city are connected by zero-weight edges in a time-ordered sequence, ignoring whether it is an arrival or a departure. This subgraph will look like a chain.
(it is a directed graph)
Linear Time Solution: Note that the graph will be a directed, acyclic graph. Finding the longest path in such a graph is linear. "A longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G."
Hope this helps! If somebody can post a visualization of the graph, it would be nice. If I can do so myself, I will do 1 more edit.
Naive is the best you'll get -- http://en.wikipedia.org/wiki/Longest_path_problem
EDIT:
So the problem is two fold.
Create a list of graphs where its possible to travel from pointA to pointB. Possible is in terms of times available for busA to travel from pointA to pointB.
Find longest path from all the possible generated path above.
Another approach would be to reevaluate the graph upon each node traversal and find the longest path.
It still reduces to finding longest possible path, which is NP-Hard.

Best subsample in the Maxmin distance sense

I have a set of N points in a D-dimensional metric space. I want to select K of them in such a way that the smallest distance between any two points in the subset is the largest.
For instance, with N=4 and K=3 in 3D Euclidean space, the solution is the face of the tetrahedron having the longest short side.
Is there a classical way to achieve that ? Can it be solved exactly in polynomial time ?
I have googled as much as I could, but I have not figured out yet how to call this problem.
In my case N=50, K=10 and D=300 typically.
Clarification:
A brute force approach would be to try every combination of K points among the N and determine the closest pair in every subset. The solution is given by the subset that yields the longest pair.
Done the trivial way, an O(K^2) process, to be repeated N! / K!(N-K)! times.
Hum, 10^2 50! / 10! 40! = 1027227817000
I think you might find papers on unit disk graphs informative but discouraging. For instance, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.3113&rep=rep1&type=pdf states that the maximum independent set problem on unit disk graphs in NP-complete, even if the disk representation is known. A unit disk graph is the graph you get by placing points in the plane and forming links between every pair of points at most a unit distance apart.
So I think that if you could solve your problem in polynomial time you could run it on a unit disk graph for different values of K until you find a value at which the smallest distance between two chosen points was just over one, and I think this would be a maximum independent set on the unit disk graph, which would be solving an NP-complete problem in polynomial time.
(Just about to jump on a bicycle so this is a bit rushed, but searching for papers on unit disk graphs might at least turn up some useful search terms)
Here's an attempt to explain it piece by piece:
Here is another attempt to relate the two problems.
For maximum independent set see http://en.wikipedia.org/wiki/Maximum_independent_set#Finding_maximum_independent_sets. A decision problem version of this is "Are there K vertices in this graph such that no two are joined by an edge?" If you can solve this you can certainly find a maximum independent set by finding the largest K by asking this question for different K and then finding the K nodes by asking the question on versions of the graph with one or more nodes deleted.
I state without proof that finding the maximum independent set in a unit disk graph is NP-complete. Another reference for this is http://web.sau.edu/lilliskevinm/wirelessbib/ClarkColbournJohnson.pdf.
A decision version of your problem is "Do there exist K points with distance at least D between any two points?" Again, you can solve this in polynomial time iff you can solve your original problem in polynomial time - play around until you find the largest D that gives answer yes, and then delete points and see what happens.
A unit disk graph has an edge exactly when the distance between two points is 1 or less. So if you could solve the decision version of your original problem you could solve the decision version of the unit disk graph problem just by setting D = 1 and solving your problem.
So I think I have constructed a series of links showing that if you could solve your problem you could solve an NP-complete problem by turning it into your problem, which causes me to think that your problem is hard.

Resources