I have a city area (think of it as a graph of streets), where every street has a weight and a length associated with it. What I want to do is find a connected set of streets, located near each other, with maximum (or close to maximum) total weight W, given that my subgraph can only contain up to N streets.
I'm specifically not interested in a subgraph that would span the entire graph, but rather only a small cluster of streets that has maximum or close to maximum combined weight and where all streets are located "near" each other, "near" being defined as no street being more than X meters away from the center of the cluster. The resulting subgraph would have to be connected.
Does anyone know the name for this problem, assuming one exists?
I'm also interested in any solutions, exact or approximate.
To show this visually, assume my graph is all the street segments (intersection to intersection) in the image below. So an individual street is not Avenue A; it's Avenue A between 10th and 11th, and so on. Each street will have a weight of either 1 or 0. Assume that the set of streets with maximum weight lies in the selected polygon - what I want to do is find this polygon.
Here's a proposal. Consider each vertex in the graph as a candidate "center" as you've defined it. For each center C[i], execute Dijkstra's algorithm to construct a shortest path tree with C[i] as the origin. Stop constructing the tree when it would include a vertex more than the max allowed distance from the center.
Then let A[i] be the set of all edges incident to vertices in the tree centered on C[i]. The result is the set A[i] with maximum total weight.
The run time of one execution of Dijkstra's algorithm is O(|E[i]| + |V[i]| log |V[i]|) for the i-th center, where the sets E[i] and V[i] are limited in size by the max distance from the center. The total cost is sum_(i in 1..|V|) O(|E[i]| + |V[i]| log |V[i]|). In the degenerate case where the max allowable distance lets the whole graph be included from each center, the cost is O(|V| (|E| + |V| log |V|)).
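A minimal sketch of this approach in Python, assuming (my representation, purely for illustration) the graph is an adjacency dict mapping each intersection to a list of (neighbor, segment length, segment weight) tuples:

```python
import heapq

def vertices_within(adjacency, center, max_dist):
    """Dijkstra from `center`, ignoring vertices farther than max_dist."""
    dist = {center: 0.0}
    queue = [(0.0, center)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length, _weight in adjacency[node]:
            nd = d + length
            if nd <= max_dist and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return set(dist)

def best_center(adjacency, max_dist):
    """Try every vertex as the center; keep the one whose incident
    street set A[i] has the largest total weight."""
    best_total, best = float("-inf"), None
    for center in adjacency:
        reachable = vertices_within(adjacency, center, max_dist)
        edges = {}  # key each street by its endpoint pair so it is counted once
        for u in reachable:
            for v, _length, weight in adjacency[u]:
                edges[frozenset((u, v))] = weight
        total = sum(edges.values())
        if total > best_total:
            best_total, best = total, center
    return best, best_total
```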
I can think of some possible optimizations to improve run time, but want to verify this gets at the problem you have in mind.
Here is an exact integer programming formulation for the problem, assuming you have a finite number of streets, S, and that the "center" of a cluster can be one of those S streets. If you are looking for a cluster center in continuous Euclidean space, that is going to take us into the domain of the Weber Problem. That may still be doable, but we would have to look at a column-generation formulation.
Objective function maximizes the weight of selected streets indexed by j. Constraint (1) specifies that exactly one center be chosen. Constraint (2) specifies that for any potential center, i, only N streets are chosen as neighbors. Constraint (3) stipulates that a street is chosen as part of some neighborhood only if the corresponding center is chosen. The rest are binary integer constraints.
If the street chosen as the center counts as one of the N streets, this is easy to enforce by specifying y_{ii} = x_i.
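For concreteness, here is one possible reading of that formulation (my reconstruction from the description, not necessarily the original; w_j is the weight of street j, N_i is the set of streets within the allowed distance of candidate center i, x_i = 1 if street i is chosen as the center, and y_{ij} = 1 if street j is chosen as part of i's neighborhood):

maximize    sum_i sum_{j in N_i} w_j y_{ij}
subject to  (1) sum_i x_i = 1
            (2) sum_{j in N_i} y_{ij} <= N    for each potential center i
            (3) y_{ij} <= x_i                 for each i and each j in N_i
            x_i, y_{ij} in {0, 1}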
Note: If the above formulation is right, or captures the problem accurately, the MIP can be solved trivially once the set N_i has been identified.
Consider each i in turn. From N_i pick the top N neighboring streets in descending order of weight. This is your incumbent solution. Update the incumbent whenever a better solution is found as you iterate through the i's.
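A short sketch of that enumeration, assuming (my assumptions, for illustration) a dict `neighborhoods` mapping each candidate center to its set N_i and a dict `weight` giving each street's weight:

```python
def best_cluster(neighborhoods, weight, N):
    """Enumerate candidate centers; for each, keep its N heaviest neighbors."""
    best_center, best_streets, best_total = None, [], float("-inf")
    for center, nearby in neighborhoods.items():
        chosen = sorted(nearby, key=lambda s: weight[s], reverse=True)[:N]
        total = sum(weight[s] for s in chosen)
        if total > best_total:
            best_center, best_streets, best_total = center, chosen, total
    return best_center, best_streets, best_total
```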
I am trying to develop an algorithm to solve a problem that I am not able to classify; let me describe it:
You have a map divided into sections that have a certain area and where a certain number of people live.
The problem consists of finding sets of connected sections whose total area does not exceed a certain value, maximizing the number of selected inhabitants.
For now I can think of two approaches:
1. Treat the problem as an all-pairs shortest paths problem in an undirected graph with positive natural values, where solutions that do not meet the constraint on the maximum selected area are discarded. For this you could use the Floyd-Warshall algorithm, Dijkstra for all pairs, or Thorup's algorithm (which can be done in time V * E, where V and E are the vertices and edges of the graph).
2. Treat it as an open vehicle routing problem with profits, where each vehicle can start and end wherever it wants (open vehicle routing problem with profits, or OVRPP).
Another approach
Also, depending on the combinatorics of the particular problem, it is possible in certain cases to use genetic algorithms together with tabu search, but this is only for cases where finding an optimal solution is intractable.
To be clearer, what is sought is to obtain a selection of connected sections whose sum of areas does not exceed a total area. The parameter to maximize is the sum of populations of the selected sections. The objective is to find an optimal solution.
For example, this is the optimal selection with max area of 6 (red color area)
Thank you all in advance!
One pragmatic approach would be to formulate this as an instance of integer linear programming and use an off-the-shelf ILP solver. One way to formulate this as an ILP instance is to build a graph with one vertex per section and an edge between each pair of adjacent sections; then, selecting a connected set of sections is equivalent to selecting a spanning tree for that set.
So, let x_v be a set of zero-or-one variables, one for each vertex v, and let y_{u,v} be another set of zero-or-one variables, one per edge (u,v). The intended meaning is that x_v=1 means that v is one of the selected sections; and that y_{u,v}=1 if and only if x_u=x_v=1, which can be enforced by y_{u,v} >= x_u + x_v - 1, y_{u,v} <= x_u, y_{u,v} <= x_v. Also add a constraint that the number of y's that are 1 is one less than the number of x's that are 1 (so that the y's form a tree): sum_v x_v = 1 + sum_{u,v} y_{u,v}. Finally, you have a constraint that the total area does not exceed the maximum: sum_v A_v x_v <= maxarea, where A_v is the area of section v.
Your goal is then to maximize sum_v P_v x_v, where P_v is the population of section v. The solution to this integer linear program gives the optimal solution to your problem.
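If it helps, here is a minimal sketch of the above formulation using the PuLP library; the toy data (`sections`, `adjacency`, `area`, `population`, `max_area`) is made up purely for illustration:

```python
import pulp

sections = ["s1", "s2", "s3", "s4"]
adjacency = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]   # pairs of adjacent sections
area = {"s1": 2, "s2": 3, "s3": 1, "s4": 4}
population = {"s1": 10, "s2": 40, "s3": 25, "s4": 30}
max_area = 6

prob = pulp.LpProblem("max_population_sections", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", sections, cat="Binary")   # section selected?
y = pulp.LpVariable.dicts("y", adjacency, cat="Binary")  # edge selected?

# Objective: maximize total population of selected sections.
prob += pulp.lpSum(population[v] * x[v] for v in sections)

# y_{u,v} = 1 exactly when both endpoints are selected.
for (u, v) in adjacency:
    prob += y[(u, v)] >= x[u] + x[v] - 1
    prob += y[(u, v)] <= x[u]
    prob += y[(u, v)] <= x[v]

# Selected edges form a tree over the selected sections.
prob += pulp.lpSum(x[v] for v in sections) == 1 + pulp.lpSum(y[e] for e in adjacency)

# Total area constraint.
prob += pulp.lpSum(area[v] * x[v] for v in sections) <= max_area

prob.solve()
print([v for v in sections if x[v].value() == 1])
```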
I am learning the A* algorithm and Dijkstra's algorithm, and found out that the only difference is the heuristic value used by A*. But how can I get these heuristic values for my graph? I found an example graph for the A* algorithm (from A to J). Can you help me understand how these heuristic values are calculated?
The RED numbers denote the heuristic values.
My current problem is creating a maze escape.
In order to get a heuristic that estimates (lower bounds) the minimum path cost between two nodes there are two possibilities (that I know of):
Knowledge about the underlying space the graph is part of
As an example, assume the nodes are points on a plane (with x and y coordinates) and the cost of each edge is the Euclidean distance between the corresponding nodes. In this case you can estimate (lower bound) the path cost from node U to node V by calculating the Euclidean distance between U.position and V.position.
Another example would be a road network where you know it lies on the earth's surface. The cost on the edges might represent travel times in minutes. In order to estimate the path cost from node U to node V, you can calculate the great-circle distance between the two and divide it by the maximum travel speed possible.
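A small sketch of both heuristics (the coordinate formats, the example max speed, and the earth radius constant are my own assumptions for illustration):

```python
import math

def euclidean_heuristic(u_pos, v_pos):
    """Lower bound on path cost when edge costs are Euclidean distances."""
    return math.dist(u_pos, v_pos)

def travel_time_heuristic(u_latlon, v_latlon, max_speed_km_per_min=2.0):
    """Lower bound on travel time (minutes): great-circle distance
    divided by the maximum speed anywhere in the network."""
    lat1, lon1 = map(math.radians, u_latlon)
    lat2, lon2 = map(math.radians, v_latlon)
    # Haversine formula for the great-circle distance in kilometres.
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    dist_km = 2 * 6371.0 * math.asin(math.sqrt(a))
    return dist_km / max_speed_km_per_min
```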
Graph Embedding
Another possibility is to embed your graph in a space where you can estimate the path distance between two nodes efficiently. This approach does not make any assumptions on the underlying space but requires precomputation.
For example you could define a landmark L in your graph. Then you precalculate the distance between each node of the graph and your landmark and save this distance at the node. In order to estimate the path distance during A* search you can now use the precalculated distances as follows: the path distance between nodes U and V is lower bounded by |dist(U, L) - dist(V, L)|. You can improve this heuristic by using more than one landmark.
For your graph you could use node A and node H as landmarks, which will give you the graph embedding as shown in the image below. You would have to precompute the shortest paths between the nodes A and H and all other nodes beforehand in order to compute this embedding. When you want to estimate for example the distance between two nodes B and J you can compute the distance in each of the two dimensions and use the maximum of the two distances as estimation. This corresponds to the L-infinity norm.
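A tiny sketch of that estimate, assuming a hypothetical dict `dist_to_landmark` holding the precomputed shortest-path distance from each landmark to each node:

```python
def landmark_heuristic(u, v, landmarks, dist_to_landmark):
    """Admissible estimate of dist(u, v): by the triangle inequality each
    |dist(u, L) - dist(v, L)| is a lower bound, so their maximum is too."""
    return max(abs(dist_to_landmark[(l, u)] - dist_to_landmark[(l, v)])
               for l in landmarks)
```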
The heuristic is an estimate of the additional distance you would have to traverse to get to your destination.
It is problem specific and appears in different forms for different problems. For your graph, a good heuristic could be the actual distance from the node to the destination, measured with an inch tape or centimeter scale. Funny, right? But that's exactly how my college professor did it. He took an inch tape to the blackboard and came up with a very good heuristic.
So h(A) could be 10 units, meaning the length physically measured with a ruler from A to J.
Of course, for your algorithm to work the heuristic must be admissible; if not, it could give you a wrong answer.
Background: You have a map stored as an undirected graph. Edges represent streets or highways. There are two kinds of edges - green and red - and both kinds have weights. The weight of an edge is the distance that edge represents. Red edges represent toll roads: if you cross a red edge you pay as many cents as the weight of the edge. For example, the red edge (r, 20) is 20 miles long and costs you 20 cents. All the green edges are free.
Problem: Write an algorithm to find the cheapest/shortest path from city S to city D. Cost has higher priority than distance. For example a free 500-mile long path is better than a 300-mile long path that costs 70 cents!
The problem is exactly the shortest path problem if all edges are green. Also, if there is a connected path from S to D with all red edges removed, then it is the shortest path problem. What if S and D are not connected with all red edges removed? Then maybe insert the cheapest red edge; if S and D are connected with it, then again the problem becomes the shortest path problem. So now I have found three simple cases:
1. All-green graph.
2. Graph with a path from S to D when all red edges are removed.
3. Graph with a path from S to D when the cheapest red edge is inserted.
After this point it gets a little difficult, or does it?
Cost has higher priority than distance. For example a free 500-mile long path is better than a 300-mile long path that costs 70 cents!
Run Dijkstra's algorithm using a min-priority queue where the routes explored are prioritised by, firstly, lowest cost and then, secondly, lowest distance.
Since all-green routes will have zero-cost then these will be explored first looking for the shortest distance route; if there are no zero-cost routes then, and only then, will red routes be explored in priority order of lowest cost.
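A sketch of this, assuming (my representation) the graph is a dict mapping each node to a list of (neighbor, distance, cost) tuples, with cost 0 on green edges:

```python
import heapq

def cheapest_then_shortest(graph, source, dest):
    """Dijkstra with a lexicographic (cost, distance) priority."""
    best = {source: (0, 0)}                 # node -> (cost, distance)
    queue = [(0, 0, source)]                # (cost, distance, node)
    while queue:
        cost, dist, node = heapq.heappop(queue)
        if node == dest:
            return cost, dist
        if (cost, dist) > best.get(node, (float("inf"), float("inf"))):
            continue                        # stale queue entry
        for neighbor, edge_dist, edge_cost in graph[node]:
            cand = (cost + edge_cost, dist + edge_dist)
            if cand < best.get(neighbor, (float("inf"), float("inf"))):
                best[neighbor] = cand
                heapq.heappush(queue, (cand[0], cand[1], neighbor))
    return None                             # no path from source to dest
```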
The problem is more about how you handle the data than about the algorithm. Let me propose this. Say the sum of the lengths of all the paths is less than 1'000'000. Then I can encode the cost and length as cost * 1'000'000 + length. Now you have a simple graph where you only have one metric to optimize.
I think it's clear that the most important thing the search is going to optimize is cost (since we multiplied it by 1'000'000) and that for paths of equal cost the length is going to be the deciding factor.
To actually implement this you don't need the assumption or the conversion I just did. You can keep cost and length everywhere and simply modify your comparisons to first compare the cost and only when it's a tie to compare the lengths.
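A tiny illustration of the two equivalent orderings (assuming every length is below 1'000'000):

```python
# Both keys order paths by cost first, then by length.
def combined_key(cost, length):
    return cost * 1_000_000 + length       # requires length < 1_000_000

def tuple_key(cost, length):
    return (cost, length)                  # Python compares tuples lexicographically

# The free 500-mile path beats the 70-cent 300-mile path under either key.
assert (combined_key(0, 500) < combined_key(70, 300)) == (tuple_key(0, 500) < tuple_key(70, 300))
```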
Just run a slightly modified Dijkstra on this graph, setting the weight of each edge to its price. You find all the lowest-cost solutions with it and then you choose the shortest one.
The only real problem is that you can have edges with cost 0, but it can be solved: if you are in a situation where you could open multiple nodes (they would all have the same price), select the one with the lowest distance.
The only difference from classic Dijkstra:
In Dijkstra, in each step you open the node with the lowest cost. If there are multiple nodes with the lowest cost, you just choose any of them (it really does not matter).
In this case, if you have multiple nodes with the lowest cost, you choose the one with the lowest distance. If there are multiple nodes with the lowest cost and the lowest distance, you can open any of them (it really does not matter).
This solves every case you mentioned.
Say I have a series of several thousand nodes. For each pair of nodes I have a distance metric. This distance metric could be a physical distance (say, x,y coordinates for every node) or other things that make nodes similar.
Each node can connect to up to N other nodes, where N is small - say 6.
How can I construct a graph that is fully connected (i.e. I can travel between any two nodes following graph edges) while minimizing the total distance between all graph nodes?
That is, I don't want a graph where the total distance for any traversal is minimized, but one where, for any node, the total distance of all the links from that node is minimized.
I don't need an absolute minimum - as I think that is likely NP-complete - but a relatively efficient method of getting a graph that is close to the true absolute minimum.
I'd suggest a greedy heuristic where you select edges until all vertices have 6 neighbors. For example, start with a minimum spanning tree. Then, for some random pairs of vertices, find a shortest path between them that uses at most one of the unselected edges (using Dijkstra's algorithm on two copies of the graph with the selected edges, connected by the unselected edges). Then select the edge that yielded the largest total decrease in distance.
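A simplified sketch along these lines: it starts from a minimum spanning tree but then just greedily adds the shortest remaining edges under the degree cap, rather than the shortest-path-based selection described above. The point format and the use of networkx are my assumptions.

```python
import itertools
import math
import networkx as nx

def build_sparse_graph(points, max_degree=6):
    """points: dict mapping node -> (x, y).  Returns a connected graph whose
    added edges respect the degree cap (assumes the MST itself does not
    already exceed it)."""
    complete = nx.Graph()
    for u, v in itertools.combinations(points, 2):
        complete.add_edge(u, v, weight=math.dist(points[u], points[v]))

    graph = nx.minimum_spanning_tree(complete, weight="weight")

    # Greedily add the shortest remaining edges while both endpoints
    # still have spare degree.
    remaining = [(u, v, d["weight"]) for u, v, d in complete.edges(data=True)
                 if not graph.has_edge(u, v)]
    remaining.sort(key=lambda e: e[2])
    for u, v, w in remaining:
        if graph.degree(u) < max_degree and graph.degree(v) < max_degree:
            graph.add_edge(u, v, weight=w)
    return graph
```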
You can use a kernel to create edges only for nodes under a certain cutoff distance.
If you want non-weighted edges, you could simply use a basic cutoff to start with: add an edge between two points if d(v1, v2) < R.
You can tweak your cutoff R to get the right average number of edges between nodes.
If you want a weighted graph, the preferred kernel is often the Gaussian one, with
K(x,y) = e^(-d(x,y)^2/d_0)
with a cutoff to discard nodes with too-low values. d_0 is the parameter to tweak to get the weights that suit you best.
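A small sketch of the Gaussian-kernel weighting with a cutoff; the point format, d_0, and the cutoff radius R are parameters you would tune for your data.

```python
import itertools
import math

def gaussian_weighted_edges(points, d_0, R):
    """points: dict mapping node -> (x, y). Yields (u, v, weight) for pairs
    closer than R, weighted by the Gaussian kernel K(x,y) = exp(-d^2 / d_0)."""
    for u, v in itertools.combinations(points, 2):
        d = math.dist(points[u], points[v])
        if d < R:
            yield u, v, math.exp(-d * d / d_0)
```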
While looking for references, I found this blog post that I didn't know about, but that seems very explanatory, with many more details: http://charlesmartin14.wordpress.com/2012/10/09/spectral-clustering/
This method is used in graph-based semi-supervised machine learning tasks, for instance in image recognition, where you tag a small part of an object, and have an efficient label propagation to identify the whole object.
You can search on Google for: semi supervised learning with graph.
I have many points (latitudes and longitudes) on a plane (a city) and I want to find two clusters. Cluster 1 is points cluttered close together and Cluster 2 is everything else.
I know the definition of the problem is not exact. The only thing defined is that I need exactly 2 clusters. Out of N points, how many end up in cluster 1 or cluster 2 is not defined.
The main aim is to identify points which are very close to each other and separate them from the rest (which are more evenly spread out).
The best I can think of is the following algorithm:
1. For each point, calculate the sum of the squared distances to all other points.
2. Run k-means with k=2 on these squared distances.
The squaring (or maybe even higher order) of the distance should help by raising the dimensionality. However this algorithm will be biased towards points near the center of the city. It will struggle to find clusters at the edges of the city.
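A sketch of this baseline using scipy and scikit-learn; the array name `coords` and the shape (n, 2) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def two_clusters_by_total_squared_distance(coords):
    """coords: (n, 2) array of point coordinates. Returns a label per point."""
    sq_dists = cdist(coords, coords, metric="sqeuclidean")
    score = sq_dists.sum(axis=1).reshape(-1, 1)   # one feature per point
    return KMeans(n_clusters=2, n_init=10).fit_predict(score)
```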
Any suggestions on how to avoid this problem? And any other suggestions to improve this algorithm?
I'd suggest something along the following lines:
Key concept
Count the number of neighbouring points at distance less than a given value.
Semiformal description
Count the number nc(P) of neighbouring points at distance less than a given value d_cutoff, for each point P.
Cluster all points P_i with nc(P_i) greater than a threshold thres_count into cluster #1.
For each P_i in cluster #1, add its close neighbours, i.e. points Q with d(Q, P_i) < d_cutoff, to the very same cluster #1.
Set cluster #2 to the complement of cluster #1.
Algorithmic angle
Build an undirected graph G=(V, E) with your points as the vertex set V and an edge between every pair of points at a distance less than d_cutoff from each other.
Delete all edges e=(v,w) from the graph where deg(v) < thres_count and deg(w) < thres_count.
G's isolated vertices form cluster #2; the complement is cluster #1.
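A small sketch of this, assuming points are given as a dict label -> (x, y); degrees are taken on the original graph, before any deletion.

```python
import itertools
import math

def two_clusters(points, d_cutoff, thres_count):
    """points: dict mapping label -> (x, y).  Returns (cluster1, cluster2)."""
    # Build the graph: one edge for every pair closer than d_cutoff.
    neighbors = {p: set() for p in points}
    edges = []
    for u, v in itertools.combinations(points, 2):
        if math.dist(points[u], points[v]) < d_cutoff:
            neighbors[u].add(v)
            neighbors[v].add(u)
            edges.append((u, v))

    # Delete edges whose endpoints both have (original) degree below thres_count.
    degree = {p: len(neighbors[p]) for p in points}
    for u, v in edges:
        if degree[u] < thres_count and degree[v] < thres_count:
            neighbors[u].discard(v)
            neighbors[v].discard(u)

    # Isolated vertices form cluster #2, the rest cluster #1.
    cluster2 = {p for p in points if not neighbors[p]}
    cluster1 = set(points) - cluster2
    return cluster1, cluster2
```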
Heuristic on how to choose d_cutoff
Build a minimum spanning tree (MST) of your point set. The frequency distribution of its edge lengths should hint at suitable cutoff values: short pairwise distances will be incorporated into the MST first, so for point sets with a natural clustering there should be at least one pronounced gap in the ordered sequence of edge lengths. So partition the set of MST edge lengths into a small number of adjacent intervals, ordering these intervals in the natural way. Count how many actual distance values fall into each interval. Consider the map between an interval's ordinal number and its count of distance values. A large delta between function values for successive arguments would suggest taking the upper bound of distances in the lower interval as d_cutoff.
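A simplified sketch of this heuristic using scipy: instead of the interval bookkeeping it just takes the largest gap between consecutive sorted MST edge lengths. The (n, 2) `coords` array is an assumption; use at least three points.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def suggest_d_cutoff(coords):
    dist_matrix = squareform(pdist(coords))
    mst = minimum_spanning_tree(dist_matrix)
    edge_lengths = np.sort(mst.data)          # the n-1 MST edge lengths
    gaps = np.diff(edge_lengths)
    i = int(np.argmax(gaps))                  # most pronounced gap
    return edge_lengths[i]                    # upper bound of the lower group
```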
Since points in cluster 1 are close to each other, I think a density-based clustering algorithm may help. You may try the OPTICS algorithm, which is similar to DBSCAN but is aware of varying density and the number of clusters can be specified by the user.
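A minimal sketch using scikit-learn's OPTICS, treating the points it labels as noise as cluster 2; min_samples and the (n, 2) `coords` array are assumptions to tune for your data.

```python
import numpy as np
from sklearn.cluster import OPTICS

def dense_vs_rest(coords, min_samples=10):
    labels = OPTICS(min_samples=min_samples).fit_predict(coords)
    cluster1 = coords[labels != -1]     # points assigned to some dense cluster
    cluster2 = coords[labels == -1]     # noise points = everything else
    return cluster1, cluster2
```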