Making a fully connected graph using a distance metric - algorithm

Say I have a series of several thousand nodes. For each pair of nodes I have a distance metric. This distance metric could be a physical distance ( say x,y coordinates for every node ) or other things that make nodes similar.
Each node can connect to up to N other nodes, where N is small - say 6.
How can I construct a graph that is fully connected ( e.g. I can travel between any two nodes following graph edges ) while minimizing the total distance between all graph nodes.
That is I don't want a graph where the total distance for any traversal is minimized, but where for any node the total distance of all the links from that node is minimized.
I don't need an absolute minimum - as I think that is likely NP complete - but a relatively efficient method of getting a graph that is close to the true absolute minimum.

I'd suggest a greedy heuristic where you select edges until all vertices have 6 neighbors. For example, start with a minimum spanning tree. Then, for some random pairs of vertices, find a shortest path between them that uses at most one of the unselected edges (using Dijkstra's algorithm on two copies of the graph with the selected edges, connected by the unselected edges). Then select the edge that yielded in total the largest decrease of distance.

You can use a kernel to create edges only for nodes under a certain cutoff distance.
If you want non-weighted edges You could simply use a basic cutoff to start with. You add an edge between 2 points if d(v1,v2) < R
You can tweak your cutoff R to get the right average number of edges between nodes.
If you want a weighted graph, the preferred kernel is often the gaussian one, with
K(x,y) = e^(-d(x,y)^2/d_0)
with a cutoff to keep away nodes with too low values. d_0 is the parameter to tweak to get the weights that suits you best.
While looking for references, I found this blog post that I didn't about, but that seems very explanatory, with many more details : http://charlesmartin14.wordpress.com/2012/10/09/spectral-clustering/
This method is used in graph-based semi-supervised machine learning tasks, for instance in image recognition, where you tag a small part of an object, and have an efficient label propagation to identify the whole object.
You can search on google for : semi supervised learning with graph

Related

How to connect all connected components in shortest way

Given a N*M array of 0 and 1.
A lake is a set of cells(1) which are horizontally or vertically adjacent.
We are going to connect all lakes on map by updating some cell(0) to 1.
The task is finding the way that number of updated cells is the smallest in a given time limit.
I found this similar question: What is the minimum cost to connect all the islands?
The solution on this topic get some problem:
1) It uses lib (pulp) to solve the task
2) It take time to get output
Is there an optimization solution for this problem
Thank you in advance
I think this is a tricky question but if you really draw it out and look at this matrix as a graph it makes it simpler.
Consider each cell as a node and each connection to its top/right/bottom/left to be an edge.
Start by connection the edges of the lakes to the nearby vertices. Keep doing the same for each each and only connect two vertices if it doesn't create a cycle.
At this stage carry out the same process for the immediate neighbours of the lakes. Keep doing the same and break if its creating cycles.
After this you should have a connected tree.
Once you have a connected tree you can find all the articulation point (Cut vertex) of the tree. (A vertex in an undirected connected graph is an articulation point (or cut vertex) iff removing it (and edges through it) disconnects the graph. Articulation points represent vulnerabilities in a connected network – single points whose failure would split the network into 2 or more disconnected components)
The number of cut vertex in the tree (excluding the initial lakes) would be the smallest number of cells that you need to change.
You can search there are many efficient ways to find cut vertex of a graph.
Finding articulation points takes O(V+E)
Preprocessing takes O(V+E) as it somewhat similar to a BFS.
Don't know whether you are still interested but I have an idea. What about a min cost flow algorithm.
Assume you have an m*n 2-d input array and i Islands. Create a graph where each position in the 2-d array is a node and has 4 edges to each neighbour. Each edge will be assigned a cost later on. Each edge has minimum capacity 0 and maximum capacity infinit.
Choose a random island to be the source. Create an extra node target and connect it to all other islands except the source with each new edge having maximum and minimum flow capacity 1 and cost 0.
Now assign the old edges costs, so that an edge connecting two island nodes costs nothing, but an edge between and island and a water node or an edge between two water nodes costs 1.
Calculate min cost flow over this graph. The initial graph generating can be done in nm and the min cost flow algorithm in (nm) ^3

Heuristic value in A* algorithm

I am learning A* algorithm and dijkstra algorithm. And found out the only difference is the Heuristic value it used by A* algorithm. But how can I get these Heuristic value in my graph?. I found a example graph for A* Algorithm(From A to J). Can you guys help me how these Heuristic value are calculated.
The RED numbers denotes Heuristic value.
My current problem is in creating maze escape.
In order to get a heuristic that estimates (lower bounds) the minimum path cost between two nodes there are two possibilities (that I know of):
Knowledge about the underlying space the graph is part of
As an example assume the nodes are points on a plane (with x and y coordinate) and the cost of each edge is the euclidean distance between the corresponding nodes. In this case you can estimate (lower bound) the path cost from node U to node V by calculating the euclidean distance between U.position and V.position.
Another example would be a road network where you know its lying on the earth surface. The cost on the edges might represent travel times in minutes. In order to estimate the path cost from node U to node V you can calculate the great-circle distance between the two and divide it by the maximum travel speed possible.
Graph Embedding
Another possibility is to embed your graph in a space where you can estimate the path distance between two nodes efficiently. This approach does not make any assumptions on the underlying space but requires precomputation.
For example you could define a landmark L in your graph. Then you precalculate the distance between each node of the graph to your landmark and safe this distance at the node. In order to estimate the path distance during A* search you can now use the precalculated distances as follows: The path distance between node U and V is lower bounded by |dist(U, L) - dist(V,L)|.You can improve this heuristic by using more than one landmark.
For your graph you could use node A and node H as landmarks, which will give you the graph embedding as shown in the image below. You would have to precompute the shortest paths between the nodes A and H and all other nodes beforehand in order to compute this embedding. When you want to estimate for example the distance between two nodes B and J you can compute the distance in each of the two dimensions and use the maximum of the two distances as estimation. This corresponds to the L-infinity norm.
The heuristic is an estimate of the additional distance you would have to traverse to get to your destination.
It is problem specific and appears in different forms for different problems. For your graph , a good heuristic could be: the actual distance from the node to destination, measured by an inch tape or centimeter scale. Funny right but thats exactly how my college professor did it. He took an inch tape on black board and came up with very good heuristic.
So h(A) could be 10 units means the length measured by a measuring scale physically from A to J.
Of course for your algorithm to work the heuristic must be admissible, if not it could give you wrong answer.

Clustering nodes on a graph

Say I have a weighted, undirected graph with X vertices. I'm looking separate these nodes into clusters, based on the weight of an edge between each connected vertex (lower weight = closer together).
I was hoping I could use an algorithm like K means clustering to achieve this, but it seems that K means requires data in at least a 2D space, while I only have the weights of each edge to go from.
Is there a way to cluster nodes of a weighted graph together? I don't have a preference as to whether I need to specify the number of clusters or not.
I suspect I'm missing something relatively simple here, but it is rather late.
I've considered whether I just need to traverse the graph, and for each node find its Y closest immediate neighbours, but that seems too simple.
Edit: Apologies: to be more specific: the graph isn't too large (around 150 vertices, maximum), and is not complete. I'm using Python.

Graph Theory that how to find shortest path from source A to Destination B as explained below

Background: You have a map stored as a undirected graph. Edges present streets or highways. There are two kinds of edges - green and red- and both kind of edges have weights. The weight of an edge is the distance that edge represents. Red edges represent represent toll roads, if you cross a red edge you pay as many cents as the weight of the edge. For example the red edge (r, 20) is 20 miles long and costs you 20 cents. All the green edges are free.
Problem: Write an algorithm to find the cheapest/shortest path from city S to city D. Cost has higher priority than distance. For example a free 500-mile long path is better than a 300-mile long path that costs 70 cents!
he problem is exactly the shortest path problem if all edges are green. Also if there is a connected path from S to V with all red edges removed, then it is the shortest path problem. What if S and V are not connected with all red edges removed? Then maybe insert cheapest red edge, if S and V are connected with it then again the problem becomes the shortest path problem. So now I found three simple cases:
1. All green graph.
2. Graph with a path from S to V when all red edges are connected.
3. Graph with a path from S to V when cheapest red edge is connected.
After this point it get a little difficult, or does it???
Cost has higher priority than distance. For example a free 500-mile long path is better than a 300-mile long path that costs 70 cents!
Run Djikstra's algorithm using a min-priority queue where the routes explored are prioritised by, firstly, lowest cost then, secondly, lowest distance.
Since all-green routes will have zero-cost then these will be explored first looking for the shortest distance route; if there are no zero-cost routes then, and only then, will red routes be explored in priority order of lowest cost.
The problem is more about how you handle the data than the algorithm. Let me propose this. Say the sum of the length of all the paths is less than 1'000'000. Then I can encode the cost and length as cost * 1'000'000 + length. Now you have a simple graph where you only have one metric to optimize.
I think it's clear that the most important thing the search is going to optimize is cost (since we multiplied it by 1'000'000) and that for paths of equal cost the length is going to be the deciding factor.
To actually implement this you don't need the assumption or to conversion I just did. You can keep cost and length everywhere and simply modify your comparisons to first compare the cost and only when it's a tie to compare the lengths.
Just run a "little modified" Dijkstra on this graph with setting weight of each edge to its price. You find all low-cost solutions with it and then you choose the shortest one.
The only real problem is that you can have edges with cost 0, but it can be solved - if you are in situation where you can open multiple nodes (they would have all same price), select the one with the lowest distance.
The only difference to classic Dijkstra.
In dijkstra in each step you open the node with lowest cost. If there are multiple nodes with lowest cost, you just choose any of them (it really does not matter).
In this case, if you have multiple nodes with lowest cost, you choose the one with lowest distance. If there are multiple nodes with lowest cost and lowest distance, you can open any of them (it really does not matter).
This solves every case you mentioned.

Smallest path in graph theory (social network analysis)

This is the scenario:
There is an undirected graph with n nodes and e edges, all nodes are connected.
The question in the scenario:
Every node can be considered as a person in a social network that shares or reads a content. It means that if A is connected to B, C and D, if A shares a content with the network, it will reach directly BCD. It means that to reach all the nodes in the network, it's just necessary that they are adjacent to a node which shared the content.
Q1: is there a way to find the best starting point to reach the entire network?
Q2: is there a way to find a smallest path from that point?
I've already looked at salesman problem and prim'algorithm.
Thanks!
The wikipedia page on Centrality describes several different forms of centrality in a graph, and has links to algorithms for some of them.
Raising the adjacency matrix of the network to the nth power gives you the number of walks of length n between two verticies i,j (represented by the ij-th element of the matrix). The first non zero value of x(i,j) will tell you how far apart they are with respect to walks. If you're looking for the best node to reach the whole network, then you could just look for the first instance of a row (or column) of the matrix which has all non zero values whilst increasing n.
Obviously this isn't practical with huge networks...
Otherwise you could apply Dijkstra's algorithm.
Closeness Centrality is a ranking of each individual node and can be thought of as a measure of how "close a node is to the center of a network". So a node with a high closeness centrality value is positioned in the network such that it takes this node a shorter number of hopes (on average) to reach all other nodes in the network. So for Q1 above, the node(s) with the highest closeness could be interpreted to be in the best position to reach all other nodes with a minimum number of hops between nodes on the way. For Q2, the "smallest path" can be considered the smallest average path to all nodes in the network.

Resources