generate a random graph with uniform distribution of edges - algorithm

I'm trying to find an efficient algorithm to generate a simple connected graph with given the number of nodes. Something like:
Input:
N - size of generated graph
Output:
simple connected graph G(v,e) with N vertices and S edges, The number of edges should be uniform distribution.

You might want to create a minimal spanning tree first to ensure connectivity. Later, randomly generate two nodes (which are not yet connected) and connect them. Repeat until you have S edges.
For minimal spanning tree, the easiest thing to do is start with a random node as tree. For every remaining node (ordered randomly), connect it to any node in the tree. The way you select the node within the tree (to connect to) defines the distribution of edges/node.

It may be a bit surprising but if you choose (randomly) log(n) edges (where n - maximum number of edges) then you can be almost sure that your graph is connected (a reference). If number of edges is much lower than log(n) you can be almost sure that the graph is disconnected.
How to generate the graph:
GenerateGraph(N,S):
if (log(N)/N) > S) return null // or you can take another action
V <- {0,...,N-1}
E <- {}
possible_edges <- {{v,u}: v,u in V} // all possible edges
while (size(E) < S):
e <- get_random_edge(possible_edges)
E.add(e)
possible_edges.remove(e)
G=(V,E)
if (G is connected)
return G
else
return GenerateGraph(N,S)
Let me know if you need any more guidance.
(Btw. right now I am dealing with exactly the same problem! Let me know if you need to generate some more sophisticated graphs:-))

A very common random graph generation algorithm (used in many academic works) is based on RMat method, or a generalization in the form of Kronecker graphs. It's a simple, iterative process, that uses very few parameters and is easily scalable.
There's a really good explanation here (including why it's better than other methods) -
http://www.graphanalysis.org/SIAM-PP08/Leskovic.pdf
There are versions of both implemented in many graph benchmark suites, for e.g.
BFS benchmark with a subdir with rMat generator - http://www.cs.cmu.edu/~pbbs/benchmarks/breadthFirstSearch.tar
Kronecker graph generator (both c and matlab codes included) - http://www.graph500.org/sites/default/files/files/graph500-2.1.4.tar.bz2

Related

Compute all edge-disjoint paths between two given vertices of a simple directed graph

Many similar questions have been asked, but non addresses exactly my situation: Given two vertices in a simple directed unweighed graph, and an integer k, how can I find all k-tuples of edge-disjoint paths between the vertices? (In particular, I'm interested in the case that k is the outdegree of the start vertex.)
I know Suurballe's algorithm will give me k edge-disjoint paths, but it will (non-deterministically) settle on one solution, instead of giving me all of them.
It seems maximum-flow algorithms like Edmonds-Karp are related, but they don't compute paths.
Is there any algorithm already in JGraphT that does what I want?
Here's a simple recursive method:
if k is 0, output [] and return
otherwise, choose an arbitrary arc from the source vertex
for all paths p that start with that arc and end at the sink,
for all k-1 tuples T in the graph where the arcs in p are removed,
output [p] + T
This method can be improved by pruning parts of the recursion tree. The easiest idea is to remove all arcs to the source vertex, since if a path uses two arcs from the source vertex, we won't get to k before disconnecting the source and sink.
A broader version of that idea uses maximum flow to identify arcs that can be part of a solution. Compute the max flow from the source to the sink. By the flow decomposition theorem, there is at least one k-tuple of edge disjoint paths if and only if the value of the flow is at least k. If these conditions are true, then there is a solution that uses the set of arcs carrying flow, so this set needs to be retained. To test the others, compute the strongly connected components of the residual graph (arcs without flow appear forward, arcs with flow appear backward). All arcs without flow that go from one SCC to another can be safely discarded.

Finding size of largest connected component of a graph

Consider we have a random undirected graph G = (V,E) with n vertices, now suppose for any two vertices u and v ∈ V, the probability that the edge between u and v ∈ E is 1/n. We need to figure out the size of the largest connected component in the undirected graph C(n).
C(n) should be equal to Θ(n**a), we need to run some experiments to give an estimate of a.
I am a bit confused on how to link the probability 1/n to the largest connected component, is there any way I can do so?
The process you're simulating here is called the Erdős–Rényi model. You have a collection of n nodes, and each pair of nodes has probability p of being linked. The (expected) shape of the resulting graph depends heavily on the choice of p, and there are a lot of famous results about this.
As for how to do this: one option would be to create a collection of n nodes, iterate over all pairs of nodes, and link them with probability 1/n. You can then run an algorithm like BFS or DFS over the graph to find and size the connected components.
Another would be to use the above approach, except that instead of doing a BFS or DFS to use a disjoint-set forest to perform the links and find the largest connected component.
Alternatively, because each edge is absent or present with equal probability and independently of every other edge, the number of edges you have is binomially distributed and pretty tightly packed around n total edges. You could therefore generate n random edges, add them into the graph, then use the above techniques. (This will be much faster, as this does O(n) work rather than O(n2) work to process the edges.)
Once you've gotten this worked out, you can vary n over a large range and run some sort of polynomial regression on it to find the best-first curve. That's something you could either code up yourself, or which you could do by importing your data into Excel and using its regression tools.
As a spoiler, when you're done you'll find that the number of nodes in the largest connected component is Θ(n2/3). If you search for "Erdős–Rényi critical case," you can find online proofs of this result. It's not a trivial result to prove (and definitely isn't obvious!), but it'll drop out of your empirical analysis.

How can I find Maximum Common Subgraph of two graphs?

Hi i need a help in finding a graph algorithm
im working on the following equation related to distance functions
d (g1, g2) = 1- │mcs(g1,g2) │ /
│g1│+│g2│-│mcs (g1, g2) │
Where
d (g1,g2) : is a distance function based on maximum common sub graph
.
g1, g2 are two graphs .
mcs (g1,g2): is the maximum common sub graph of two graphs g1,g2
where mcs is the largest graph (by some measure involving the number
of nodes and edges )contained in both subject graphs .
│g1│: Cardinality of the common induced sub graph g1
│g2│: Cardinality of the common induced sub graph g2
My Question: How can I calculate MCS?
I searched the internet but most of the algorithms are complicated anyone know from where i can get a simple algorithm to program this equation in matlab.
The problem is NP-Complete1.
The reduction from the Clique Problem2. Given an instance of Clique Problem - a graph G=(V,E), create a complete clique G'=(V,E') such that E' = {(u,v) | u != v, for each u,v in V).
The solution to the maximal clique problem is the same solution for the maximal subgraph problem for G and G'. Since clique problem is NP-Hard, so does this problem.
Thus, there is no known polynomial solution to this problem.
If you are looking for an exact algorithm, you could try exhaustive search approach and/or a branch & bound approach to solve it. Sorry for the bad news, but at least you know not to look for something that (probably) doesn't exist (unless P=NP, of course)
EDIT: exponential brute force solution to the problem:
You can just check all possible subsets, and check if it is a feasible solution.
Pseudo Code:
findMCS(vertices,G1,G2,currentSubset):
if vertices is empty: //base clause, no more candidates to check
if isCommonSubgraph(G1,G2,currentSubset):
return clone(currentSubset)
else:
return {}
v <- vertices.pop() //take a look at the first element
cand1 <- findMCS(vertices,G1,G2,currentSubset) //find MCS if it is NOT in the subset
currentSubset.append(v)
if isCommonSubgrah(G1,G2,currentSubset): //find MCS if it is in the subset
cand2 <- findMCS(vertices,G1,G2,currentSubset)
currentSubset.remvoe(v) //clean up environment before getting back from recursive call
return (|cand1| > |cand2| ? cand1 : cand2) //return the maximal subset from all candidates
Complexity of the above is O(2^n) (checking all possible subsets), and invoke it with: findMCS(G1.vertices, G1, G2, []) (where [] is an empty list).
Note:
isCommonSubgrah(G1,G2,currentSubset) is an easy to calculate method that just answers true if and only if currentSubset is a common subgraph of G1 and G2.
|cand1| and |cand2| is the sizes of these lists.
(1)Assuming that Maximum sub graph is a subset U in V such that for each u1,u2 in U (u1,u2) is in E1 if and only if (u1,u2) is in E2 (intuitively, a maximal subset of the vertices that share the exact same edges in the two graphs)
(2) Clique Problem: Given an instance of G=(V,E) find maximal subset U in V such that for each u1,u2 in U : u1 = u2 or (u1,u2) is in E.
The backtracking search algorithm proposed by James J. McGregor may be utilized to identify the MCS between two graphs.
You can't even check if one graph is a subgraph of the other one, it's the subgraph isomorphism problem known to be NP-complete. Hereby you can't find the maximal subgraph because you can't check the isomorphism property (in polynomial time).
The main problem is finding a correspondence between nodes in the original graphs (essentially a renumbering of the vertices). For instance, if we have node p in graph g1 and node q in graph g2 where p and q are equivalent, we'd like to map them to a node s in the common subgraph, c.
The reason that the Clique Problem is so difficult is that, without any way of checking whether two nodes in different graphs actually refer to the same node, we have to try all possible combinations of pairs of nodes and check if each pair is consistent and represents the "best" correspondence.
Since the nodes in these graphs represent geographic locations, we should be able to come up with a reasonable distance metric that tells us how likely it is that a node in one graph is the same as any node in the other graph. Since the GPS coordinates of the two nodes are probably not identical, we need to make some assumptions based on the problem.
If we have a map of the region in which the data points occur, represented as a graph m, we can renumber or rename the nodes in g1 and g2 to correspond to their closest equivalent in in m.
Distance (either between the original graphs and m or between points in g1 and g2) can either be the Euclidean distance or the Manhattan distance, depending on what makes more sense for your graphs.
You'll have to be careful in deciding how far apart two nodes can be and still be considered equivalent. Too small and you won't get any matches; too large and your entire graph could be condensed into one node.
Two or more nodes in an original graph could possible all map to the same node in c. If the location data is updated frequently in relation to the distance between nodes, for instance.
Conversely, an edge between a pair of successive nodes in an original graph could also map to a path containing multiple edges if the update frequency is low in relation to the distances. So you'll have to figure out whether it makes sense to introduce these intermediate nodes into the common graph or treat the whole path as a single edge.
Once you've got the renumbering of the nodes you can use the method that Jens suggests to find the intersection of the renumbered graphs. This is all very general since I don't have a lot of details about your specific problem, but hopefully it's enough to get you started.

computing number of nodes which can be reached by a specific node in a directed graph for each node

In a directed graph (suppose it has lots of cycles) I need to compute number of nodes which can be reached by specific node for each node. How can I do that with minimal effort? Which algorithm do I need to use?
Note: I think a reasonable algorithm for this problem should recursively compute this numbers(like result for 'node a' depends on that of 'node b' if a is connected to b).
The algorithm you're looking for is called the Floyd-Warshall algorithm, a very nice and efficient dynamic programming algorithm. It can be used to calculate the set of nodes reachable from each individual node in a graph (the transitive closure), although it's more often used to calculate the shortest paths from each individual node in a graph to all other nodes.
(Edit: the Floyd-Warshall algorithm is more complicated than it needs to be for your uses, because it's been extended a bit by Floyd to calculate shortest paths. You may find this page helpful, which only describes the "Warshall" part of the algorithm - the part you need.)
I happen to be studying it right now for class and have the paper on my desk. The recurrence for the transitive closure version of F-W is:
T(i,j,k) = T(i,j,k-1) ∨ (T(i,k,k-1) ∧ T(k,j,k-1))
Where T(a,b,c) is true if and only if there is a path from a to b using only the first c vertices in the graph (you must give them an arbitrary numbering before running the algorithm).
Intuitively, the recurrence says that there's a path from i to j using the first k vertices if:
there's a direct path between i and j, using the first k-1 vertices, OR
there's a path between i and k, and a path between k and j, using the first k-1 vertices.
You can build up the entire 3-dimensional table of T(i,j,k) in the typical dynamic programming fashion, and then count all of the TRUE entries along the source node that you want (using the max k), to get the size of the transitive closure for that source node.
If you're still following my poor explanation, you can make the algorithm extremely efficient with a few tricks:
It turns out that you don't need the k dimension in your table; you can just overwrite your same row of values over and over. Now the program would look like:
T(i,j) = T(i,j) || (T(i,k) && T(k,j))
If T(i,k) is 0 then you can skip the whole thing since nothing will change on that step.
If T(i,k) is 1 then the new value will just be T(i,j) || T(k,j). This can be done in huge chunks because block OR is extremely fast on modern processors.
Hope that helps...

Find all subtrees of size N in an undirected graph

Given an undirected graph, I want to generate all subgraphs which are trees of size N, where size refers to the number of edges in the tree.
I am aware that there are a lot of them (exponentially many at least for graphs with constant connectivity) - but that's fine, as I believe the number of nodes and edges makes this tractable for at least smallish values of N (say 10 or less).
The algorithm should be memory-efficient - that is, it shouldn't need to have all graphs or some large subset of them in memory at once, since this is likely to exceed available memory even for relatively small graphs. So something like DFS is desirable.
Here's what I'm thinking, in pseudo-code, given the starting graph graph and desired length N:
Pick any arbitrary node, root as a starting point and call alltrees(graph, N, root)
alltrees(graph, N, root)
given that node root has degree M, find all M-tuples with integer, non-negative values whose values sum to N (for example, for 3 children and N=2, you have (0,0,2), (0,2,0), (2,0,0), (0,1,1), (1,0,1), (1,1,0), I think)
for each tuple (X1, X2, ... XM) above
create a subgraph "current" initially empty
for each integer Xi in X1...XM (the current tuple)
if Xi is nonzero
add edge i incident on root to the current tree
add alltrees(graph with root removed, N-1, node adjacent to root along edge i)
add the current tree to the set of all trees
return the set of all trees
This finds only trees containing the chosen initial root, so now remove this node and call alltrees(graph with root removed, N, new arbitrarily chosen root), and repeat until the size of the remaining graph < N (since no trees of the required size will exist).
I forgot also that each visited node (each root for some call of alltrees) needs to be marked, and the set of children considered above should only be the adjacent unmarked children. I guess we need to account for the case where no unmarked children exist, yet depth > 0, this means that this "branch" failed to reach the required depth, and cannot form part of the solution set (so the whole inner loop associated with that tuple can be aborted).
So will this work? Any major flaws? Any simpler/known/canonical way to do this?
One issue with the algorithm outlined above is that it doesn't satisfy the memory-efficient requirement, as the recursion will hold large sets of trees in memory.
This needs an amount of memory that is proportional to what is required to store the graph. It will return every subgraph that is a tree of the desired size exactly once.
Keep in mind that I just typed it into here. There could be bugs. But the idea is that you walk the nodes one at a time, for each node searching for all trees that include that node, but none of the nodes that were searched previously. (Because those have already been exhausted.) That inner search is done recursively by listing edges to nodes in the tree, and for each edge deciding whether or not to include it in your tree. (If it would make a cycle, or add an exhausted node, then you can't include that edge.) If you include it your tree then the used nodes grow, and you have new possible edges to add to your search.
To reduce memory use, the edges that are left to look at is manipulated in place by all of the levels of the recursive call rather than the more obvious approach of duplicating that data at each level. If that list was copied, your total memory usage would get up to the size of the tree times the number of edges in the graph.
def find_all_trees(graph, tree_length):
exhausted_node = set([])
used_node = set([])
used_edge = set([])
current_edge_groups = []
def finish_all_trees(remaining_length, edge_group, edge_position):
while edge_group < len(current_edge_groups):
edges = current_edge_groups[edge_group]
while edge_position < len(edges):
edge = edges[edge_position]
edge_position += 1
(node1, node2) = nodes(edge)
if node1 in exhausted_node or node2 in exhausted_node:
continue
node = node1
if node1 in used_node:
if node2 in used_node:
continue
else:
node = node2
used_node.add(node)
used_edge.add(edge)
edge_groups.append(neighbors(graph, node))
if 1 == remaining_length:
yield build_tree(graph, used_node, used_edge)
else:
for tree in finish_all_trees(remaining_length -1
, edge_group, edge_position):
yield tree
edge_groups.pop()
used_edge.delete(edge)
used_node.delete(node)
edge_position = 0
edge_group += 1
for node in all_nodes(graph):
used_node.add(node)
edge_groups.append(neighbors(graph, node))
for tree in finish_all_trees(tree_length, 0, 0):
yield tree
edge_groups.pop()
used_node.delete(node)
exhausted_node.add(node)
Assuming you can destroy the original graph or make a destroyable copy I came up to something that could work but could be utter sadomaso because I did not calculate its O-Ntiness. It probably would work for small subtrees.
do it in steps, at each step:
sort the graph nodes so you get a list of nodes sorted by number of adjacent edges ASC
process all nodes with the same number of edges of the first one
remove those nodes
For an example for a graph of 6 nodes finding all size 2 subgraphs (sorry for my total lack of artistic expression):
Well the same would go for a bigger graph, but it should be done in more steps.
Assuming:
Z number of edges of most ramificated node
M desired subtree size
S number of steps
Ns number of nodes in step
assuming quicksort for sorting nodes
Worst case:
S*(Ns^2 + MNsZ)
Average case:
S*(NslogNs + MNs(Z/2))
Problem is: cannot calculate the real omicron because the nodes in each step will decrease depending how is the graph...
Solving the whole thing with this approach could be very time consuming on a graph with very connected nodes, however it could be paralelized, and you could do one or two steps, to remove dislocated nodes, extract all subgraphs, and then choose another approach on the remainder, but you would have removed a lot of nodes from the graph so it could decrease the remaining run time...
Unfortunately this approach would benefit the GPU not the CPU, since a LOT of nodes with the same number of edges would go in each step.... and if parallelization is not used this approach is probably bad...
Maybe an inverse would go better with the CPU, sort and proceed with nodes with the maximum number of edges... those will be probably less at start, but you will have more subgraphs to extract from each node...
Another possibility is to calculate the least occuring egde count in the graph and start with nodes that have it, that would alleviate the memory usage and iteration count for extracting subgraphs...
Unless I'm reading the question wrong people seem to be overcomplicating it.
This is just "all possible paths within N edges" and you're allowing cycles.
This, for two nodes: A, B and one edge your result would be:
AA, AB, BA, BB
For two nodes, two edges your result would be:
AAA, AAB, ABA, ABB, BAA, BAB, BBA, BBB
I would recurse into a for each and pass in a "template" tuple
N=edge count
TempTuple = Tuple_of_N_Items ' (01,02,03,...0n) (Could also be an ordered list!)
ListOfTuple_of_N_Items ' Paths (could also be an ordered list!)
edgeDepth = N
Method (Nodes, edgeDepth, TupleTemplate, ListOfTuples, EdgeTotal)
edgeDepth -=1
For Each Node In Nodes
if edgeDepth = 0 'Last Edge
ListOfTuples.Add New Tuple from TupleTemplate + Node ' (x,y,z,...,Node)
else
NewTupleTemplate = TupleTemplate + Node ' (x,y,z,Node,...,0n)
Method(Nodes, edgeDepth, NewTupleTemplate, ListOfTuples, EdgeTotal
next
This will create every possible combination of vertices for a given edge count
What's missing is the factory to generate tuples given an edge count.
You end up with a list of possible paths and the operation is Nodes^(N+1)
If you use ordered lists instead of tuples then you don't need to worry about a factory to create the objects.
If memory is the biggest problem you can use a NP-ish solution using tools from formal verification. I.e., guess a subset of nodes of size N and check whether it's a graph or not. To save space you can use a BDD (http://en.wikipedia.org/wiki/Binary_decision_diagram) to represent the original graph's nodes and edges. Plus you can use a symbolic algorithm to check if the graph you guessed is really a graph - so you don't need to construct the original graph (nor the N-sized graphs) at any point. Your memory consumption should be (in big-O) log(n) (where n is the size of the original graph) to store the original graph, and another log(N) to store every "small graph" you want.
Another tool (which is supposed to be even better) is to use a SAT solver. I.e., construct a SAT formula that is true iff the sub-graph is a graph and supply it to a SAT solver.
For a graph of Kn there are approximately n! paths between any two pairs of vertices. I haven't gone through your code but here is what I would do.
Select a pair of vertices.
Start from a vertex and try to reach the destination vertex recursively (something like dfs but not exactly). I think this would output all the paths between the chosen vertices.
You could do the above for all possible pairs of vertices to get all simple paths.
It seems that the following solution will work.
Go over all partitions into two parts of the set of all vertices. Then count the number of edges which endings lie in different parts (k); these edges correspond to the edge of the tree, they connect subtrees for the first and the second parts. Calculate the answer for both parts recursively (p1, p2). Then the answer for the entire graph can be calculated as sum over all such partitions of k*p1*p2. But all trees will be considered N times: once for each edge. So, the sum must be divided by N to get the answer.
Your solution as is doesn't work I think, although it can be made to work. The main problem is that the subproblems may produce overlapping trees so when you take the union of them you don't end up with a tree of size n. You can reject all solutions where there is an overlap, but you may end up doing a lot more work than needed.
Since you are ok with exponential runtime, and potentially writing 2^n trees out, having V.2^V algorithms is not not bad at all. So the simplest way of doing it would be to generate all possible subsets n nodes, and then test each one if it forms a tree. Since testing whether a subset of nodes form a tree can take O(E.V) time, we are potentially talking about V^2.V^n time, unless you have a graph with O(1) degree. This can be improved slightly by enumerating subsets in a way that two successive subsets differ in exactly one node being swapped. In that case, you just have to check if the new node is connected to any of the existing nodes, which can be done in time proportional to number of outgoing edges of new node by keeping a hash table of all existing nodes.
The next question is how do you enumerate all the subsets of a given size
such that no more than one element is swapped between succesive subsets. I'll leave that as an exercise for you to figure out :)
I think there is a good algorithm (with Perl implementation) at this site (look for TGE), but if you want to use it commercially you'll need to contact the author. The algorithm is similar to yours in the question but avoids the recursion explosion by making the procedure include a current working subtree as a parameter (rather than a single node). That way each edge emanating from the subtree can be selectively included/excluded, and recurse on the expanded tree (with the new edge) and/or reduced graph (without the edge).
This sort of approach is typical of graph enumeration algorithms -- you usually need to keep track of a handful of building blocks that are themselves graphs; if you try to only deal with nodes and edges it becomes intractable.
This algorithm is big and not easy one to post here. But here is link to reservation search algorithm using which you can do what you want. This pdf file contains both algorithms. Also if you understand russian you can take a look to this.
So you have a graph with with edges e_1, e_2, ..., e_E.
If I understand correctly, you are looking to enumerate all subgraphs which are trees and contain N edges.
A simple solution is to generate each of the E choose N subgraphs and check if they are trees.
Have you considered this approach? Of course if E is too large then this is not viable.
EDIT:
We can also use the fact that a tree is a combination of trees, i.e. that each tree of size N can be "grown" by adding an edge to a tree of size N-1. Let E be the set of edges in the graph. An algorithm could then go something like this.
T = E
n = 1
while n<N
newT = empty set
for each tree t in T
for each edge e in E
if t+e is a tree of size n+1 which is not yet in newT
add t+e to newT
T = newT
n = n+1
At the end of this algorithm, T is the set of all subtrees of size N. If space is an issue, don't keep a full list of the trees, but use a compact representation, for instance implement T as a decision tree using ID3.
I think problem is under-specified. You mentioned that graph is undirected and that subgraph you are trying to find is of size N. What is missing is number of edges and whenever trees you are looking for binary or you allowed to have multi-trees. Also - are you interested in mirrored reflections of same tree, or in other words does order in which siblings are listed matters at all?
If single node in a tree you trying to find allowed to have more than 2 siblings which should be allowed given that you don't specify any restriction on initial graph and you mentioned that resulting subgraph should contain all nodes.
You can enumerate all subgraphs that have form of tree by performing depth-first traversal. You need to repeat traversal of the graph for every sibling during traversal. When you'll need to repeat operation for every node as a root.
Discarding symmetric trees you will end up with
N^(N-2)
trees if your graph is fully connected mesh or you need to apply Kirchhoff's Matrix-tree theorem

Resources