I have studied various algorithms for cycle detection in directed graphs, such as incremental search, strongly connected components, BFS, two-way search, etc. Now I want to simulate them and compare their performance. I call the cycle detection function every time I insert an edge.
So, my question is: what kind of datasets should I consider? If I use random graphs, what should be the criteria for evaluating the different algorithms? Some random graphs might be huge, but they might produce a cycle after only a few insertions. It would be helpful if someone could suggest how to go about this.
Also, to compare performance, does it make sense to remove the cycle and then continue the insertions, and once all insertions terminate, compare the execution times of all the implementations?
It really depends on what you're doing this for. In general, there are many approaches to generating random graphs, but arguably the most famous is Erdős-Rényi. Note, though, that a graph on n vertices can only stay acyclic while it is very sparse: an undirected graph must have at most n - 1 edges, and a random directed graph also contains a directed cycle with high probability once it has on the order of n edges. Standard random graph generators will therefore give you a cycle almost immediately. Depending on your exact case, you might find it best to keep the graph as sparse as possible (i.e. allow only a few edges).
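To make the comparison concrete, here is a minimal sketch of the kind of harness the question describes, under the sparse-random-graph assumption above: edges are drawn Erdős-Rényi style (a G(n, m) sample), inserted one at a time, and checked with a plain DFS reachability test; when an insertion would close a cycle, the offending edge is simply dropped and the insertions continue. The helper names (gen_sparse_edges, creates_cycle) are made up for the example, and the DFS check is only a baseline to swap out for the fancier incremental algorithms you want to benchmark.

```python
import random
import time

def gen_sparse_edges(n, m, seed=0):
    """Sample m distinct directed edges on n vertices, Erdos-Renyi G(n, m) style."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            edges.add((u, v))
    return list(edges)

def creates_cycle(adj, u, v):
    """Adding u -> v closes a cycle iff u is already reachable from v (plain DFS)."""
    stack, seen = [v], set()
    while stack:
        x = stack.pop()
        if x == u:
            return True
        if x not in seen:
            seen.add(x)
            stack.extend(adj.get(x, ()))
    return False

def benchmark(n, m):
    adj, rejected = {}, 0
    start = time.perf_counter()
    for u, v in gen_sparse_edges(n, m):
        if creates_cycle(adj, u, v):
            rejected += 1              # "remove the cycle": drop the edge and keep inserting
        else:
            adj.setdefault(u, []).append(v)
    return time.perf_counter() - start, rejected

if __name__ == "__main__":
    for m in (2000, 5000, 10000):
        elapsed, rejected = benchmark(n=5000, m=m)
        print(f"n=5000, m={m}: {elapsed:.3f}s, {rejected} insertions rejected")
```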
I have a graph of many hundreds of nodes that are mostly connected with each other. I can process the entire graph, but it takes a lot of time, so I would like to divide it into smaller sub-graphs of approximately similar size.
In other words: I have a collection of aerial images and I do pairwise image matching on all of them. As a result I get a set of matches for each pair (a pixel from the first image matched with a pixel in the second image). The number of matches is taken as the weight of this (undirected) edge. These edges then form the graph mentioned above.
I'm not so familiar with graph theory (as it is a very broad topic). What is the best algorithm for this job?
Thank you.
Edit:
This problem has a perfect analogy which I think is easier to understand. Imagine you have a set of people and their connections/friendships, like in a social network. Each friendship has a numeric value/weight representing how good friends they are. So, in a large group of people, I want to get the k most interconnected sub-groups.
Unfortunately, the problem you're describing is almost certainly NP-hard. From a graph perspective, you have a graph in which each edge carries a weight. You're trying to split the graph into pieces while minimizing the total weight of the edges you cut. This is the minimum k-cut problem, which is NP-hard when the number of pieces is part of the input. If you add the constraint that the pieces should also be roughly equal in size, you get the balanced k-cut problem, which is also NP-hard.
The good news is that there are nice approximation algorithms for these problems, so if you're looking for solutions that are just "good enough," you can probably find a library somewhere that implements them. There are also other techniques, like spectral clustering, which work well in practice and are really fast, but which don't come with any guarantees on how well they'll do.
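As an illustration of the spectral clustering route, here is a hedged sketch using scikit-learn's SpectralClustering with a precomputed affinity matrix. The tiny matrix W is invented for the example; in the image-matching setting its entry [i, j] would be the number of pixel matches between images i and j.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Weighted adjacency / affinity matrix: W[i, j] = number of matches between images i and j.
W = np.array([
    [ 0, 50, 40,  2,  1],
    [50,  0, 45,  3,  0],
    [40, 45,  0,  1,  2],
    [ 2,  3,  1,  0, 60],
    [ 1,  0,  2, 60,  0],
], dtype=float)

k = 2  # desired number of sub-groups
labels = SpectralClustering(
    n_clusters=k,
    affinity="precomputed",   # treat W directly as the similarity matrix
    assign_labels="kmeans",
    random_state=0,
).fit_predict(W)

print(labels)  # e.g. [0 0 0 1 1]: images 0-2 form one group, images 3-4 the other
```

Keep in mind this gives no approximation guarantee; it is just the fast, works-well-in-practice option mentioned above.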
I am looking for a way to perform a topological sort on a given directed, unweighted graph that contains cycles. The result should contain not only the ordering of the vertices, but also the set of edges that are violated by that ordering. This set of edges should be minimal.
As my input graph is potentially large, I cannot use an exponential-time algorithm. If it's impossible to compute an optimal solution in polynomial time, what heuristic would be reasonable for this problem?
Eades, Lin, and Smyth proposed "A fast and effective heuristic for the feedback arc set problem". The original article is behind a paywall, but a free copy is available from here.
There's an algorithm for topological sorting that builds the vertex order by selecting a vertex with no incoming arcs, recursing on the graph minus that vertex, and prepending that vertex to the order. (I'm describing the algorithm recursively, but you don't have to implement it that way.) The Eades–Lin–Smyth algorithm also looks for vertices with no outgoing arcs and appends them. Of course, it can happen that every vertex has both incoming and outgoing arcs. In that case, select the vertex with the largest difference between outgoing and incoming arcs. There is undoubtedly room for experimentation here.
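A sketch of that greedy strategy might look like the following (my own rendering, not the paper's pseudocode; the function name and data layout are illustrative). The "violated" edges are simply the arcs that end up pointing backwards in the computed order.

```python
def feedback_arc_order(nodes, edges):
    """Greedy Eades-Lin-Smyth-style heuristic: returns a vertex order plus the
    arcs that point backwards in that order (the candidate feedback arc set)."""
    succ = {u: set() for u in nodes}
    pred = {u: set() for u in nodes}
    for u, v in edges:
        if u != v:                      # self-loops are always violated; skip them here
            succ[u].add(v)
            pred[v].add(u)

    remaining = set(nodes)
    s1, s2 = [], []                     # s1 grows at the front of the order, s2 at the back

    def delete(u):
        remaining.remove(u)
        for w in succ.pop(u):
            pred[w].discard(u)
        for w in pred.pop(u):
            succ[w].discard(u)

    while remaining:
        # Peel off sinks (no outgoing arcs): they go to the end of the order.
        sinks = [u for u in remaining if not succ[u]]
        while sinks:
            u = sinks.pop()
            s2.append(u)
            delete(u)
            sinks = [w for w in remaining if not succ[w]]
        # Peel off sources (no incoming arcs): they go to the front of the order.
        sources = [u for u in remaining if not pred[u]]
        while sources:
            u = sources.pop()
            s1.append(u)
            delete(u)
            sources = [w for w in remaining if not pred[w]]
        # Every remaining vertex has both kinds of arcs: take the one with the
        # largest out-degree minus in-degree and treat it like a source.
        if remaining:
            u = max(remaining, key=lambda x: len(succ[x]) - len(pred[x]))
            s1.append(u)
            delete(u)

    order = s1 + s2[::-1]
    pos = {u: i for i, u in enumerate(order)}
    violated = [(u, v) for u, v in edges if u == v or pos[u] > pos[v]]
    return order, violated

order, violated = feedback_arc_order([1, 2, 3], [(1, 2), (2, 3), (3, 1)])
print(order, violated)  # e.g. [1, 2, 3] and [(3, 1)]
```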
The algorithms with provable worst-case behavior are based on linear programming and graph cuts. These are neat, but the guarantees are less than ideal (log^2 n or log n log log n times as many arcs as needed), and I suspect that efficient implementations would be quite a project.
Inspired by Arnaud's answer and other interesting topological sorting algorithms, I created the cyclic-toposort project and published it on GitHub. cyclic_toposort does exactly what you're asking for: it quickly sorts a directed cyclic graph while producing a minimal number of violating edges. If desired, it can also provide the maximal groupings of nodes that are on the same topological level (and can therefore be activated at the same time).
If the problem is still relevant to you then I would be happy if you check out my project and let me know what you think!
This project was useful for my own neural network topology research, so I had to create something like it anyway. I am answering your question this late in case anyone else stumbles upon this thread looking for an answer to the same question.
I have an undirected graph. One edge in that graph is special. I want to find all other edges that are part of an even cycle containing that special edge.
I don't need to enumerate all the cycles; that would take exponential time, I think. I just need to know, for each edge, whether it satisfies the condition above.
A brute-force search works, of course, but it is too slow, and I'm struggling to come up with anything better. Any help appreciated.
I think we have an answer (I must credit my colleague with the idea). Essentially, his idea is to do a flood fill through the space of even cycles. This works because if you have a large even cycle formed by merging two smaller cycles, then the smaller cycles must both have been even or both odd. Similarly, merging an odd cycle with an even cycle always forms a larger odd cycle.
This is only a practical (rather than worst-case) solution, because I can imagine pathological cases consisting of alternating even and odd cycles. In such cases we would never find two adjacent even cycles, and the algorithm would be slow. But I'm confident that such cases don't arise in real chemistry. At least in chemistry as it's currently known; 30 years ago we'd never heard of fullerenes.
If your graph has a small node degree, you might consider using a different graph representation:
Consider three atoms u, v, w and two chemical bonds e = (u,v) and k = (v,w). A typical way of representing such data is to store u, v, w as nodes and e, k as edges of a graph.
However, one may instead represent e and k as nodes of the graph, with edges such as f = (e,k), where f represents the 2-step link from u to w (the path u - v - w). Running any cycle-finding algorithm on such a graph will return all even cycles of the original graph.
Of course, this is efficient only if the original graph has a small node degree. When a user performs an edit, you can easily update the alternative representation accordingly.
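For illustration, here is a small sketch of that alternative representation (the helper name bonds_as_nodes and the toy molecule are made up): each bond becomes a node, and two bond-nodes are joined whenever the bonds share an atom, i.e. they form a 2-step link in the original molecule.

```python
from itertools import combinations

def bonds_as_nodes(bonds):
    """Build the alternative representation: bonds become nodes, and two bond-nodes
    are adjacent when the bonds share an atom (a 2-step link u - v - w)."""
    incident = {}                                # atom -> bonds touching it
    for e in bonds:
        u, v = e
        incident.setdefault(u, []).append(e)
        incident.setdefault(v, []).append(e)

    adj = {e: set() for e in bonds}
    for atom, touching in incident.items():
        for e, k in combinations(touching, 2):   # every pair of bonds sharing this atom
            adj[e].add(k)
            adj[k].add(e)
    return adj

# A 4-cycle (even) plus one pendant atom:
bonds = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "x")]
print(bonds_as_nodes(bonds))
```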
I was browsing the Wikipedia entry on maze generation algorithms and found that the article strongly insinuated that different maze generation algorithms (randomized depth-first search, randomized Kruskal's, etc.) produce mazes with different characteristics. This seems to suggest that the algorithms produce random mazes with different probability distributions over the set of all single-solution mazes (spanning trees on a rectangular grid).
My questions are:
Is this correct? That is, am I reading this article correctly, and is the article correct?
If so, why? I don't see an intuitive reason why the different algorithms would produce different distributions.
Well, I think it's pretty obvious that different algorithms generate different mazes. Let's just talk about spanning trees of a grid. Suppose you have a grid G and two algorithms for generating a spanning tree of the grid:
Algorithm A:
Pick any edge of the grid, with 99% probability choose a horizontal one, otherwise a vertical one
Add the edge to the maze, unless adding it would create a cycle
Stop when every vertex is connected to every other vertex (spanning tree complete)
Algorithm B:
As algorithm A, but set the probability to 1% instead of 99%
"Obviously" algorithm A produces mazes with lots of horizontal passages and algorithm B mazes with lots of vertical passages. That is, there is a statistical correlation between the number of horizontal passages in a maze and the maze being produced by algorithm A.
Of course the differences between the Wikipedia algorithms are more intricate but the principle is the same. The algorithms sample the space of possible mazes for a given grid in a non-uniform, structured way.
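For concreteness, here is a toy rendering of algorithms A and B (a Kruskal-style loop with a union-find cycle check; the only difference between the two is the horizontal/vertical bias). The helper names are illustrative, not from the article.

```python
import random

def biased_spanning_tree(width, height, p_horizontal, seed=0):
    """Draw grid edges at random, preferring horizontal ones with probability
    p_horizontal, and keep each edge unless it would close a cycle (union-find),
    until the passages form a spanning tree."""
    rng = random.Random(seed)
    horizontal = [((x, y), (x + 1, y)) for y in range(height) for x in range(width - 1)]
    vertical   = [((x, y), (x, y + 1)) for y in range(height - 1) for x in range(width)]

    parent = {(x, y): (x, y) for y in range(height) for x in range(width)}

    def find(c):                         # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    passages, needed = [], width * height - 1
    while len(passages) < needed:
        pool = horizontal if (rng.random() < p_horizontal and horizontal) else vertical
        if not pool:
            pool = horizontal or vertical
        a, b = pool.pop(rng.randrange(len(pool)))
        ra, rb = find(a), find(b)
        if ra != rb:                     # adding this passage does not create a cycle
            parent[ra] = rb
            passages.append((a, b))
    return passages

tree_a = biased_spanning_tree(10, 10, p_horizontal=0.99)   # "algorithm A"
tree_b = biased_spanning_tree(10, 10, p_horizontal=0.01)   # "algorithm B"
print(sum(a[1] == b[1] for a, b in tree_a), "horizontal passages in A")
print(sum(a[1] == b[1] for a, b in tree_b), "horizontal passages in B")
```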
LOL I remember a scientific conference where a researcher presented her results about her algorithm that did something "for graphs". The results were statistical and presented for "random graphs". Someone asked from the audience "which distribution of random graphs did you draw the graphs from?" The answer: "uh... they were produced by our graph generation program". Duh!
Interesting question. Here are my random 2c.
Comparing Prim's to, say, DFS, the latter seems to have a proclivity for producing deeper trees, simply because the first 'runs' have more space to create deep trees with fewer branches. Prim's algorithm, on the other hand, appears to create trees with more branching, because any open branch can be selected at each iteration.
One way to ask this would be to look at the probability that each algorithm produces a tree of depth > N. I have a hunch that these probabilities would differ. A more formal approach to proving this might be to assign weights to each part of the tree and show that one part is more likely to be taken, or to characterize the space in some other way, but I'll be hand-wavy and guess it's correct :). I'm interested in what led you to think it wouldn't be the case, because my intuition was the opposite. And no, the Wiki article doesn't give a convincing argument.
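To test that hunch empirically, one could run a quick simulation like the sketch below (my own code, not from any reference): generate many spanning trees of a small grid with randomized DFS and with randomized Prim, and compare their average depth.

```python
import random

def grid_neighbors(x, y, w, h):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < w and 0 <= y + dy < h:
            yield (x + dx, y + dy)

def dfs_tree(w, h, rng):
    """Randomized depth-first (recursive backtracker) spanning tree; returns child -> parent."""
    start, parent = (0, 0), {}
    visited, stack = {start}, [start]
    while stack:
        cell = stack[-1]
        unvisited = [n for n in grid_neighbors(*cell, w, h) if n not in visited]
        if unvisited:
            nxt = rng.choice(unvisited)
            visited.add(nxt)
            parent[nxt] = cell
            stack.append(nxt)
        else:
            stack.pop()
    return parent

def prim_tree(w, h, rng):
    """Randomized Prim: grow the tree by a uniformly random frontier edge each step."""
    start, parent = (0, 0), {}
    visited = {start}
    frontier = [(start, n) for n in grid_neighbors(*start, w, h)]
    while frontier:
        src, dst = frontier.pop(rng.randrange(len(frontier)))
        if dst in visited:
            continue
        visited.add(dst)
        parent[dst] = src
        frontier.extend((dst, n) for n in grid_neighbors(*dst, w, h) if n not in visited)
    return parent

def depth(parent):
    """Longest root-to-leaf path length in the spanning tree."""
    best = 0
    for cell in parent:
        d, cur = 0, cell
        while cur in parent:
            cur, d = parent[cur], d + 1
        best = max(best, d)
    return best

rng, trials = random.Random(0), 200
print("avg DFS depth :", sum(depth(dfs_tree(10, 10, rng)) for _ in range(trials)) / trials)
print("avg Prim depth:", sum(depth(prim_tree(10, 10, rng)) for _ in range(trials)) / trials)
```

On a small grid the DFS trees should come out markedly deeper on average, matching the intuition above.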
EDIT
One simple way to see this is to consider an initial tree with two children and a total of k nodes,
e.g.,
*---* ... *
\--* ... *
Choose a random node as the start and another as the end. DFS will produce one of two mazes: either the entire tree, or the part of it with the direct path from start to end. Prim's algorithm will produce the 'maze' with the direct path from start to end plus secondary paths of lengths 1 ... k.
It is not statistical until you request that each algorithm produce every solution it can.
What you are perceiving as statistical bias is only a bias towards the preferred, first solution.
That bias may not be algorithmic (set-theory-wise) but implementation dependent (like the bias in the choice of the pivot in quicksort).
Yes, it is correct. You can produce different mazes by starting the process in different ways. Some algorithms start with a fully closed grid and remove walls to generate a path through the maze, while others start with an empty grid and add walls, leaving behind a path. This alone can produce different results.
There's a graph with a lot of nodes and very few edges between them. The problem is assigning numbers to the nodes so that most edges go from some node i to node i+1, or at least connect nodes with nearby numbers.
My problem is about printing graph data nicely, but an algorithm just like this is part of pretty much every compiler (intermediate code is just a graph, and the produced object code gets memory locations).
I thought it was just a straightforward depth-first search, but the results of that aren't that great: it seems to minimize the number of back links well enough, but the ones it leaves tend to be horrible (like 1 -> 500 -> 1).
Any better ideas?
This paper discusses this problem, if you use Eyal Schneider's formulation of minimizing the sum of the edge deltas (absolute value of the difference between the endpoints' labels). It's under #2, Optimal Linear Arrangements.
Sadly, there's no algorithm given for achieving an optimal ordering (or labeling), and the general problem is NP-complete. There are references to some polynomial-time algorithms for trees, though.
If you want to get into the academic stuff, Google gives lots of hits for "Optimal Linear Arrangements".
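If you just want to compare candidate numberings (say, a plain DFS order against something smarter), the objective from that formulation is easy to compute. A tiny illustrative helper (the example graph and labelings are invented):

```python
def arrangement_cost(edges, label):
    """Sum of |label[u] - label[v]| over all edges: the 'sum of edge deltas'
    objective of the optimal linear arrangement formulation (smaller is better)."""
    return sum(abs(label[u] - label[v]) for u, v in edges)

# Toy example: a simple path a-b-c-d-e, numbered well and numbered badly.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
good = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
bad  = {"a": 0, "b": 4, "c": 1, "d": 3, "e": 2}
print(arrangement_cost(edges, good))  # 4
print(arrangement_cost(edges, bad))   # 4 + 3 + 2 + 1 = 10
```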