Difference between Boruvka and Kruskal in finding MST - algorithm

I would like to know the difference between Boruvkas algorithm and Kruskals algorithm.
What they have in common:
both find the minimum spanning tree (MST) in a undirected graph
both add the shortest edge to the existing tree until the MST is found
both look at the graph in it`s entirety, unlike e.g. Prims algorithm, which adds one node after another to the MST
Both algorithmns are greedy
The only difference seems to be, that Boruvkas perspective is each individual node (from where it looks for the cheapest edge), instead of looking at the entire graph (like Kruskal does).
It therefore seems to be, that Boruvka should be relatively easy to do in parallel (unlike Kruskal). Is that true?

In case of Kruskal's algorithm, first of all we want to sort all edges from the cheapest to the most expensive ones. Then in each step we remove min-weight edge and if it doesn't create a cycle in our graph (which initially consits of |V|-1 separate vertices), then we add it to MST. Otherwise we just remove it.
Boruvka's algorithm looks for nearest neighbour of each component (initially vertex). It keeps selecting cheapest edge from each component and adds it to our MST. When we have only one connected component, it's done.
Finding cheapest outgoing edge from each node/component can be done easily in parallel. Then we can just merge new, obtained components and repeat finding phase till we find MST. That's why this algorithm is a good example for parallelism (in case of finding MST).
Regarding parallel processing using Kruskal's algorithm, we need to keep and check edges in strict order, that's why it's hard to achieve explicit parallelism. It's rather sequential and we can't do much about this (even if we still may consider e.g. parallel sorting). Although there were few approaches to implement this method in parallel way, those papers can be found easily to check their results.

Your description is accurate, but one detail can be clarified: Boruvka's algorithm's perspective is each connected component rather than each individual node.
Your intuition about parallelization is also right -- this paper has more details. Excerpt from the abstract:
In this paper we design and implement four parallel MST algorithms (three variations of Boruvka plus our new approach) for arbitrary sparse graphs that for the first time give speedup when compared with the best sequential algorithm.

The important difference between Boruvka's algorithm and Kruskal's or Prim's is that with Boruvka's you don't need to presort the edges or maintain a priority queue.
Boruvka's still incurs the extra log N factor in the cost, but it does it by requiring O(log N) passes over the edges.
You can parallelize Boruvka's algorithm, but you can also parallelize sorting, so I don't know if Boruvka's has any real advantages over Kruskal's in practice.

Related

Create N-Clusters out of Min spanning tree?

Let say I created a Minimum Spanning Tree out of Graph with M nodes. Is there an algorithm to create N number of clusters.
I'm looking to cut some of the links such as that I end up with N clusters and label them i.e. given a node X I can query in which cluster it belongs.
What I think is once I have the MST, I cut the top/max M-N edges of the MST and I will get N clusters ?
Is my logic correct ?
That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
A tree is a (very sparse) subset of edges of a graph, if you cut based on them you are not taking into consideration a (possible) vast majority of edges in your graph.
Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy with a graph with very similar weight edges will produce likely terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in coments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate sparse matrix/graph representation while implementing something like spectral clustering which is probably going to give much more robust results than single-linkage clustering
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.

Topological sorting algorithms

I am building a dependency based scheduler where my tasks/nodes form a directed acyclic graph. I have the following constraints and am trying to decide on the most appropriate algorithm:
New tasks can be added (with or without dependencies) at any time
Some tasks can run in parallel
Two algorithms are mentioned repeatedly with regards to topological sorting; depth first search and Kahn's algorithm.
What are the pro's and con's of these two algorithms?
Is one algorithm objectively better for my scenario?
Is there an alternate that better fits my scenario?
I have one further question about vocabulary. Given dependencies such as:
c->b
b->a
e->d
Is this considered to be a single directed acyclic graph, 2 acyclic graphs (since e and d are not dependent on the other tasks) or an acyclic graph with sub acyclic graphs?
I think you misunderstood these algorithms.
Depth First Search:
So the Depth First Search algorithm (I looked it up on wikipedia. There it is actually called an algorithm. I would rather call it a strategy) is an algorithm for traversing through a graph. So to visit all nodes. It will not return you the topological order of your graph!!
Kahn's algorithm: Kahn's algorithm on the other hand, is the right algorithm for your problem. It will return you the topoligical order if there is one. The algorithm has an asymptotic running time of O(m+n) where m is the amount of edges and n is the amount of vertices in your graph. I do not think, that you will find a better algorithm for that problem.
Vocabulary: In your example, you have one DAG (directed acyclic graph) with two weakly connected components.
EDIT:
After your mention of an algorithm based on Depth First Search:
For the asymptotic running time, it does not matter which algorithm you are using. Both are in O(m+n). Also concerning the storage, it actually should not matter.

Topological sort of cyclic graph with minimum number of violated edges

I am looking for a way to perform a topological sorting on a given directed unweighted graph, that contains cycles. The result should not only contain the ordering of vertices, but also the set of edges, that are violated by the given ordering. This set of edges shall be minimal.
As my input graph is potentially large, I cannot use an exponential time algorithm. If it's impossible to compute an optimal solution in polynomial time, what heuristic would be reasonable for the given problem?
Eades, Lin, and Smyth proposed A fast and effective heuristic for the feedback arc set problem. The original article is behind a paywall, but a free copy is available from here.
There’s an algorithm for topological sorting that builds the vertex order by selecting a vertex with no incoming arcs, recursing on the graph minus the vertex, and prepending that vertex to the order. (I’m describing the algorithm recursively, but you don’t have to implement it that way.) The Eades–Lin–Smyth algorithm looks also for vertices with no outgoing arcs and appends them. Of course, it can happen that all vertices have incoming and outgoing arcs. In this case, select the vertex with the highest differential between incoming and outgoing. There is undoubtedly room for experimentation here.
The algorithms with provable worst-case behavior are based on linear programming and graph cuts. These are neat, but the guarantees are less than ideal (log^2 n or log n log log n times as many arcs as needed), and I suspect that efficient implementations would be quite a project.
Inspired by Arnauds answer and other interesting topological sorting algorithms have I created the cyclic-toposort project and published it on github. cyclic_toposort does exactly what you desire in that it quickly sorts a directed cyclic graph providing the minimal amount of violating edges. It optionally also provides the maximum groupings of nodes that are on the same topological level (and can therefore be activated at the same time) if desired.
If the problem is still relevant to you then I would be happy if you check out my project and let me know what you think!
This project was useful to my own neural network topology research, so I had to create something like this anyway. I am answering your question this late in case anyone else stumbles upon this thread in search for the same question.

Map Reduce algorithm for removing cycles from a graph

This question has a great answer for detecting cycles in a directed graph. Unfortunately, it does not seem easy to make a Map Reduce version of it.
Specifically, I am interested in a Map Reduce algorithm for removing cycles from a directed graph.
I have evaluated using a breadth first search (BFS) algorithm but an issue I see is that two different edges may be removed simultaneously to cut off a cycle. The impact of this scenario is that too many edges could be removed. It is important that cycles are removed while minimizing the number of edges removed.
Solutions with proofs available are preferred!
Thanks.
You need an iterative map reduce to implement this algorithm. See http://www.iterativemapreduce.org/ for a map-reduce framework that centers around iterative map reduces. Or http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm for a worked example of how to do a breadth-first search through a graph with Hadoop using an iterative map reduce.
Well if you want to remove all cycles, then you will end up with a tree. So no matter what algorithm you use, you will remove |E| - (n -1) edges. (if it was correct of course)
However, the question is whether the deletion of edges will lead to a disconnected graph. For this you will need to make an ordering of the edges (let's say lexicographic order). You should then always remove the the largest edge in a cycle. [I guess the proof of correctness is very direct whence: simply use Kruskal algorithm and find that they will be the same ! ]
Any spanning tree algorithm would solve the problem for you. Depending on what you want to optimize (either time or messsage complexity or any other perfomance metric), you will find different algorithms. BFS is the best for time. No algorithm can solve the problem for less than c(logn + m) message for c > 0.
There is an algoritm I like using for DAG's is called YO-YO. The description of the algorithm can be found in : http://www.site.uottawa.ca/~flocchin/CSI4509/8-yoyo11_fr.pdf

Is this minimum spanning tree algorithm correct?

The minimum spanning tree problem is to take a connected weighted graph and find the subset of its edges with the lowest total weight while keeping the graph connected (and as a consequence resulting in an acyclic graph).
The algorithm I am considering is:
Find all cycles.
remove the largest edge from each cycle.
The impetus for this version is an environment that is restricted to "rule satisfaction" without any iterative constructs. It might also be applicable to insanely parallel hardware (i.e. a system where you expect to have several times more degrees of parallelism then cycles).
Edits:
The above is done in a stateless manner (all edges that are not the largest edge in any cycle are selected/kept/ignored, all others are removed).
What happens if two cycles overlap? Which one has its longest edge removed first? Does it matter if the longest edge of each is shared between the two cycles or not?
For example:
V = { a, b, c, d }
E = { (a,b,1), (b,c,2), (c,a,4), (b,d,9), (d,a,3) }
There's an a -> b -> c -> a cycle, and an a -> b -> d -> a
#shrughes.blogspot.com:
I don't know about removing all but two - I've been sketching out various runs of the algorithm and assuming that parallel runs may remove an edge more than once I can't find a situation where I'm left without a spanning tree. Whether or not it's minimal I don't know.
For this to work, you'd have to detail how you would want to find all cycles, apparently without any iterative constructs, because that is a non-trivial task. I'm not sure that's possible. If you really want to find a MST algorithm that doesn't use iterative constructs, take a look at Prim's or Kruskal's algorithm and see if you could modify those to suit your needs.
Also, is recursion barred in this theoretical architecture? If so, it might actually be impossible to find a MST on a graph, because you'd have no means whatsoever of inspecting every vertex/edge on the graph.
I dunno if it works, but no matter what your algorithm is not even worth implementing. Finding all cycles will be the freaking huge bottleneck that will kill it. Also doing that without iterations is impossible. Why don't you implement some standard algorithm, let's say Prim's.
Your algorithm isn't quite clearly defined. If you have a complete graph, your algorithm would seem to entail, in the first step, removing all but the two minimum elements. Also, listing all the cycles in a graph can take exponential time.
Elaboration:
In a graph with n nodes and an edge between every pair of nodes, there are, if I have my math right, n!/(2k(n-k)!) cycles of size k, if you're counting a cycle as some subgraph of k nodes and k edges with each node having degree 2.
#Tynan The system can be described (somewhat over simplified) as a systems of rules describing categorizations. "Things are in category A if they are in B but not in C", "Nodes connected to nodes in Z are also in Z", "Every category in M is connected to a node N and has 'child' categories, also in M for every node connected to N". It's slightly more complicated than this. (I have shown that by creating unstable rules you can model a turning machine but that's beside the point.) It can't explicitly define iteration or recursion but can operate on recursive data with rules like the 2nd and 3rd ones.
#Marcin, Assume that there are an unlimited number of processors. It is trivial to show that the program can be run in O(n^2) for n being the longest cycle. With better data structures, this can be reduced to O(n*O(set lookup function)), I can envision hardware (quantum computers?) that can evaluate all cycles in constant time. giving a O(1) solution to the MST problem.
The Reverse-delete algorithm seems to provide a partial proof of correctness (that the proposed algorithm will not produce a non-minimal spanning tree) this is derived by arguing that mt algorithm will remove every edge that the Reverse-delete algorithm will. However I'm not sure how to show that my algorithm won't delete more than that algorithm.
Hhmm....
OK this is an attempt to finish the proof of correctness. By analogy to the Reverse-delete algorithm, we know that enough edges will be removed. What remains is to show that there will not be to many edges removed.
Removing to many edges can be described as removing all the edges between the side of a binary partition of the graph nodes. However only edges in a cycle are ever removed, therefor, for all edge between partitions to be removed, there needs to be a return path to complete the cycle. If we only consider edges between the partitions then the algorithm can at most remove the larger of each pair of edges, this can never remove the smallest bridging edge. Therefor for any arbitrary binary partitioning, the algorithm can't sever all links between the side.
What remains is to show that this extends to >2 way partitions.

Resources