Finding Strongly Connected Components in a graph through DFS

I was reading about the graph algorithms BFS and DFS. While analyzing the algorithm for finding strongly connected components in a graph through DFS, a doubt came to my mind. To find the strongly connected components, the book (Cormen) first runs DFS on the graph to get the finish times of the vertices, and then runs DFS again on the transpose of the graph, taking vertices in decreasing order of the finish times obtained from the first DFS. But I am not able to grasp why the second DFS must be run according to finish time.
What I mean is: even if we ran DFS directly on the transpose of the graph (ignoring the finish times), wouldn't it also give us the strongly connected components, since by taking the transpose we have already blocked the paths to other components?

Edit: Here are some good in-depth videos from Stanford University on the topic:
http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=IntroToAlgorithms (See 6. CONNECTIVITY IN DIRECTED GRAPHS)
My explanation:
It's possible that you would incorrectly identify the entire graph as a single strongly connected component (SCC) if you don't run the second DFS according to decreasing finish times of the first DFS.
Notice that in my example, node d would always have the lowest finish time from the first DFS. One of the nodes a, b, or c will have the highest finish time. Let's assume a has the highest finish time, so if we ran the second DFS according to decreasing finish times, a would be first.
Now, if you ran the second DFS starting with node d in the transpose of G, you would produce a depth-first forest containing the entire graph, and so conclude that the entire graph is an SCC, which is clearly false. However, if you start the DFS with a, then you would not only discover a, b, and c as being an SCC; the important part is that they would be marked as visited by being colored gray or black. Then, when you continue the DFS on d, you wouldn't traverse out of its SCC, because you would see that its adjacent nodes have already been visited.
If you look at Cormen's code for DFS,
DFS(G)
1  for each vertex u in G.V
2      u.color = WHITE
3      u.π = NIL
4  time = 0
5  for each vertex u in G.V
6      if u.color == WHITE
7          DFS-VISIT(G, u)
DFS-VISIT(G, u)
1  time = time + 1 // white vertex u has just been discovered
2  u.d = time
3  u.color = GRAY
4  for each v in G.adj[u]
5      if v.color == WHITE
6          v.π = u
7          DFS-VISIT(G, v)
8  u.color = BLACK // blacken u; it is finished
9  time = time + 1
10 u.f = time
If you didn't use decreasing finish times, then line 6 of DFS could be true only once, because DFS-VISIT would visit the entire graph recursively. This would produce a single tree in the depth-first forest, and each tree is reported as an SCC. There is a single tree because a tree is identified by its root node having a NIL predecessor.
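To make the effect of the ordering concrete, here is a minimal Python sketch. The figure for the example is not reproduced here, so the graph below (a -> b -> c -> a, plus c -> d) is an assumption that matches the description; the function names are made up for illustration and are not CLRS's.

```python
def dfs_finish_order(graph, vertices):
    """Return the vertices in increasing finish time of a DFS over all vertices."""
    visited, order = set(), []

    def visit(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                visit(v)
        order.append(u)  # u is finished once all its descendants are done

    for u in vertices:
        if u not in visited:
            visit(u)
    return order


def transpose(graph, vertices):
    """Return the graph with every edge reversed."""
    rev = {u: [] for u in vertices}
    for u in vertices:
        for v in graph[u]:
            rev[v].append(u)
    return rev


def dfs_forest(graph, start_order):
    """Run DFS taking roots in start_order; return one vertex set per DFS tree."""
    visited, trees = set(), []

    def visit(u, tree):
        visited.add(u)
        tree.add(u)
        for v in graph[u]:
            if v not in visited:
                visit(v, tree)

    for u in start_order:
        if u not in visited:
            tree = set()
            visit(u, tree)
            trees.append(tree)
    return trees


# Assumed example graph: a, b, c form a cycle and c has an edge to d.
vertices = ["a", "b", "c", "d"]
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
rev = transpose(graph, vertices)

finish = dfs_finish_order(graph, vertices)
correct = dfs_forest(rev, reversed(finish))    # roots in decreasing finish time
wrong = dfs_forest(rev, ["d", "a", "b", "c"])  # arbitrary order that starts at d
```

With decreasing finish times the second pass yields two trees, {a, b, c} and {d}; starting at d instead yields a single tree containing all four nodes.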

Related

Polynomial Time Algorithm for directed graph with path from 's' to 't'

My textbook for theory of computation has an example for explaining polynomial time algorithms:
PATH = {[G,s,t]|G is a directed graph that has a directed path from s to t}.
A polynomial time algorithm M for PATH operates as follows. M = “On input [G,s,t], where G is a directed graph with nodes s and t:
1. Place a mark on node s.
2. Repeat the following until no additional nodes are marked:
3. Scan all the edges of G. If an edge (a,b) is found going from a marked node a to an unmarked node b, mark node b.
4. If t is marked, accept. Otherwise, reject.”
Then they go on to explain how the algorithm runs in polynomial time:
Obviously, stages 1 and 4 are executed only once. Stage 3 runs at most m times because each time except the last it marks an additional node in G. Thus, the total number of stages used is at most 1 + 1 + m, giving a polynomial in the size of G.
*m is the number of nodes in the graph
My question is that wouldn't Stage 3 run at most m-1 times instead of m times, since the first node is marked in stage 1?
Thanks!

It runs at most m-1 times in which it marks an additional node other than s, plus one final time in which it finds no additional node to mark. That gives at most m runs in total.
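The count can be checked with a small Python sketch of the marking algorithm that also reports how many times the edge scan runs. The function name and the 0-based node numbering are assumptions for illustration. The worst case below is a path whose edges are listed in reverse, so each scan marks exactly one new node.

```python
def path_exists(num_nodes, edges, s, t):
    """Marking algorithm for PATH; returns (t is reachable, number of edge scans)."""
    marked = [False] * num_nodes
    marked[s] = True                      # place a mark on node s
    scans = 0
    while True:                           # repeat until no additional nodes are marked
        scans += 1
        marked_something = False
        for a, b in edges:                # scan all the edges of G
            if marked[a] and not marked[b]:
                marked[b] = True
                marked_something = True
        if not marked_something:
            break
    return marked[t], scans               # accept iff t is marked


# Path 0 -> 1 -> 2 -> 3 -> 4 (m = 5 nodes), edges listed in reverse order.
worst_case = path_exists(5, [(3, 4), (2, 3), (1, 2), (0, 1)], 0, 4)
# Same path with the edges listed in forward order.
best_case = path_exists(5, [(0, 1), (1, 2), (2, 3), (3, 4)], 0, 4)
```

In the worst case the scan runs m = 5 times: m-1 = 4 scans that each mark one node, and one last scan that marks nothing.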

Detecting connectedness of nodes over a long time in a graph

I start out with a graph of N nodes with no edges.
Then I proceed to take M predetermined steps.
At each step, I must either create an edge between two nodes, or delete an edge.
After each step, I must print out how many connected components there are in my graph.
Is there an algorithm for solving this problem in time linear with respect to M? If not, is there one better than O(min(M,N) * M) in the worst case?
EDIT:
The program does not get to decide what the M steps are.
I have to read from the input, whether I am supposed to create an edge or delete it, and also which edge I am supposed to create/delete.
So example input might be
N = 4
M = 4
JOIN 1 2
JOIN 2 3
DELETE 2 3
DELETE 1 2
Then my output should be
3 # (1 2) 3 4
2 # (1 2 3) 4
3 # (1 2) 3 4
4 # 1 2 3 4

There are ways to solve this problem fully online, but they're more complicated than this answer. The algorithm that I'm proposing is to maintain a spanning forest of the available edges, together with the number of components of the spanning forest (and hence the graph). If we were attacking this problem fully online, then this would be problematic, since a spanning forest edge might get deleted, leaving us to paw through the unused edges for a replacement. We know, however, how soon each edge currently in the graph will be deleted.
The particular spanning forest that we maintain is a maximum-weight spanning forest, where the weight of each edge is its deletion time. If an edge belonging to this spanning forest is deleted, then there is no replacement, since every other edge connecting the components represented by its endpoints either hasn't been inserted yet or, having lesser weight, has already been deleted.
There's a dynamic tree data structure, also referred to as a link/cut tree, due to Sleator and Tarjan, that can be made to provide the following operations in logarithmic time.
Link(u, v, w) - inserts an edge between u and v with weight w;
                u and v must not be connected already
Cut(u, v) - cuts the edge between u and v, if it exists;
            returns a boolean indicating whether an edge was removed
FindMin(u, v) - finds the minimum-weight edge on the path from u to v
                and returns its endpoints and weight;
                returns null if either u = v or u and v are not connected
To maintain the forest, when an edge from u to v is inserted, compare its removal time to the minimum on the path from u to v. If the minimum does not exist, then insert the edge. If the minimum is less than the new edge, delete the minimum and replace it with the new edge. Otherwise, do nothing. When an edge from u to v is deleted, attempt to delete it from the forest.
The running time of this approach is O(m log n). If you don't have a dynamic tree handy, then it admittedly will take quite a while to implement. Instead of using a proper dynamic tree, I've had success with a much simpler data structure that just stores the forest as a bunch of nodes with weights and parent pointers. The running time then is O(m d) where d is the maximum diameter of the graph, and if you're lucky, d is quite a lot less than n.
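As a rough illustration of the simpler variant, here is a Python sketch that keeps the maximum-weight spanning forest in plain adjacency dictionaries and finds the path between two nodes with an ordinary DFS, so each step costs O(n) rather than O(log n). This is not a link/cut tree. The function name, the input format (a list of ("JOIN" | "DELETE", u, v) tuples with 1-based nodes), and the assumptions that the input is well-formed (no self-loops, no DELETE of an absent edge, no JOIN of an edge that is already present) are mine.

```python
import math


def count_components_offline(n, steps):
    """Return the number of connected components after each step."""
    # Pass 1: weight of each inserted edge = index of the step that deletes it.
    delete_time = [math.inf] * len(steps)
    open_insert = {}
    for i, (op, u, v) in enumerate(steps):
        key = frozenset((u, v))
        if op == "JOIN":
            open_insert[key] = i
        else:
            delete_time[open_insert.pop(key)] = i

    forest = {u: {} for u in range(1, n + 1)}  # forest[u][v] = weight of edge u-v
    components = n
    result = []

    def find_path(src, dst):
        """Return the forest edges on the path src..dst, or None if not connected."""
        parent = {src: None}
        stack = [src]
        while stack:
            x = stack.pop()
            if x == dst:
                path = []
                while parent[x] is not None:
                    path.append((parent[x], x))
                    x = parent[x]
                return path
            for y in forest[x]:
                if y not in parent:
                    parent[y] = x
                    stack.append(y)
        return None

    # Pass 2: maintain the maximum-weight spanning forest.
    for i, (op, u, v) in enumerate(steps):
        if op == "JOIN":
            w = delete_time[i]
            path = find_path(u, v)
            if path is None:               # different components: link them
                forest[u][v] = forest[v][u] = w
                components -= 1
            elif path:                     # same component: maybe swap out the minimum
                a, b = min(path, key=lambda e: forest[e[0]][e[1]])
                if forest[a][b] < w:
                    del forest[a][b], forest[b][a]
                    forest[u][v] = forest[v][u] = w
        elif v in forest[u]:               # deleting a forest edge splits a component
            del forest[u][v], forest[v][u]
            components += 1
        result.append(components)
    return result


example = count_components_offline(
    4, [("JOIN", 1, 2), ("JOIN", 2, 3), ("DELETE", 2, 3), ("DELETE", 1, 2)]
)
```

On the example input from the question this produces 3, 2, 3, 4.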

Why do we need to run DFS on the reverse of a graph in Kosaraju's algorithm?

There's a famous algorithm for finding strongly connected components called Kosaraju's algorithm, which uses two DFS's to solve this problem, and runs in Θ(|V| + |E|) time.
First we run DFS on the reverse of the graph (GR) to compute the reverse postorder of the vertices, and then we run a second DFS on the main graph G, taking vertices in reverse postorder, to compute the strongly connected components.
Although I understand the mechanics of the algorithm, I'm not getting the intuition behind the need for the reverse postorder.
How does it help the second DFS to find the strongly connected components?

Suppose the result of the first DFS is:
----------v1--------------v2-----------
where "-" stands for any number of vertices, and all the vertices of a strongly connected component g appear between v1 and v2.
The post order of the DFS gives the following guarantees:
all vertices after v2 do not point to g in the reverse graph (that is to say, you cannot reach these vertices from g in the original graph)
all vertices before v1 cannot be pointed to from g in the reverse graph (that is to say, you cannot reach g from these vertices in the original graph)
In short, the first DFS ensures that in the second DFS, strongly connected components that are visited earlier cannot have any edge pointing to other unvisited strongly connected components.
Some Detailed Explanation
Let's simplify the graph as follows:
the whole graph is G
G contains two strongly connected components: one is g, the other is a single vertex v
there is only one edge between v and g, either from v to g or from g to v; call this edge e
g' and e' denote the reverses of g and e
The situations in which this algorithm could fail can be summarized as:
1. the DFS starts from v, and e' points from v to g'
2. the DFS starts from a vertex inside g', and e' points from g' to v
For situation 1
the original graph looks like g-->v, and the reversed graph looks like g'<--v.
To start the second DFS from v, the post order generated by the first DFS would need to be something like
g1, g2, g3, ..., v
but you can easily verify that neither starting the first DFS from v nor from g' can give you such a post order. So in this situation, the first DFS guarantees that the second DFS will not start from a vertex that is both outside a strongly connected component and points to it.
For situation 2
similarly, in situation 2, where the original graph is g<--v and the reversed one is g'-->v, it is guaranteed that v will be visited before any vertex in g'.

When you run DFS on a graph for the first time, for every node you visit you learn about all nodes that are reachable from that node (you have this information once the first DFS is finished).
Then, when you reverse all the edges and run the DFS once more, for every node you visit you learn about all nodes that can reach that node in the non-reversed graph (again, you have this information once the second DFS has finished).
Example: let's say your first DFS reaches node X. From that node "you can see" all the neighbours you can visit. (I hope this is understandable.) Then, let's say your second DFS reaches that node X, but this time all the edges are reversed. If from your node X "you can see" any other nodes, it means that before reversing the edges, node X was reachable from all the neighbours you see now. By calling the second DFS in the correct order, you get for every node X all the nodes that were reachable from X in both DFS trees (and so, for every node X you get the nodes that were both reachable from X and could reach X; those are the strongly connected components by definition).

Suppose the list L is the reverse post-order of the DFS visit of the nodes (that is, decreasing finish time). u->v indicates that there exists a forward path from u to v.
If u->v and not v->u, then u must appear to the left of v in L. The nodes in an SCC, such as v and w, however, may appear in any order on the list L.
So, if a node x appears strictly before y on the list L:
case1: x->y and y->x, like the case of v and w
case2: x->y and not y->x, like the case of u and v
case3: not x->y and not y->x
Kosaraju's algorithm iterates through L from left to right and runs DFS starting from each node on the transpose graph (where the direction of the edges is reversed). If some node is reachable by the DFS and does not yet belong to any SCC, then we add this node to the SCC of the current root.
In case 1, we will add y to the SCC of x. In case 3, y and x are in different SCCs.
Case 2 requires some special attention. At the time we call DFS from y, x is already in some other SCC, so we will not add x to the SCC of y. If instead you called the DFS starting from root y before the DFS starting from root x, then x would be added to the SCC of y, which is wrong.
In short, the first DFS places the nodes that can reach y but cannot be reached from y to the left of y. So the second DFS is able to avoid adding such nodes x to the SCC of y.

Using BFS or DFS to determine the connectivity in a non connected graph?

How can I design an algorithm using BFS or DFS to determine the connected components of a non-connected graph? The algorithm must be able to report the set of vertices of each connected component.
This is my approach:
1) Initialize all vertices as not visited.
2) Do a DFS traversal of graph starting from any arbitrary vertex v.
If DFS traversal doesn’t visit all vertices, then return false.
3) Reverse all arcs (or find transpose or reverse of graph)
4) Mark all vertices as not-visited in reversed graph.
5) Do a DFS traversal of reversed graph starting from same vertex v
(Same as step 2). If DFS traversal doesn’t visit all vertices, then
return false. Otherwise return true.
The idea is, if every node can be reached from a vertex v, and every
node can reach v, then the graph is strongly connected. In step 2, we
check if all vertices are reachable from v. In step 5, we check if all
vertices can reach v (in the reversed graph, if all vertices are reachable
from v, then all vertices can reach v in the original graph).
Any idea of how to improve this solution?

How about
let vertices = input
let results = empty list
while there are vertices in vertices:
    create a set S
    choose an arbitrary unexplored vertex, and put it in S.
    run BFS/DFS from that vertex, and with each vertex found, remove it from vertices and add it to S.
    add S to results
return results
When this completes, you'll have a list of sets of vertices, where each set was made from graph searching from some vertex (making the vertices in each set connected). Assuming an undirected graph, this should work OK (off the top of my head).
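A minimal Python sketch of the pseudocode above, using BFS and assuming an undirected graph given as a dict that maps every vertex to its neighbours (the function name and the sample graph are made up for illustration):

```python
from collections import deque


def connected_components(adjacency):
    """Return a list of vertex sets, one per connected component."""
    unexplored = set(adjacency)
    results = []
    while unexplored:
        start = unexplored.pop()          # arbitrary unexplored vertex
        component = {start}
        queue = deque([start])
        while queue:                      # BFS from start
            u = queue.popleft()
            for v in adjacency[u]:
                if v in unexplored:
                    unexplored.remove(v)
                    component.add(v)
                    queue.append(v)
        results.append(component)
    return results


# Sample undirected graph with three components: 0-1-2-3-4, 5, and 6-7-8.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (6, 7), (7, 8)]
adjacency = {v: [] for v in range(9)}
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)

components = sorted(sorted(c) for c in connected_components(adjacency))
```

Each vertex is removed from `unexplored` exactly once and each adjacency list is scanned once, so the running time is O(V + E).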

This can be done easily using either BFS or DFS in time complexity of O(V+E).
// this is the DFS solution
// assumes dfs(u) marks dfs_num[u] as visited, prints " u", and recurses into unvisited neighbours
numCC = 0;
dfs_num.assign(V, UNVISITED);    // sets all vertices' state to UNVISITED
for (int i = 0; i < V; i++)      // for each vertex i in [0..V-1]
    if (dfs_num[i] == UNVISITED) // if vertex i is not visited yet
        printf("CC %d:", ++numCC), dfs(i), printf("\n");
The output of the above code for 3 connected components would be something like:
// CC 1: 0 1 2 3 4
// CC 2: 5
// CC 3: 6 7 8

A standard approach for solving this problem is to run DFS starting from each node.
Start by labeling all nodes as unvisited. Then, iterate over the nodes in any order. For each node, if it's not already labeled as being in a connected component, run DFS from that node and mark all reachable nodes as being in the same CC. If the node was already marked, skip it. This then discovers all CC's of the graph one CC at a time.
Moreover, this is very efficient. If there are m edges and n nodes, the runtime is O(n) for the first step (marking all nodes as unvisited) and O(m + n) for the second, since each node and edge are visited at most twice. Thus the overall runtime is O(m + n).
Hope this helps!

Since you seem to be working with a directed graph, and you want to find the connected components (not the strongly connected ones), you have to convert your graph to an undirected graph first. So for each edge, add a temporary edge in the opposite direction. Then you can use a simple DFS starting from each vertex which hasn't been visited yet to find the connected components. Finally, you can remove the temporary edges.

Algorithm to check if directed graph is strongly connected

I need to check if a directed graph is strongly connected, or, in other words, if every node can be reached from every other node (not necessarily through a direct edge).
One way of doing this is to run a DFS or BFS from every node and check that all others are reachable.
Is there a better approach to do that?

Consider the following algorithm.
Start at a random vertex v of the graph G, and run a DFS(G, v).
If DFS(G, v) fails to reach every other vertex in the graph G, then there is some vertex u, such that there is no directed path from v to u, and thus G is not strongly connected.
If it does reach every vertex, then there is a directed path from v to every other vertex in the graph G.
Reverse the direction of all edges in the directed graph G.
Again run a DFS starting at v.
If the DFS fails to reach every vertex, then there is some vertex u, such that in the original graph there is no directed path from u to v.
On the other hand, if it does reach every vertex, then in the original graph there is a directed path from every vertex u to v.
Thus, if G "passes" both DFSs, it is strongly connected. Furthermore, since a DFS runs in O(n + m) time, this algorithm runs in O(2(n + m)) = O(n + m) time, since it requires 2 DFS traversals.
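The two-pass check described above can be sketched in Python as follows. The graph is assumed to be a dict that maps every vertex to a list of its out-neighbours, and the function names are illustrative only.

```python
def reachable_from(adjacency, start):
    """Return the set of vertices reachable from start (iterative DFS)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strongly_connected(adjacency):
    """Two traversals from one vertex: on G, then on G with all edges reversed."""
    vertices = list(adjacency)
    if not vertices:
        return True
    v = vertices[0]  # any start vertex works
    if len(reachable_from(adjacency, v)) != len(vertices):
        return False  # some vertex u has no directed path from v
    reversed_adj = {u: [] for u in vertices}
    for u in vertices:
        for w in adjacency[u]:
            reversed_adj[w].append(u)
    # Reaching everything in the reversed graph means everything reaches v in G.
    return len(reachable_from(reversed_adj, v)) == len(vertices)


cycle = is_strongly_connected({0: [1], 1: [2], 2: [0]})
chain = is_strongly_connected({0: [1], 1: [2], 2: []})
```

Here `cycle` is True and `chain` is False, since in the chain vertex 0 reaches everything but nothing can get back to it.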

Tarjan's strongly connected components algorithm (or Gabow's variation) will of course suffice; if there's only one strongly connected component, then the graph is strongly connected.
Both are linear time.
As with a normal depth first search, you track the status of each node: new, seen but still open (it's in the call stack), and seen and finished. In addition, you store the depth when you first reached a node, and the lowest such depth that is reachable from the node (you know this after you finish a node). A node is the root of a strongly connected component if the lowest reachable depth is equal to its own depth. This works even if the depth by which you reach a node from the root isn't the minimum possible.
To check just for whether the whole graph is a single SCC, initiate the dfs from any single node, and when you've finished, if the lowest reachable depth is 0, and every node was visited, then the whole graph is strongly connected.
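For reference, a compact recursive Python sketch of Tarjan's algorithm. It uses a discovery counter where the text above says "depth", and the names `index`, `low`, and `tarjan_scc` are my own choices. Being recursive, it is only suitable for graphs small enough to stay within Python's recursion limit.

```python
def tarjan_scc(adjacency):
    """Return the strongly connected components as a list of vertex sets."""
    index = {}        # discovery number of each visited vertex
    low = {}          # lowest discovery number reachable via open vertices
    stack = []        # vertices seen but not yet assigned to a component
    on_stack = set()
    components = []
    counter = 0

    def visit(u):
        nonlocal counter
        index[u] = low[u] = counter
        counter += 1
        stack.append(u)
        on_stack.add(u)
        for v in adjacency[u]:
            if v not in index:            # new vertex: recurse, then pull up its low
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:           # seen and still open: back/cross edge in SCC
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:            # u is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.remove(w)
                component.add(w)
                if w == u:
                    break
            components.append(component)

    for u in adjacency:
        if u not in index:
            visit(u)
    return components


def is_strongly_connected(adjacency):
    """The graph is strongly connected iff there is at most one component."""
    return len(tarjan_scc(adjacency)) <= 1


sccs = tarjan_scc({"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []})
```

For the sample graph, `sccs` contains {d} and {a, b, c}, so that graph is not strongly connected.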

To check if every node has both paths to and from every other node in a given graph:
1. DFS/BFS from all nodes:
Tarjan's algorithm supposes every node has a depth d[i]. Initially, the root has the smallest depth. And we do the post-order DFS updates d[i] = min(d[j]) for any neighbor j of i. Actually BFS also works fine with the reduction rule d[i] = min(d[j]) here.
function dfs(i)
    d[i] = i
    mark i as visited
    for each neighbor j of i:
        if j is not visited then dfs(j)
        d[i] = min(d[i], d[j])
If there is a forwarding path from u to v, then d[u] <= d[v]. In the SCC, d[v] <= d[u] <= d[v], thus, all the nodes in SCC will have the same depth. To tell if a graph is a SCC, we check whether all nodes have the same d[i].
2. Two DFS/BFS from the single node:
It is a simplified version of the Kosaraju’s algorithm. Starting from the root, we check if every node can be reached by DFS/BFS. Then, reverse the direction of every edge. We check if every node can be reached from the same root again. See C++ code.

You can calculate the All-Pairs Shortest Path and see if any is infinite.

Tarjan's Algorithm has been already mentioned. But I usually find Kosaraju's Algorithm easier to follow even though it needs two traversals of the graph. IIRC, it is also pretty well explained in CLRS.

test-connected(G)
{
    choose a vertex x
    make a list L of vertices reachable from x,
    and another list K of vertices to be explored.
    initially, L = K = x.
    while K is nonempty
        find and remove some vertex y in K
        for each edge (y, z)
            if (z is not in L)
                add z to both L and K
    if L has fewer than n items
        return disconnected
    else return connected
}

You can use Kosaraju’s DFS based simple algorithm that does two DFS traversals of graph:
The idea is, if every node can be reached from a vertex v, and every node can reach v, then the graph is strongly connected.
In step 2 of the algorithm, we check if all vertices are reachable from v. In step 5, we check if all vertices can reach v (in the reversed graph, if all vertices are reachable from v, then all vertices can reach v in the original graph).
Algorithm :
1) Initialize all vertices as not visited.
2) Do a DFS traversal of graph starting from any arbitrary vertex v. If DFS traversal doesn’t visit all vertices, then return false.
3) Reverse all arcs (or find transpose or reverse of graph)
4) Mark all vertices as not-visited in reversed graph.
5) Do a DFS traversal of reversed graph starting from same vertex v (Same as step 2). If DFS traversal doesn’t visit all vertices, then return false. Otherwise return true.
Time Complexity: The time complexity of the above algorithm is the same as that of Depth First Search, which is O(V+E) if the graph is represented using an adjacency list.

One way of doing this would be to generate the Laplacian matrix for the graph, then calculate the eigenvalues, and finally count the number of zeros. The graph is strongly connected if there exists only one zero eigenvalue.
Note: Pay attention to the slightly different method for creating the Laplacian matrix for directed graphs.

The algorithm to check if a graph is strongly connected is quite straightforward. But why does the below algorithm work?
Algorithm: suppose there is a graph with vertices [A, B, C, ..., Z]
1. Choose any random node, say J, and perform DFS from it. If all the nodes are reachable then continue to step 2.
2. Reverse the directions of the edges of the graph by taking the transpose.
3. Again run DFS from node J and check if all the nodes are visited. If yes then the graph is strongly connected; return true.
Performing step 1 makes sense because we have to check if we can reach all the nodes from that node. After this, next logical step could be
i) Now do this for all other nodes
ii) or try to reach node J from every other node. Because once you reach node J, you are sure that you can reach every other node because of step 1.
This is what we are trying to do in steps 2 & 3. If in a transposed graph node J is able to reach all other nodes then this implies that in original graph all other nodes can reach J.
