A visited array is an array in which we keep a record of whether each node has been visited or not.
Tree traversal is easy, provided the tree is well-formed: there's exactly one path to each node, so you never have to worry about repeating work or getting stuck in an endless loop.
But for generalised graph traversal, where the graph may contain cycles, many naive algorithms will loop endlessly when they encounter one. To guarantee that such algorithms terminate, we usually keep track of which nodes have been visited during the entire traversal and never re-explore a node we've already visited.
That's the purpose of the visited collection (whether it's an array, a set, etc. is irrelevant). An alternative is for each node to carry its own visited flag, which is cleared before traversal and set during traversal. That avoids the need for a shared collection but imposes its own limitation: only a single traversal can occur at a time.
The visited collection doesn't need to be "global"[1] but does need to be in a common scope shared by all parts of each traversal.
[1] If it is "global" then, in whatever scope that has meaning, again only a single traversal at a time is possible.
Why is the visited array initialized globally?
Since the array is used for keeping track of the entire graph, it is better to have a global/class level initialization.
Otherwise, in a method level initialization you would need to pass the tracking information (aka the visited[] array) by reference or make a new copy of it for every call to explore a node.
Further, if:
you were tracking something local to the current node, OR
the algorithm's implementation was not recursive,
then a method-level (local) initialization would work too.
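To make the scoping point concrete, here's a minimal OCaml sketch (the adjacency-list-array representation is my assumption). The visited array is created once per traversal and shared by all recursive calls of that traversal, but nothing lives at module ("global") level, so independent traversals don't interfere:

(* depth-first search with a per-traversal visited array *)
let dfs (adj : int list array) (start : int) : int list =
  let visited = Array.make (Array.length adj) false in
  let rec explore acc n =
    if visited.(n) then acc
    else begin
      visited.(n) <- true;                       (* mark on first visit *)
      List.fold_left explore (n :: acc) adj.(n)  (* then explore neighbours *)
    end
  in
  List.rev (explore [] start)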
I have a graph that is implemented as a list of edges connecting arbitrary nodes, with the data types defined below.
type edge = int * int;;
type graph = edge list;;
How would I perform a purely functional depth-first search while avoiding getting stuck on a cycle? I am not quite sure of how to keep track of all the nodes visited while remaining purely functional. The answer is probably something trivial that I am not conceptually grasping for some reason.
The search function has a parameter that tracks visited nodes. In FP one of the insights is that you can keep calling deeper and deeper (with tail calls). So you can pass the parameter along through all the calls, adding new nodes as you go.
Another parameter could be the nodes you plan to visit later. For DFS this would work like a stack.
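A minimal sketch of that idea using the types above (Set.Make (Int) and List.filter_map assume OCaml 4.08 or later). The visited set is threaded through every call, and the third argument is the stack of nodes to visit later:

module IntSet = Set.Make (Int)

(* successors of n in an edge-list graph *)
let neighbours (g : graph) (n : int) : int list =
  List.filter_map (fun (a, b) -> if a = n then Some b else None) g

(* purely functional DFS; returns nodes in visit order *)
let dfs (g : graph) (start : int) : int list =
  let rec go visited acc = function
    | [] -> List.rev acc
    | n :: rest ->
        if IntSet.mem n visited then go visited acc rest
        else go (IntSet.add n visited) (n :: acc) (neighbours g n @ rest)
  in
  go IntSet.empty [] [start]

Because go only ever calls itself in tail position, the "deeper and deeper" calls never grow the OCaml stack.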
I implemented the first phase of the highest-label push-relabel algorithm for maximum flow, but I could not find any resources about how to implement the second phase, i.e. how to convert the preflow into a valid maximum flow.
If you really need the max flow (it's possible to derive a min cut directly from the preflow and use it to verify the preflow), then I know of two approaches.
The first approach is covered in the original Goldberg–Tarjan paper on the push relabel algorithm. In essence, the second phase is implemented almost exactly as the first. The only difference is that the source is held at distance n (instead of the sink, at distance 0). This has the effect of routing the excesses back to the source.
I'm not sure where the second approach is described. I know that it's in Goldberg's implementation, on which the Boost Graph implementation is based (see convert_preflow_to_flow). Conceptually, there are three steps.
1. Until the preflow is acyclic, cancel a flow cycle by sending enough flow on the reverse cycle to remove one of the arcs from the flow graph.
2. Topologically sort the nodes of the flow graph from sinkmost to sourcemost.
3. For each node in topological order, eliminate its excess by decreasing the flow on incoming arcs (which effects a corresponding increase in the excess of nodes yet to be processed).
Practically speaking, Steps 1 and 2 both involve depth-first searches. Naively, one would restart the depth-first search for cycle-detection after each cycle is detected and canceled, but it's possible to rewind the depth-first search just to the point where it first used an arc that canceling removed, saving the time to get to that point in the search again. The topological order can be obtained as a byproduct of the search, saving a separate traversal for Step 2.
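To make Step 3 concrete, here's a hedged OCaml sketch of that step alone (the representation is entirely my assumption, not from either paper: excess as an array, flow as a hashtable keyed by arc, incoming v listing the arcs currently carrying flow into v, and order being the sinkmost-to-sourcemost order from Step 2):

(* hypothetical sketch of Step 3: drain each node's excess by
   reducing flow on its incoming arcs, in topological order;
   each reduction increases the excess of the upstream node,
   which is processed later in the order *)
let drain_excess order incoming (flow : (int * int, int) Hashtbl.t)
    (excess : int array) =
  List.iter
    (fun v ->
      List.iter
        (fun u ->
          if excess.(v) > 0 then begin
            let f = Hashtbl.find flow (u, v) in
            let d = min f excess.(v) in
            Hashtbl.replace flow (u, v) (f - d);
            excess.(v) <- excess.(v) - d;
            excess.(u) <- excess.(u) + d
          end)
        (incoming v))
    order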
Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the graph to produce a matrix with a bit at the intersection of any V1 and V2 indicating whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it moves the traversal out of the query itself: it's basically an eager precomputation that assumes you'll eventually use enough of the combinations that it's easiest to traverse them all up front and store the results. Depending on the size of the graph, the pre-processing step may be slow, but once it's done a query becomes quite fast (constant time, and usually a pretty small constant at that).
Depth-first search or breadth-first search. Stop when you find one. But there's no way to tell there's no path without going through every one, no. You can sometimes improve performance with heuristics if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and you know that most of the time there's going to be a fairly direct path, you can have the search favour edges that aim towards the target. However, imagine the case where the start and end points are right next to each other, but with no direct connection between them, so that the path has to go way out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited
While there are nodes in the queue:
    Take the next node from the queue
    If the node is V2, return true
    If the node is already visited, skip it
    Otherwise, mark the node as visited
    For every node at the end of an outgoing edge which is not yet visited:
        Add this node to the queue
    End for
End while
Return false
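A runnable OCaml version of that pseudocode, assuming (my assumption, not the question's) that the graph is an adjacency-list array indexed by node:

(* breadth-first reachability: is v2 reachable from v1? *)
let reachable (adj : int list array) (v1 : int) (v2 : int) : bool =
  let visited = Array.make (Array.length adj) false in
  let q = Queue.create () in
  Queue.add v1 q;
  let rec loop () =
    if Queue.is_empty q then false           (* queue exhausted: no path *)
    else
      let u = Queue.pop q in
      if u = v2 then true
      else if visited.(u) then loop ()       (* already expanded: skip *)
      else begin
        visited.(u) <- true;
        List.iter (fun w -> if not visited.(w) then Queue.add w q) adj.(u);
        loop ()
      end
  in
  loop ()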
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
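A hedged OCaml sketch of the matrix-power idea, assuming m is the n x n adjacency matrix as an int array array. As written it is for illustration rather than speed (each multiplication is O(n^3), and up to n of them may be needed):

(* naive integer matrix multiplication *)
let mat_mul n a b =
  Array.init n (fun i ->
      Array.init n (fun j ->
          let s = ref 0 in
          for k = 0 to n - 1 do
            s := !s + (a.(i).(k) * b.(k).(j))
          done;
          !s))

(* check M^1 .. M^n for a nonzero (u, v) entry *)
let connected n m u v =
  let rec go p k =
    if p.(u).(v) > 0 then true      (* a path of length k exists *)
    else if k >= n then false       (* all powers up to M^n checked *)
    else go (mat_mul n p m) (k + 1)
  in
  go m 1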
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't be sure without traversing at least part of the graph: you can stop as soon as you find a path, but the failure case (no path exists) cannot be confirmed without, in the worst case, traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's algorithm, which builds an O(n log n)-space data structure in O(n log n) time and answers queries in O(1) time.
Another approach to this problem lets you ignore the vertices themselves. By looking only at the edges, you can produce a transitive closure array that shows you every vertex reachable from any other vertex.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start locations as the rows and end locations as the columns. Fill the array with 0. For each edge in the list of edges, place a 1 at the (start, end) coordinate of that edge.
Now iterate until either (V1, V2) is 1 or there are no more changes.
For each row N:
    NextRowN = RowN
    For each column C that is 1 in RowN:
        NextRowN = NextRowN OR RowC
    Set RowN to NextRowN
If you run this algorithm until it stabilises, you will have a complete list of all reachable vertices without ever examining the vertices themselves; each pass does work proportional to the number of edges recorded in the matrix. This would work well with a reasonable implementation and a reasonable number of edges.
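A hedged OCaml sketch of this iteration over a boolean adjacency matrix (reach.(i).(j) is true when the edge, or a discovered connection, i -> j exists); it loops until a full pass makes no changes:

(* expand reach in place to its transitive closure *)
let transitive_closure (reach : bool array array) =
  let n = Array.length reach in
  let changed = ref true in
  while !changed do
    changed := false;
    for i = 0 to n - 1 do
      for j = 0 to n - 1 do
        if reach.(i).(j) then
          (* column j is set in row i, so OR row j into row i *)
          for k = 0 to n - 1 do
            if reach.(j).(k) && not reach.(i).(k) then begin
              reach.(i).(k) <- true;
              changed := true
            end
          done
      done
    done
  done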
A slightly more complex version of this algorithm would calculate only the vertices reachable from V1. To do this, you would restrict your attention to the rows of vertices that are currently reachable at any given time. You can also OR each row in only once, since the rows themselves never change.
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth first or breadth first search that stops when it encounters a vertex that it has already seen. The vertex will be processed on the first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.
I have a list of items (blue nodes below) which are categorized by the users of my application. The categories themselves can be grouped and categorized themselves.
The resulting structure can be represented as a Directed Acyclic Graph (DAG) where the items are sinks at the bottom of the graph's topology and the top categories are sources. Note that while some of the categories might be well defined, a lot is going to be user defined and might be very messy.
Example: [example DAG image omitted; source: theuprightape.net]
On that structure, I want to perform the following operations:
find all items (sinks) below a particular node (all items in Europe)
find all paths (if any) that pass through all of a set of n nodes (all items sent via SMTP from example.com)
find all nodes that lie below all of a set of nodes (intersection: goyish brown foods)
The first seems quite straightforward: start at the node, follow all possible paths to the bottom and collect the items there. However, is there a faster approach? Remembering the nodes I already passed through probably helps avoiding unnecessary repetition, but are there more optimizations?
How do I go about the second one? It seems that the first step would be to determine the height of each node in the set, as to determine at which one(s) to start and then find all paths below that which include the rest of the set. But is this the best (or even a good) approach?
The graph traversal algorithms listed at Wikipedia all seem to be concerned with either finding a particular node or finding the shortest or otherwise most efficient route between two nodes. Neither is quite what I want, or did I just fail to see how they apply to my problem? Where else should I read?
It seems to me that it's essentially the same operation for all 3 questions. You're always asking "Find all X below node(s) Y, where X is of type Z". All you need is a generic mechanism for 'locate all nodes below node' (solves Q3), and then you can filter the results for 'nodetype=sink' (solves Q1). For Q2, you have the starting point (your node set) and your ending point (any sink below the starting point), so your solution set is all paths from the specified starting nodes to the sinks. So I would suggest that what you basically have is a tree, and basic tree-traversal algorithms would be the way to go.
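For the generic 'locate all nodes below a node' mechanism, a hedged OCaml sketch, assuming the DAG is given as a children function from a node to its successors; the Hashtbl provides the "remember the nodes I already passed through" optimisation the question asks about:

(* collect all sinks (nodes with no outgoing edges) below start *)
let sinks_below children start =
  let visited = Hashtbl.create 16 in
  let rec go acc n =
    if Hashtbl.mem visited n then acc      (* already explored this node *)
    else begin
      Hashtbl.add visited n ();
      match children n with
      | [] -> n :: acc                     (* no successors: a sink *)
      | cs -> List.fold_left go acc cs
    end
  in
  go [] start

Collecting every node instead of only the childless ones gives the 'all nodes below' variant for Q3.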
Despite the fact that your graph is acyclic, the operations you cite remind me of similar aspects of control flow graph analysis. There is a rich set of algorithms based on dominance that may be applicable. For example, your third operation reminds me of computing dominance frontiers; I believe that algorithm would work directly if you temporarily introduce "entry" and "exit" nodes. The entry node connects the "given set of nodes" and the exit node connects the sinks.
Also see Robert Tarjan's basic algorithms.