Here is the problem:
assuming two persons are registered in a social networking website, how to decide whether they are connected or not?
My analysis (after reading more): the question is really asking for the shortest path from A to B in a graph. I think both BFS and Dijkstra's algorithm work here, and because the graph is unweighted the time complexity is the same, O(V+E): the priority queue gives no advantage, so Dijkstra's algorithm reduces to BFS with a simple queue. But as far as I can tell, neither of them solves the problem of finding the path between the two people.
Bidirectional search should be a better solution at this point.
To find a path between the two, you should begin with a breadth first search. First find all neighbors of A, then find all neighbors of all neighbors of A, etc. Once B is hit, not only do you have a path from A to B, but you also have a shortest such path.
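The breadth-first search described above can be sketched as follows. This is a minimal illustration, assuming the network fits in memory as a dict mapping each person to a list of friends; the names `shortest_path` and `FRIENDS` are made up for the example.

```python
from collections import deque

def shortest_path(friends, a, b):
    """Return a shortest list of people from a to b, or None if unconnected."""
    parent = {a: None}            # also serves as the visited set
    queue = deque([a])
    while queue:
        person = queue.popleft()
        if person == b:
            # Walk the parent links back from b to a.
            path = []
            while person is not None:
                path.append(person)
                person = parent[person]
            return path[::-1]
        for friend in friends.get(person, ()):
            if friend not in parent:
                parent[friend] = person
                queue.append(friend)
    return None

# Hypothetical friendship graph (undirected, so each link is listed both ways).
FRIENDS = {
    "A": ["C", "D"],
    "C": ["A", "E"],
    "D": ["A", "E"],
    "E": ["C", "D", "B"],
    "B": ["E"],
    "Z": [],
}

print(shortest_path(FRIENDS, "A", "B"))
print(shortest_path(FRIENDS, "A", "Z"))
```

Because neighbors are explored level by level, the first time B is dequeued the recorded parent chain is a shortest path.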
Dijkstra's algorithm rocks, and you may be able to speed this up by working from both ends, i.e. find the neighbors of A and the neighbors of B, and compare.
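Working from both ends might look like the sketch below. It only answers whether A and B are connected (it does not return the path), and it assumes an undirected graph stored as a dict of friend lists; `connected_bidirectional` and `FRIENDS` are hypothetical names.

```python
def connected_bidirectional(friends, a, b):
    """Return True if a and b are linked, growing a search from each end."""
    if a == b:
        return True
    seen_a, seen_b = {a}, {b}
    frontier_a, frontier_b = {a}, {b}
    while frontier_a and frontier_b:
        # Always expand the smaller frontier to keep the work balanced.
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            seen_a, seen_b = seen_b, seen_a
        next_frontier = set()
        for person in frontier_a:
            for friend in friends.get(person, ()):
                if friend in seen_b:
                    return True          # the two searches have met
                if friend not in seen_a:
                    seen_a.add(friend)
                    next_frontier.add(friend)
        frontier_a = next_frontier
    # One side ran out of people without meeting the other.
    return False

# Hypothetical undirected friendship graph.
FRIENDS = {
    "A": ["C", "D"],
    "C": ["A", "E"],
    "D": ["A", "E"],
    "E": ["C", "D", "B"],
    "B": ["E"],
    "Y": ["Z"],
    "Z": ["Y"],
}

print(connected_bidirectional(FRIENDS, "A", "B"))
print(connected_bidirectional(FRIENDS, "A", "Z"))
```

Each side only has to reach about half the distance, which is where the speed-up comes from when every person has many friends.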
If you do a depth first search, then you're following one path at a time. This will be much much slower.
If you do dfs for finding whether two people are connected on a social network, then it will take too long!
You already know the two persons, so you should use bidirectional search. But simple bidirectional search won't be enough for a graph as big as a social networking site. You will have to use some heuristics. The Wikipedia page has some links on this.
You may also be able to use A* search. From Wikipedia: "A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals)."
Edit: I suggest A* because "The additional complexity of performing a bidirectional search means that the A* search algorithm is often a better choice if we have a reasonable heuristic." So, if you can't form a reasonable heuristic, then use Bidirectional search. (Forming a good heuristic is never easy ;).)
One way is to use Union-Find: call union(from, to) for every link, and then A and B are connected exactly when find(A) == find(B). This avoids the recursive search, but it effectively computes the connectivity of all pairs, and it doesn't give you the path that connects A and B.
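A minimal Union-Find sketch along those lines, with path compression. The class name `UnionFind` and the sample links are assumptions made for the example.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of x's group, creating it if needed."""
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the way directly at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        """Merge the groups containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

# Hypothetical list of friendship links.
links = [("A", "C"), ("C", "B"), ("X", "Y")]
uf = UnionFind()
for person_from, person_to in links:
    uf.union(person_from, person_to)

print(uf.find("A") == uf.find("B"))
print(uf.find("A") == uf.find("X"))
```

New links can be added later with further `union` calls, so the structure also suits a graph that keeps growing.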
I think that the true criterion is: there are at least N paths between A and B shorter than K, or A and B are connected directly. I would go with K = 3 and N near 5, i.e. have 5 common friends.
Any method might end up being very slow. If you need to do this repeatedly, it's best to find the connected components of the graph, after which the task becomes a trivial O(1) operation: if two people are in the same component, they are connected.
Note that finding connected components for the first time might be slow, but keeping them updated as new edges/nodes are added to the graph is fast.
There are several methods for finding connected components.
One method is to construct the Laplacian of the graph and look at its eigenvalues and eigenvectors. The number of zero eigenvalues gives you the number of connected components. With a suitable choice of basis for the corresponding eigenvectors, the non-zero elements of each one give the nodes belonging to the respective component.
Another way is along the following lines:
Create a transformation table of nodes. Element n of the array contains the index of the node that node n transforms to.
Loop through all edges (i,j) in the graph (denoting a connection between i and j):
Compute recursively which nodes i and j transform to based on the current table. Let us denote the results by k and l. Update entry k to make it transform to l. Update entries i and j to point to l as well.
Loop through the table again, and update each entry to point directly to the node it recursively transforms to.
Now nodes in the same connected component will have the same entry in the transformation table. So to check if two nodes are connected, just check if they transform to the same value.
Every time a new node or edge is added to the graph, the transformation table needs to be updated, but this update will be much faster than the original calculation of the table.
I want to find the connected components in an undirected graph. However, I don't have an adjacency matrix. Instead I have a set of vertices as well as a function telling me whether two vertices are adjacent. What is the most efficient way to find all connected components?
I know I could just calculate the entire adjacency matrix and use depth-first search to find all components. But that would not be very efficient as I'd need to check every pair of vertices.
What I'm currently doing is the following procedure:
Pick any unassigned vertex which is now its own component
Find all neighbors of that vertex and add them to the component
Find all neighbors of the just added vertices (amongst those vertices not yet assigned to any component) and add them too
Repeat previous step until no new neighbors can be found
The component is now complete, repeat from the first step to find other components until all vertices are assigned
This is the pseudocode:
connected_components(vertices):
    // vertices belonging to the current component and whose neighbors have not yet been added
    vertices_to_check = [vertices.pop()]
    // vertices belonging to the current component
    current_component = []
    components = []
    while (vertices.notEmpty() or vertices_to_check.notEmpty()):
        if (vertices_to_check.isEmpty()): // All vertices in the current component have been found
            components.add(current_component)
            current_component = []
            vertices_to_check = [vertices.pop()]
        next_vertex = vertices_to_check.pop()
        current_component.add(next_vertex)
        for vertex in vertices: // Find all neighbors of next_vertex
            if (vertices_adjacent(vertex, next_vertex)):
                vertices.remove(vertex)
                vertices_to_check.add(vertex)
    components.add(current_component)
    return components
I understand that this method is faster than calculating the adjacency matrix in most cases, as I don't need to check whether two vertices are adjacent, if it is already known they belong to the same component. But is there a way to improve this algorithm?
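For reference, a runnable Python sketch of the same procedure. It assumes the adjacency test is passed in as a function (here called `vertices_adjacent`, as in the pseudocode) and collects the neighbors into a separate set before removing them, so the set of unassigned vertices is never modified while it is being iterated.

```python
def connected_components(vertices, vertices_adjacent):
    """Group vertices into components using only a pairwise adjacency test."""
    unassigned = set(vertices)
    components = []
    while unassigned:
        # Start a new component from any unassigned vertex.
        to_check = [unassigned.pop()]
        component = []
        while to_check:
            current = to_check.pop()
            component.append(current)
            # Only vertices not yet assigned to a component are tested.
            neighbors = {v for v in unassigned if vertices_adjacent(v, current)}
            unassigned -= neighbors
            to_check.extend(neighbors)
        components.append(component)
    return components

# Hypothetical example: integers are adjacent when they differ by exactly 2.
def differ_by_two(u, v):
    return abs(u - v) == 2

print(connected_components(range(6), differ_by_two))
```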
Ultimately, any algorithm will have to call vertices_adjacent for every single pair of vertices that turn out to belong to separate components, because otherwise it will never be able to verify that there's no link between those components.
Now, if a majority of vertices all belong to a single component, then there may not be too many such pairs; but unless you expect a majority of vertices to all belong to a single component, there's little point optimizing specifically for that case. So, discarding that case, the very best-case scenario is:
There turn out to be exactly two components, each with the same number of vertices (½|V| each). So there are ¼|V|² pairs of vertices that belong to separate components, and you need to call vertices_adjacent for each of those pairs.
These two components turn out to be complete, or you turn out to be exceptionally lucky in your choice of edges to check for first, such that you can detect the connected parts by checking just |V| − 2 pairs.
. . . which still involves making ¼|V|² + |V| − 2 calls to vertices_adjacent. By comparison, the build-an-adjacency-list approach makes ½|V|² − ½|V| calls, which is more than the best-case scenario, but by a factor of less than 2. (And the worst-case scenario is simply equivalent to the build-an-adjacency-list approach. That would happen if no component contains more than two vertices, or if the graph is acyclic and you get unlucky in your choice of edges to check for first. Most graphs will be somewhere in between.)
So it's probably not worth trying to optimize too closely for the exact minimum number of calls to vertices_adjacent.
That said, your approach seems pretty reasonable to me; it doesn't make any calls to vertices_adjacent that are clearly unnecessary, so the only improvement would be a probabilistic one, if it could do a better job guessing which calls will turn out to be useful for eliminating later calls.
One possibility: in many graphs, there are some vertices that have a lot of neighbors and some vertices that have relatively few, according to a power-law distribution. So if you prioritize vertices based on how many neighbors they're already known to have, you may be able to take advantage of that pattern. (I think this is especially likely to be useful if the majority of vertices really do all belong to a single component, which is the only case where a better-than-factor-of-2 improvement is even conceivable.) But, you'll have to test to see if it actually makes a difference for the graphs you're interested in.
There are many variants of this question asking the solution in O(|V|) time.
But what is the worst-case bound if I want to compute whether there is a universal sink in the graph and the graph is represented as adjacency lists? This is important because all the other algorithms seem to work better with adjacency lists, so if finding a universal sink is not an operation I need too frequently, I will definitely go with lists rather than a matrix.
In my opinion, the time complexity would be the size of the graph, that is, O(|V| + |E|). The algorithm for finding a universal sink of a graph is as follows. Assuming in-neighbor lists, start from index 1 of the graph. Check the length of the adjacency list at index 1; if it is |V| - 1, traverse the list to check whether there is a self-loop. If the list has no self-loop and all other vertices are part of the list, store the list index. Then we must go through the other lists to check whether this vertex is part of any of them. If it is, the stored vertex cannot be a universal sink, and we continue the search from the next index. Even if the lists are out-neighbor lists, we have to find the vertices whose list has length 0, then search all the other lists to check whether this vertex appears in each of them.
As it can be concluded from above explanation, no matter what form of adjacency list is considered, in worst case, finding the universal sink must traverse through all the vertices and edges once, hence the complexity is the size of the graph, i.e. O(|V|+|E|)
But my friend, who has recently joined a university as an assistant professor, mentioned it has to be O(|V|*|V|). I am reviewing his notes before he starts teaching the course in the spring, but before correcting them I want to be one hundred percent sure.
You're quite correct. We can build the structures we need to track all of the intermediate results, but the basic complexity is still straightforward: we go through all of our edges once, marking and counting references. We can even build a full transition matrix in O(E) time.
Depending on the data structures, we may find an improvement by a second pass over all edges, but 2 * O(E) is still O(E).
Then we traverse each node once, looking for in/out counts and a self-loop.
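A minimal sketch of the O(|V| + |E|) check on out-neighbor lists. It assumes the usual definition of a universal sink (out-degree 0 and an incoming edge from every other vertex) and a hypothetical dict `out_neighbors` that has a key for every vertex.

```python
def universal_sink(out_neighbors):
    """Return the universal sink of a directed graph, or None if there is none."""
    n = len(out_neighbors)
    # A universal sink has no outgoing edges. If two vertices have none,
    # neither points at the other, so neither can be a universal sink.
    candidates = [v for v, outs in out_neighbors.items() if not outs]
    if len(candidates) != 1:
        return None
    sink = candidates[0]
    # One pass over every adjacency list: which vertices point at the candidate?
    sources = {u for u, outs in out_neighbors.items() if sink in outs}
    return sink if len(sources) == n - 1 else None

print(universal_sink({1: [2, 3], 2: [3], 3: []}))
print(universal_sink({1: [3], 2: [1], 3: []}))
```

Finding the candidates touches each vertex once, and the membership scan touches each edge once, which matches the size-of-the-graph bound argued above.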
I've been tasked to write an implementation of the A* algorithm (heuristics provided) that will solve the travelling salesman problem. I understand the algorithm, it's simple enough, but I just can't see the code that implements it. I mean, I get it. Priority queue for the nodes, sorted by distance + heuristic(node), add the closest node on to the path. The question is, like, what happens if the closest node can't be reached from the previous closest node? How does one actually take a "graph" as a function argument? I just can't see how the algorithm actually functions, as code.
I read the Wikipedia page before posting the question. Repeatedly. It doesn't really answer the question- searching the graph is way, way different to solving the TSP. For example, you could construct a graph where the shortest node at any given time always results in a backtrack, since two paths of the same length aren't equal, whereas if you're just trying to go from A to B then two paths of the same length are equal.
You could derive a graph by which some nodes are never reached by always going closest first.
I don't really see how A* applies to the TSP. I mean, finding a route from A to B, sure, I get that. But the TSP? I don't see the connection.
I found a solution here
Use minimum spanning tree as a heuristic.
Set
Initial State: Agent in the start city and has not visited any other city
Goal State: Agent has visited all the cities and reached the start city again
Successor Function: Generates all cities that have not yet visited
Edge-cost: distance between the cities represented by the nodes, use this cost to calculate g(n).
h(n): distance to the nearest unvisited city from the current city + estimated distance to travel all the unvisited cities (MST heuristic used here) + nearest distance from an unvisited city to the start city. Note that this is an admissible heuristic function.
You may consider maintaining a list of visited cities and a list of unvisited cities to facilitate computations.
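One way this formulation might be coded is sketched below. It assumes a symmetric distance table `dist` given as a dict of dicts with zeros on the diagonal; the function names and the sample `DIST` table are made up for the example. Each search state is (current city, set of visited cities, tour closed?), and a state is pushed again whenever a cheaper route to it is found, so the heuristic only needs to be admissible.

```python
import heapq
from itertools import count

def mst_weight(nodes, dist):
    """Prim's algorithm: total weight of a minimum spanning tree over nodes."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0
    best = {v: dist[nodes[0]][v] for v in nodes[1:]}
    total = 0
    while best:
        v = min(best, key=best.get)
        total += best.pop(v)
        for w in best:
            best[w] = min(best[w], dist[v][w])
    return total

def tsp_astar(dist, start):
    """Return (cost, tour) for a cheapest tour that starts and ends at start."""
    cities = frozenset(dist)

    def h(city, visited):
        unvisited = cities - visited
        if not unvisited:
            return dist[city][start]          # only the trip home remains
        return (min(dist[city][u] for u in unvisited)
                + mst_weight(unvisited, dist)
                + min(dist[u][start] for u in unvisited))

    tie = count()                             # tie-breaker for equal f values
    first = (start, frozenset([start]), False)
    best_g = {first: 0}
    heap = [(h(start, first[1]), next(tie), 0, first, [start])]
    while heap:
        _, _, g, state, path = heapq.heappop(heap)
        city, visited, closed = state
        if closed:
            return g, path                    # goal: all visited, back at start
        if g > best_g.get(state, float("inf")):
            continue                          # stale queue entry
        if visited == cities:
            moves = [(start, visited, True)]
        else:
            moves = [(c, visited | {c}, False) for c in cities - visited]
        for nxt in moves:
            new_g = g + dist[city][nxt[0]]
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                new_h = 0 if nxt[2] else h(nxt[0], nxt[1])
                heapq.heappush(
                    heap, (new_g + new_h, next(tie), new_g, nxt, path + [nxt[0]]))
    return None

# Hypothetical distances: four cities on a square, sides 1, diagonals 2.
DIST = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 1},
    "B": {"A": 1, "B": 0, "C": 1, "D": 2},
    "C": {"A": 2, "B": 1, "C": 0, "D": 1},
    "D": {"A": 1, "B": 2, "C": 1, "D": 0},
}
cost, tour = tsp_astar(DIST, "A")
print(cost, tour)
```

The number of states grows exponentially with the number of cities, so this is only practical for small instances.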
The confusion here is that the graph on which you are trying to solve the TSP is not the graph you are performing an A* search on.
See related: Sudoku solving algorithm C++
To solve this problem you need to:
Define your:
TSP states
TSP initial state
TSP goal state(s)
TSP state successor function
TSP state heuristic
Apply a generic A* solver to this TSP state graph
A quick example I can think up:
TSP states: list of nodes (cities) currently in the TSP cycle
TSP initial state: the list containing a single node, the travelling salesman's home town
TSP goal state(s): a state is a goal if it contains every node in the graph of cities
TSP successor function: can add any node (city) that isn't in the current cycle to the end of the list of nodes in the cycle to get a new state
The cost of the transition is equal to the cost of the edge you're adding to the cycle
TSP state heuristic: you decide
If it's just a problem of understanding the algorithm and how it works you might want to consider drawing a graph on paper, assigning weights to it and drawing it out. Also you can probably find some animations that show Dijkstra's shortest path, Wikipedia has a good one. The only difference between Dijkstra and A* is the addition of the heuristic, and you stop the search as soon as you reach the target node. As far as using it to solve the TSP, good luck with that!
Think about this a little more abstractly. Forget about A* for a moment; it's just Dijkstra's with a heuristic anyway. Before, you wanted to get from A to B. What was your goal? To get to B. The goal was to get to B with the least cost. At any given point, what was your current "state"? Probably just your location on the graph.
Now, you want to start at A, then go to both B and C. What is your goal now? To pass over both B and C, maintaining least cost. You can generalize this with more nodes: D, E, F, ... or just N nodes. Now, at any given point, what is your current "state"? This is critical: it ISN'T just your location in the graph--it's also which of B or C or whatever nodes you have visited so far in the search.
Implement your original algorithm so that it calls some function asking if it has reached "the goal state" after making X move. Before, the function would have just said "yes, you're at state B, therefore you are at the goal". But now, let that function return "yes, you're at the goal state" if the search's path has passed over each of the points of interest. It'll know whether or not the search has passed over all points of interest because that's included in the current state.
After you get that, improve the search with some heuristic, and A* it up.
To answer one of your questions...
To pass a graph as a function argument, you have several options. You could pass a pointer to an array containing all the nodes. You could pass just the one starting node and work from there, if it's a fully connected graph. And finally, you could write a graph class with whatever data structures you need inside it, and pass a reference to an instance of that class.
As for your other question about closest nodes, isn't part of A* search that it will backtrack as needed? Or you could implement your own sort of backtracking to handle that kind of situation.
The question is, like, what happens if the closest node can't be reached from the previous closest node?
This step isn't necessary. As in, you aren't computing a path from the previous closest to the current closest, you are trying to get to your goal node, and the current closest is the only thing that matters (e.g. the algorithm doesn't care that last iteration you were 100km away, because this iteration you are only 96km away).
As a broad introduction, A* doesn't directly construct a path: it explores until it definitely knows that the path is contained within the region it has explored, and then constructs the path based on the information recorded during the exploration.
(I'm going to use the code in the Wikipedia article as a reference implementation to aid my explanation.)
You have two sets of nodes: closedset and openset
closedset holds nodes that have been fully evaluated; that is, you know exactly how far they are from start and all their neighbours are in one of the two sets. Thus there is no more computation you can do with them, and so we can (sort of) ignore them. (Basically these are completely contained within the border.)
openset holds "border" nodes, you know how far these are from start, but you haven't touched their neighbours yet, so they are on the edge of your search so far.
(Implicitly, there is a third set: completely untouched nodes. But you don't really touch them until they are in openset so they don't matter.)
At a given iteration, if you've got nodes to explore (that is, nodes in openset), you need to work out which one to explore. This is the job of the heuristic, it basically gives you a hint about which point on the border will be the best to explore next by telling you which node it thinks will have the shortest path to goal.
The previous closest node is irrelevant, it just expanded the border a bit, adding new nodes to openset. These new nodes are now candidates for the closest node in this iteration.
At first, openset only contains start, but then you iterate and at each step the border is expanded a little (in the most promising direction), until you eventually reach goal.
When A* is actually doing the exploration, it doesn't worry about which nodes came from where. It doesn't need to, because it knows their distance from start and the heuristic function and that's all it needs.
However, to reconstruct the path later you need some record of it; this is what camefrom is. For a given node, camefrom links it to the neighbour through which the shortest known route from start reaches it, so you can reconstruct the shortest path by following the links backwards from goal.
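The roles of openset, closedset and camefrom can be seen in the compact sketch below. It is not the Wikipedia code itself; it assumes a `neighbors(node)` function that yields (neighbour, step cost) pairs, a consistent heuristic, and nodes that can be compared with each other (tuples here). The grid, `WALLS` and `GOAL` are made up for the example.

```python
import heapq

def a_star(neighbors, heuristic, start, goal):
    """Return the shortest path from start to goal as a list, or None."""
    openset = [(heuristic(start), start)]   # border nodes, ordered by g + h
    closedset = set()                       # fully evaluated nodes
    g_score = {start: 0}                    # best known distance from start
    camefrom = {}                           # node -> predecessor on best route
    while openset:
        _, current = heapq.heappop(openset)
        if current == goal:
            # Follow the camefrom links backwards to rebuild the path.
            path = [current]
            while current in camefrom:
                current = camefrom[current]
                path.append(current)
            return path[::-1]
        if current in closedset:
            continue                        # an older, worse queue entry
        closedset.add(current)
        for nxt, cost in neighbors(current):
            tentative = g_score[current] + cost
            if nxt not in closedset and tentative < g_score.get(nxt, float("inf")):
                g_score[nxt] = tentative
                camefrom[nxt] = current
                heapq.heappush(openset, (tentative + heuristic(nxt), nxt))
    return None

# Hypothetical 3x3 grid with a wall between start and goal.
WALLS = {(1, 0), (1, 1)}
GOAL = (2, 0)

def grid_neighbors(cell):
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 3 and 0 <= ny < 3 and (nx, ny) not in WALLS:
            yield (nx, ny), 1

def manhattan(cell):
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])

path = a_star(grid_neighbors, manhattan, (0, 0), GOAL)
print(path)
```

Note that the search first heads away from the goal to get around the wall; nothing in the loop requires the node explored in one iteration to be reachable from the one explored in the previous iteration.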
How does one actually take a "graph" as a function argument?
By passing one of the representations of a graph.
I don't really see how A* applies to the TSP. I mean, finding a route from A to B, sure, I get that. But the TSP? I don't see the connection.
You need a different heuristic and a different end condition: goal is no longer a single node any more, but the state of having everything connected; and your heuristic is some estimate of the length of the shortest path connecting the remaining nodes.
Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the data in the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing the graph at the time of the query. I.e., it's basically a greedy algorithm that assumes you're going to eventually use enough of the combinations that it's easiest to just traverse them all and store the result. Depending on the size of the graph, the pre-processing step may be slow, but once it's done executing a query becomes quite fast (constant time, and usually a pretty small constant at that).
Depth first search or breadth first search. Stop when you find one. But there's no way to tell there's none without going through every one, no. You can improve the performance sometimes with some heuristics, like if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and most of the time you know that there's going to be a mostly direct path, then you can attempt to have the depth-first search look along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other, but with no vector inbetween, and to find it, you have to go way out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited and mark it as visited
While there are nodes in the queue:
    Take the next node off the front of the queue
    If the node is V2, return true
    For every node at the end of an outgoing edge which is not yet visited:
        Mark this node as visited
        Add this node to the queue
    End for
End while
Return false
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
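A small pure-Python sketch of the matrix-power idea, creating the powers as it goes along rather than storing them all. The function name and the sample matrix are made up for the example; a numerical library would be the usual choice for large matrices.

```python
def steps_between(adjacency, u, v):
    """Return (k, count): the smallest power k with a path from u to v and the
    number of k-step paths, or None if v is unreachable from u."""
    n = len(adjacency)
    power = [row[:] for row in adjacency]            # M^1
    for k in range(1, n + 1):
        if power[u][v] > 0:
            return k, power[u][v]
        # Multiply by the adjacency matrix to get the next power.
        power = [[sum(power[i][m] * adjacency[m][j] for m in range(n))
                  for j in range(n)]
                 for i in range(n)]
    return None

# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
M = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(steps_between(M, 0, 3))
print(steps_between(M, 3, 0))
```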
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's algorithm, which creates an O(n log n) space data structure that takes O(n log n) time to build and answers queries in O(1) time.
Another approach to this problem would allow you to ignore all of the vertices. If you were to only look at the edges, you can produce a transitive closure array that will show you each vertex that is reachable from any other vertex.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start location as the rows and end location as the columns. Fill the array with 0. For each edge in the list of edges, place a one in the start,end coordinate of the edge.
Now you iterate a few times until either V1,V2 is 1 or there are no changes.
For each row:
    NextRowN = RowN
    For each column that is true for RowN:
        Use boolean OR to merge the row with that column's number into the current NextRowN.
    Set RowN to NextRowN
If you run this algorithm until the end, you will have a complete list of all reachable vertices without looking at any of them. Each pass does one row OR per 1 entry in the array, so the first pass costs time proportional to the number of edges, and later passes do more work as the array fills in. This would work well with a reasonable implementation and a reasonable number of edges.
A slightly more complex version of this algorithm would be to only calculate the vertices reachable by V1. To do this, you would focus your scope on the ones that are currently reachable at any given time. You can also limit adding rows to only one time, since the other rows are never changing.
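The full row-OR procedure might be sketched like this, storing each row of the array as a Python integer used as a bit set so that one `|=` merges a whole row. It assumes vertices are numbered 0 to n - 1; the function names and sample edges are made up for the example.

```python
def transitive_closure(n, edges):
    """rows[i] has bit j set when vertex j is reachable from vertex i."""
    rows = [0] * n
    for start, end in edges:
        rows[start] |= 1 << end
    changed = True
    while changed:                          # iterate until there are no changes
        changed = False
        for i in range(n):
            next_row = rows[i]
            for j in range(n):
                if (rows[i] >> j) & 1:      # column j is true for row i
                    next_row |= rows[j]     # OR in everything j can reach
            if next_row != rows[i]:
                rows[i] = next_row
                changed = True
    return rows

def reachable(rows, u, v):
    return bool((rows[u] >> v) & 1)

# Hypothetical edge list: 0 -> 1, 1 -> 2, 3 -> 0.
rows = transitive_closure(4, [(0, 1), (1, 2), (3, 0)])
print(reachable(rows, 3, 2))
print(reachable(rows, 2, 0))
```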
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth-first or breadth-first search that does not expand a vertex it has already seen, so each vertex is processed on its first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.
I have a graph with the following attributes:
Undirected
Not weighted
Each vertex has a minimum of 2 and maximum of 6 edges connected to it.
Vertex count will be < 100
Graph is static and no vertices/edges can be added/removed or edited.
I'm looking for paths between a random subset of the vertices (at least 2). The paths should be simple paths that only go through any vertex once.
My end goal is to have a set of routes so that you can start at one of the subset vertices and reach any of the other subset vertices. It's not necessary to pass through all the subset nodes when following a route.
All of the algorithms I've found (Dijkstra, depth-first search, etc.) seem to deal with paths between two vertices and with shortest paths.
Is there a known algorithm that will give me all the paths (I suppose these are subgraphs) that connect this subset of vertices?
edit:
I've created a (warning! programmer art) animated gif to illustrate what I'm trying to achieve: http://imgur.com/mGVlX.gif
There are two stages: pre-process and runtime.
pre-process
I have a graph and a subset of the vertices (blue nodes)
I generate all the possible routes that connect all the blue nodes
runtime
I can start at any blue node select any of the generated routes and travel along it to reach my destination blue node.
So my task is more about creating all of the subgraphs (routes) that connect all blue nodes, rather than creating a path from A->B.
There are so many ways to approach this, and in order not to confuse things, here's a separate answer that addresses the description of your core problem:
Finding ALL possible subgraphs that connect your blue vertices is probably overkill if you're only going to use one at a time anyway. I would rather use an algorithm that finds a single one, but randomly (so not any shortest path algorithm or such, since it will always be the same).
If you want to save one of these subgraphs, you simply have to save the seed you used for the random number generator and you'll be able to produce the same subgraph again.
Also, if you really want to find a bunch of subgraphs, a randomized algorithm is still a good choice since you can run it several times with different seeds.
The only real downside is that you will never know if you've found every single one of the possible subgraphs, but it doesn't really sound like that's a requirement for your application.
So, on to the algorithm: Depending on the properties of your graph(s), the optimal algorithm might vary, but you could always start off with a simple random walk, starting from one blue node, walking to another blue one (while making sure you're not walking in your own old footsteps). Then choose a random node on that path and start walking to the next blue from there, and so on.
For certain graphs, this has very bad worst-case complexity but might suffice for your case. There are of course more intelligent ways to find random paths, but I'd start out easy and see if it's good enough. As they say, premature optimization is evil ;)
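A seeded sketch of this idea follows. It swaps the pure random walk for a randomized depth-first search, so a walk cannot get stuck in a dead end, and it assumes the graph is a dict of neighbor lists and that the blue nodes are given as a list. All names, and the 3x3 grid used as sample data, are made up for the example.

```python
import random

def random_simple_path(graph, source, target, rng):
    """Randomized depth-first search: return one simple path, or None."""
    stack = [(source, [source])]
    seen = {source}
    while stack:
        node, path = stack.pop()
        if node == target:
            return path
        nbrs = list(graph[node])
        rng.shuffle(nbrs)                   # the only source of randomness
        for nxt in nbrs:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

def random_route(graph, blue_nodes, seed):
    """Return a set of edges connecting all blue nodes; same seed, same route."""
    rng = random.Random(seed)
    blues = list(blue_nodes)
    route_nodes = [blues[0]]
    route_edges = set()
    for blue in blues[1:]:
        if blue in route_nodes:
            continue                        # already picked up by an earlier path
        start = rng.choice(route_nodes)     # branch off somewhere on the route
        path = random_simple_path(graph, start, blue, rng)
        if path is None:
            raise ValueError("blue nodes are not all connected")
        for a, b in zip(path, path[1:]):
            route_edges.add(frozenset((a, b)))
        new_nodes = [n for n in path if n not in route_nodes]
        route_nodes.extend(new_nodes)
    return route_edges

# Hypothetical 3x3 grid graph, nodes numbered 0..8 row by row.
GRID = {
    0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
    3: [0, 4, 6], 4: [1, 3, 5, 7], 5: [2, 4, 8],
    6: [3, 7], 7: [4, 6, 8], 8: [5, 7],
}
route = random_route(GRID, [0, 8, 2], seed=42)
print(sorted(sorted(edge) for edge in route))
```

Saving the seed is enough to regenerate a route, and running with different seeds produces different candidate routes.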
A simple breadth-first search will give you the shortest paths from one source vertex to all other vertices. So you can perform a BFS starting from each vertex in the subset you're interested in, to get the distances to all other vertices.
Note that in some places, BFS will be described as giving the path between a pair of vertices, but this is not necessary: You can keep running it until it has visited all nodes in the graph.
This algorithm is similar to Johnson's algorithm, but greatly simplified thanks to the fact that your graph is unweighted.
Time complexity: Since there is a constant number of edges per vertex, each BFS will take O(n), and the total will take O(kn), where n is the number of vertices and k is the size of the subset. As a comparison, the Floyd-Warshall algorithm will take O(n^3).
What you're searching for is (if I understand it correctly) not really all paths, but rather all spanning trees. Read the wikipedia article about spanning trees here to determine if those are what you're looking for. If it is, there is a paper you would probably want to read:
Gabow, Harold N.; Myers, Eugene W. (1978). "Finding All Spanning Trees of Directed and Undirected Graphs". SIAM J. Comput. 7 (280).