I'm sure there is an abundance of information on how to do exactly what I'm after, but it's a matter of not knowing the technical term for it. Basically what I want to create is an adjacency matrix for a directed graph, however rather than simply storing whether or not each vertex pair has a direct adjacency, for every vertex pair in the matrix I want to store if there is ANY path connecting the two (and what those paths are).
This would give me constant time complexity for lookups which is desirable, however what's not immediately clear to me is what the expected optimal time complexity of building this matrix will be.
Also, is there a formal name for such a matrix?
Playing this out in my head, it seems like a dynamic programming problem. If I want to know if A is connected to Z, I should be able to ask each of A's neighbors, B, C and D if they are (in some way) connected to Z, and if so, then I know A is. And if B doesn't have this answer stored, then he would ask the same question of his direct neighbors, and so on. I would memoize the results along the way, so subsequent lookups would be constant.
I haven't spent time to implement this yet, because it feels like ϴ(n^n) to build a complete matrix, so my question is whether or not I'm going about this the right way, and if indeed there is a lower-cost way to build such a matrix?
The transitive closure of a graph (https://en.wikipedia.org/wiki/Transitive_closure#In_graph_theory) can indeed be computed by dynamic programming with a variation of Floyd Warshall algorithm: https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm.
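Running a DFS (or BFS) from each of the |V| vertices is more efficient on sparse graphs, though: O(|V|·(|V|+|E|)) in total, versus O(|V|³) for Floyd-Warshall.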
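As a rough sketch of the "one search per vertex" idea, assuming the graph is given as a plain dict of adjacency lists (the function name and the sample graph are made up for illustration):

```python
from collections import deque

def transitive_closure(adj):
    """Map each vertex to the set of vertices reachable by a path of length >= 1."""
    reach = {}
    for source in adj:
        seen = set()
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        # A vertex appears in its own set only if it lies on a cycle.
        reach[source] = seen
    return reach

# Hypothetical directed graph: D -> A -> B -> C
graph = {"A": ["B"], "B": ["C"], "C": [], "D": ["A"]}
closure = transitive_closure(graph)
print("C" in closure["A"])  # constant-time lookup once the closure is built
```

This stores only reachability, not the paths themselves; keeping every path would be far more expensive, since the number of simple paths can grow exponentially.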
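Using networkx connected components:

    import networkx as nx

    G = nx.path_graph(4)
    nx.add_path(G, [10, 11, 12])  # newer networkx versions removed G.add_path

    # Map every node to the index of its connected component.
    d = {}
    for idx, group in enumerate(nx.connected_components(G)):
        for node in group:
            d[node] = idx

    def connected(node1, node2):
        return d[node1] == d[node2]

Generation should be O(V + E); lookup should be O(1). Note that connected components are only defined for undirected graphs, so this answers "are the two nodes joined at all?" rather than directed reachability.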
Please forgive me if I'm not using the correct terms or have overlooked an existing solution. I'm not experienced in search algorithms and the theories behind it. I just would like to solve a problem.
I've previously used what I was told to be the A* algorithm to solve a different problem. But reading up on it I've realized that what I learned is not quite what wikipedia tells me.
What I learned was:
Start at your origin node
Open a new solution for each path you can take
Recursively create a new subsolution for each path you can take from there
When you arrive at the same place with multiple solutions, drop those who took longer than the fastest
Now if I understand wikipedia correctly, this is what I was supposed to do:
Start at your origin node
Open a new solution for each path you can take
Order the solutions by "cost of path taken" + "estimated cost to target"
Take cheapest solution and create subsolutions for each possible path
order those solutions into the others then rinse repeat
I can see how this would help with not calculating quite as many solutions, but my problem is that I see no possibility to create an "optimistic" estimate.
I'm not searching for a path on a geographical map. I'm trying to find the best sequence of actions. There's a minimum sequence of - say - ABCDEFGH. You cannot do F before E but repeating previous actions in a particular ordering might make later actions more efficient.
Do I need a different search algorithm? Do I do what I originally learned and just live with the fact that doing more work is the price for not having a good heuristic function?
I believe my teacher recognized this problem. And what I learned was simply A* with a heuristic function of h(n) = 0.
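To make the relationship concrete, here is a minimal sketch of A* in which the heuristic is a parameter that defaults to zero. With the default it behaves like uniform-cost (Dijkstra-style) search, which matches the "drop the slower solutions" procedure described above. The function name, the callback signatures and the toy graph are assumptions for illustration only.

```python
import heapq
from itertools import count

def best_first_search(start, is_goal, successors, heuristic=lambda state: 0):
    """A* search; with the default zero heuristic it is uniform-cost search.

    successors(state) yields (next_state, step_cost) pairs; states must be hashable.
    """
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(heuristic(start), next(tie), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return cost, path
        if cost > best_cost.get(state, float("inf")):
            continue  # a cheaper way to this state was found after this entry was queued
        for nxt, step in successors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, next(tie), new_cost, nxt, path + [nxt]))
    return None

# Hypothetical toy graph: state -> [(next_state, cost), ...]
edges = {"S": [("A", 1), ("B", 4)], "A": [("B", 1), ("G", 5)], "B": [("G", 1)], "G": []}
result = best_first_search("S", lambda s: s == "G", lambda s: edges[s])
print(result)
```

Supplying a heuristic that never overestimates the remaining cost only changes the order in which states are expanded; the answer stays optimal, and a better estimate simply means fewer expansions.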
I'm not searching for a path on a geographical map. I'm trying to find
the best sequence of actions. There's a minimum sequence of - say -
ABCDEFGH. You cannot do F before E but repeating previous actions in
particular ordering might make later actions more efficient.
It is not clear to me whether you can repeat one action, i.e., a solution is ABCDEFGH, but would ABBBBCDEFGH be possible?
If not, then you might be able to have A* algorithm, implemented like this:
1. At some stage (say the first, "empty"), you have one of several actions available.
2. The cost of going from Empty City to A City is the cost of action A.
3. The cost of going from Empty City to B city is the cost of action B.
When you've reached B, the cost of doing C is constant (if it is not, then you can't use A* as is) and you insert the cost of going from B City to C City as the cost of C.
So you can handle the case in which an action has different costs, provided that this difference is completely described by the previous state. For example, if you can only do C if you have done A or B, and the cost of C is 5 and 8, you enter the "distance" between A and C as 5, and B to C as 8.
If the cost of, say, D depends on the two previous states, you can still use a more complicated A* implementation where you define the virtual "cities" BC, AB and AC, and the distance from BC to D is "the cost of D having done B and C", and so on. The cost of reaching BC from A is "the cost of B given A, and the cost of C given A and B". So if these costs depend on the previous states, things get even more complicated.
In the end, the complexity of this revised A* will grow until it becomes your algorithm, where every state depends potentially on the sequence of all preceding states. The more this is true, the more your algorithm is convenient; the more every state is a cost unto itself, the more A* is convenient.
And of course the possibility of closed loops (visiting the same state/action twice, making this a cyclic graph) blows A* straight out of the water.
I asked about the minimum cost maximum flow several weeks ago. Kraskevich's answer was brilliant and solved my problem. I've implemented it and it works fine (available only in French, sorry). Additionally, the algorithm can handle the assignment of i (i > 1) projects to each student.
Now I'm trying something more difficult. I'd like to add constraints on choices. In the case where one wants to assign i (i > 1) projects to each student, I'd like to be able to specify which projects are compatible with each other.
In the case where some projects are not compatible, I'd like the algorithm to return the global optimum, i.e. assign i projects to each student, maximizing global happiness and respecting compatibility constraints.
Chaining i times the original method (and checking constraints at each step) will not help, as it would only return a local optimum.
Any idea about the correct graph to work with?
Unfortunately, it is not solvable in polynomial time (unless P = NP or there are additional constraints).
Here is a polynomial time reduction from the maximum independent set problem (whose decision version is known to be NP-complete) to this one:
Given a graph G and a number k, do the following:
Create a project for each vertex in the graph G and say that two projects are incompatible iff there is an edge between the corresponding vertices in G.
Create one student who likes each project equally (we can assume that the happiness each project gives to him is equal to 1).
Find the maximum happiness using an algorithm that solves the problem stated in your question. Let's call it h.
A set of projects can be picked iff they are all compatible, which means that the picked vertices of G form an independent set (due to the way we constructed the graph).
Thus, h is equal to the size of the maximum independent set.
Return h >= k.
What does it mean in practice? It means that it is not reasonable to look for a polynomial time solution to this problem. There are several things that can be done:
If the input is small, you can use exhaustive search.
If it is not, you can use heuristics and/or approximations to find a relatively good solution (not necessarily the optimal one, though).
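For the small-input case, an exhaustive search could look roughly like the sketch below. It assumes each project can go to at most one student and that the compatibility rule applies to the projects given to the same student; the function name, the data layout and the sample numbers are all invented for illustration.

```python
from itertools import combinations

def best_assignment(happiness, per_student, incompatible):
    """Try every way of giving `per_student` mutually compatible projects to each student.

    happiness[student][project] is the score; returns (best_total, {student: projects}).
    """
    students = list(happiness)
    projects = sorted({p for prefs in happiness.values() for p in prefs})
    bad = {frozenset(pair) for pair in incompatible}
    best = (float("-inf"), None)

    def compatible(combo):
        return all(frozenset(pair) not in bad for pair in combinations(combo, 2))

    def search(index, free, total, chosen):
        nonlocal best
        if index == len(students):
            if total > best[0]:
                best = (total, dict(chosen))
            return
        student = students[index]
        for combo in combinations(sorted(free), per_student):
            if compatible(combo):
                gain = sum(happiness[student].get(p, 0) for p in combo)
                chosen[student] = combo
                search(index + 1, free - set(combo), total + gain, chosen)
                del chosen[student]

    search(0, frozenset(projects), 0, {})
    return best

# Hypothetical data: two students, two projects each, p1 and p2 cannot be combined.
happiness = {
    "ann": {"p1": 5, "p2": 4, "p3": 1, "p4": 2},
    "bob": {"p1": 3, "p2": 3, "p3": 2, "p4": 4},
}
total, assignment = best_assignment(happiness, 2, [("p1", "p2")])
print(total, assignment)
```

The running time grows exponentially with the number of students and projects, which is consistent with the hardness argument above, so this is only practical for very small instances.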
If you can stomach the library dependency, integer programming will be quicker and easier than anything you can implement yourself. All you have to do is formulate the original problem as an integer program and then add your ad hoc constraints at the end.
I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
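In this example, I don't need to evaluate dist(B,C) or dist(A,D): assuming the distances obey the triangle inequality, dist(B,C) ≥ dist(A,C) − dist(A,B) = 4.5 and dist(A,D) ≥ dist(A,C) − dist(C,D) = 3, so both are guaranteed to be greater than the current known minimum distance (0.5).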
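The pruning rule in this example can be written down as a small helper. This is only a sketch under the assumption that the distances form a metric (so the triangle inequality holds); the function name and the dictionary layout are made up.

```python
def lower_bound(known, x, y):
    """Best lower bound on dist(x, y) from |d(x, z) - d(z, y)| over pivots z
    for which both distances are already known."""
    def d(a, b):
        return known.get((a, b), known.get((b, a)))

    points = {p for pair in known for p in pair}
    best = 0.0
    for z in points - {x, y}:
        dxz, dzy = d(x, z), d(z, y)
        if dxz is not None and dzy is not None:
            best = max(best, abs(dxz - dzy))
    return best

# Distances from the example above.
known = {("A", "B"): 0.5, ("A", "C"): 5.0, ("C", "D"): 2.0}
current_min = min(known.values())

for pair in [("B", "C"), ("A", "D")]:
    bound = lower_bound(known, *pair)
    print(pair, bound, "skip" if bound > current_min else "must evaluate")
```

A pair can be skipped only when its lower bound already exceeds the current minimum; a bound of 0 (no usable pivot) says nothing, so that pair still has to be evaluated.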
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximated solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a great approximation for small n.
Processing time (measured by the big O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, it is a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out it having something interesting to say, meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V^2), by using planar distances to deduce limitations on sets of edges, allowing us to avoid needing to look at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points that are closer to the first point and the points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.
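A rough sketch of this two-pivot partitioning, used here to get an approximate closest pair using only a distance function: the function name, the `leaf_size` parameter and the random pivot choice are assumptions, not part of the suggestion above. The result is approximate because the true closest pair may end up split across two parts.

```python
import math
import random
from itertools import combinations

def approx_closest_pair(points, dist, leaf_size=8, rng=None):
    """Return (distance, a, b); leaf_size must be at least 1."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        # Small enough: compare all pairs inside this part.
        pairs = ((dist(a, b), a, b) for a, b in combinations(points, 2))
        return min(pairs, key=lambda t: t[0], default=(math.inf, None, None))
    p, q = rng.sample(points, 2)
    if dist(p, q) == 0:
        return (0, p, q)  # duplicates: nothing can be closer
    near_p, near_q = [], []
    for x in points:
        (near_p if dist(x, p) <= dist(x, q) else near_q).append(x)
    left = approx_closest_pair(near_p, dist, leaf_size, rng)
    right = approx_closest_pair(near_q, dist, leaf_size, rng)
    return min(left, right, key=lambda t: t[0])

points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (0.5, 0.0), (5.0, 5.0)]
exact = approx_closest_pair(points, math.dist, leaf_size=len(points))
approx = approx_closest_pair(points, math.dist, leaf_size=2)
print(exact, approx)
```

A larger `leaf_size` makes the answer more reliable but moves the cost back toward the quadratic all-pairs comparison.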
Here's visualisation of my problem.
I've been trying to use Dijkstra on that; however, it hasn't worked.
The complication, as I see it, is that Dijkstra's algorithm throws away information that you need to keep around: if you are trying to get from A to E in
        B
       / \
      A   D - E
       \ /
        C
And ABD is shorter than ACD, Dijkstra's will forget that ACD was ever a possibility (it uses ABD as the canonical route from A to D). But if ABD has a higher cost than ACD, and ABDE is above the quota while ACDE is below, the now eliminated ACD was correct. The problem is that Dijkstra's algorithm assumes that if one path is at least as long as another, it is weakly dominated: there is no reason to prefer it. And in one dimension of comparison, paths are weakly ordered: given any two paths, one weakly dominates the other.
But here we have two dimensions of comparison, and so ordering does not hold: one path can be shorter, the other cheaper. Since we can only discard dominated paths, we must keep all paths that do not already exceed the budget and are not dominated. I have put a bit of work into implementing this approach; it looks doable, but I cannot find an argument for a worst-case bound below exponential complexity (although normal performance should be much better, since in a sane graph most paths are dominated).
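One way the "keep every non-dominated path" idea might be sketched is shown below. It assumes non-negative lengths and costs, and the edge weights are invented to reproduce the situation in the diagram; this is an illustration, not the implementation mentioned above.

```python
import heapq

def shortest_within_budget(graph, source, target, budget):
    """graph[u] = [(v, length, cost), ...]; returns (length, cost, path) or None."""
    labels = {source: [(0, 0)]}  # non-dominated (length, cost) pairs per node
    heap = [(0, 0, source, [source])]
    while heap:
        length, cost, node, path = heapq.heappop(heap)
        if (length, cost) not in labels.get(node, ()):
            continue  # this label was dominated after it was queued
        if node == target:
            return length, cost, path  # labels are popped in order of length
        for nxt, edge_len, edge_cost in graph.get(node, ()):
            new_len, new_cost = length + edge_len, cost + edge_cost
            if new_cost > budget:
                continue  # over budget: cannot recover with non-negative costs
            kept = labels.setdefault(nxt, [])
            if any(l <= new_len and c <= new_cost for l, c in kept):
                continue  # dominated by an existing label
            # Drop labels that the new one dominates, then keep the new one.
            kept[:] = [(l, c) for l, c in kept if not (new_len <= l and new_cost <= c)]
            kept.append((new_len, new_cost))
            heapq.heappush(heap, (new_len, new_cost, nxt, path + [nxt]))
    return None

# Hypothetical weights: ABD is shorter but more expensive than ACD.
graph = {
    "A": [("B", 1, 3), ("C", 2, 1)],
    "B": [("D", 1, 3)],
    "C": [("D", 2, 1)],
    "D": [("E", 1, 1)],
}
print(shortest_within_budget(graph, "A", "E", budget=6))
print(shortest_within_budget(graph, "A", "E", budget=7))
```

With a budget of 6 the longer route through C is the only feasible one, while a budget of 7 lets the shorter route through B win. The number of labels per node can still grow exponentially in bad cases, in line with the concern raised above.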
You can also, as Billiska notes, use k-th shortest routes algorithms and then proceed through their results until you find one below the budget. That uses time O(m+ K*n*log(m/n)); but unless someone sees an upper bound on K such that K is guaranteed to include a path under the budget (if one exists), we need to set K to be the total number of paths, again yielding exponential complexity (although again a strategy of incrementally increasing K would likely yield a reasonable average runtime, at least if length and cost are reasonably correlated).
EDIT:
Complicating (perhaps fatally) the implementation of my proposed modification is that Dijkstra's algorithm relies on an ordering of the accessibility of nodes, such that we know that if we take the unexplored node to which we have the shortest path, we will never find a better route to it (since all other routes are already known to be longer). If that shortest route is also expensive, that need not hold; even after exploring a node, we must be prepared to update paths out of it on the basis of longer but cheaper routes into it. I suspect that this will prevent it from reaching polynomial time in the worst case.
Basically you need to find the first shortest-path, check if it works, then find the second shortest-path, check if it works, and so on...
Dijkstra's algorithm isn't designed to work with such a task.
And from just a Google search on this new definition of the problem, I arrive at a Stack Overflow question on finding kth-shortest-paths.
I haven't read into it yet, so don't ask me.
I hope this helps.
I think you can do it with Dijkstra, but you have to change the way you are calculating the tentative distance in each step. Instead of just taking into account the distance, consider also the cost. The new distance should be a 2-d number (dist, cost); when you choose the minimal distance, you should take the one with minimal dist AND cost <= 6. That's it.
I hope this is correct.
I am working on some revision at the moment and specifically going over Big-O notation. I have asked a similar question (which dealt with a different algorithm) but am still unsure if I am going the right way about it or not.
The algorithm that I am looking at is Exhaustive Search (aka Brute Force, I believe) and looks like this:
Input: G - the graph
       n - the current node
       p - the path so far
1) For every edge nm (from n to m) in G do
2)     If m ∉ p then
3)         p = p ∪ {m}
4)         Exhaustive(G, m, p)
5)     End If
6) End For
So far I have come to the result that this algorithm is O(n) - is this correct? I doubt that it is, and would love to know exactly how to go about working it out; what to look for, what exactly it is that I 'count' each time, etc. I understand that the number of operations taking place need to be counted, but is that all that I need to take note of/count?
EDIT: I have learned that this algorithm is, in fact, O((n-1)!) - is this correct and if so, how did this solution come about as I cannot work it out?
Usually (but not always) with graphs, the input size n is the number of nodes in the graph. It's fairly easy to prove to ourselves that the function is called at least n times (let alone what the runtime is): a single path through a graph (assuming it's connected, that is, every node is reachable from every other node via some path) will take n calls.
To compute running time of recursive functions, an upper bound on the running time will be the number of times the recursive function is called multiplied by the runtime of the function in a single call.
To see that the worst case runtime is O((n-1)!), consider how many paths are in a fully connected graph - you can visit any node directly from any node. Another way of phrasing this is that you can visit the nodes in any order, save the starting state. This is the same as the number of permutations of (n-1) elements. I believe it's actually going to be O(n!), since we are iterating over all edges which takes O(n) for each state on the path (n*(n-1)!). EDIT: More precisely, we can say it's Ω(n!). See comments for more details.
Sometimes, it's easier to look at what the algorithm computes than the actual code - that is, the cardinality of all the states (more specifically here, paths).
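A quick way to check the counting argument is to run the search on small complete graphs and count. The sketch below assumes the intent of the pseudocode is to enumerate every simple path from the start node, so each recursive call gets its own extended copy of the path; the helper names and the `stats` dictionary are made up for illustration.

```python
from math import factorial

def exhaustive(graph, node, path, stats):
    """Visit every simple path that extends `path`, counting calls and maximal paths."""
    stats["calls"] += 1
    extended = False
    for m in graph[node]:
        if m not in path:
            extended = True
            exhaustive(graph, m, path + [m], stats)  # copy, so siblings are unaffected
    if not extended:
        stats["maximal_paths"] += 1  # the path cannot be extended any further

def complete_graph(n):
    return {u: [v for v in range(n) if v != u] for u in range(n)}

def run(n):
    stats = {"calls": 0, "maximal_paths": 0}
    exhaustive(complete_graph(n), 0, [0], stats)
    return stats

for n in range(2, 7):
    print(n, run(n), factorial(n - 1))
```

On a complete graph the number of maximal paths equals (n-1)!, and the total number of calls is somewhat larger because every prefix of every path is also visited once.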