Proving a recursive algorithm

I need to prove a recursive algorithm correct. Normally this would be done by induction on some integer value within the code, as with the base case when computing a factorial, but with a graph traversal I have no idea where to begin. Here is my algorithm (subscripts didn't convert, so I've written them as v_k, l_k, and so on).
Algorithm
Goal: Traverse a graph, creating a depth-first spanning tree, and compute the Last descendant of each vertex, that is, the descendant v_k with the highest index k
Input:
A connected graph G with vertices ordered v_1, v_2, v_3, …, v_n
Output:
A spanning tree T in which each vertex has had its Last descendant computed
Initialization
Set each vertex to unvisited. Let a_k denote a list of all vertices adjacent to v_k. Let l_k denote the Last descendant of v_k. Let c_k denote the list of all the children of v_k in the spanning tree. Let d_k denote the list of all vertices that are descendants of v_k in the spanning tree, including v_k itself.
dfs(v_k){
    add v_k to T
    set v to visited
    l_k = v_k
    add v_k to d_k
    foreach(vertex m in a_k with lowest value of k){
        add m to c_k
        add dfs(m) to d_k
    }
    foreach(vertex v_c in d_k){
        if(c > k){
            l_k = v_c
        }
    }
    if(k = 1)
        return T
    else
        return d_k
}
This is for a group project at school, so I don't want the whole proof, but a starting point and some direction would be greatly appreciated.

I'm having a hard time understanding your pseudocode. It seems unclear at best; probably the algorithm doesn't even work. Some issues:
The "visited" property is set, but never used.
T is supposed to be a tree. But you only ever add vertices, no edges to it. Without edges, it's certainly not a spanning tree. If you consider all edges from the original graph between the nodes in T part of T, then it will contain cycles and won't be a tree, either.
Why do you calculate l_k, if it's never used?
Is the parameter of your function dfs supposed to be k instead of v_k? Otherwise k is never set, but it is being used.
I fail to see how your recursion ever terminates. I don't see any guard for the base case (except maybe the "with lowest value of k" condition in your loop, which I don't understand, because from the rest of the code k appears to be the input parameter of the function, so m shouldn't depend on it).
So let me tell you about proving recursive algorithms over graphs in general. Apart from the induction over natural numbers that you mentioned, there is also Structural Induction. It has a base case and an inductive step, just like the induction you know. But the base case is usually a trivial component of your data structure and the inductive step proves your proposition for more complex composites, assuming that your proposition holds for its less complex components.
For example, you can prove an algorithm over trees by proving that it works for the leaves (your base case) and by proving that it works for a whole tree, assuming that your algorithm works for the left and right sub-trees of the root (induction step).
Since your graph, unlike the tree example above, may contain cycles, it's not automatically guaranteed that the data structure passed to your recursive call is less complex than the original one. But your algorithm probably has some way to ensure that the recursive call takes into account only a part of the graph, probably via the "visited" flag. In that case the recursive call has to take into account only the "not visited" subgraph, so you can do induction over those unvisited subgraphs. Start from the base case with only one vertex being unvisited, and then inductively add one unvisited node (including its edges) to the unvisited subgraph.
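To make the shape of such a proof concrete, here is the structural-induction schema for binary trees written out (a sketch only; the grammar T ::= Leaf | Node(T_l, T_r) and the property P are placeholders of mine, not anything from the question):

% Structural induction over binary trees T ::= Leaf | Node(T_l, T_r):
% prove P for the trivial tree, then for a composite assuming it for its parts.
\[
\frac{P(\mathrm{Leaf}) \qquad \forall T_l, T_r.\; P(T_l) \land P(T_r) \Rightarrow P(\mathrm{Node}(T_l, T_r))}{\forall T.\; P(T)}
\]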

Related

Kruskal's Algorithm explained

So I'm currently having a little play with some algorithms and have come across Kruskal's algorithm.
I understand the concept and how to carry out the process by hand, but I don't understand the algorithm as written.
Here is the algorithm:
From what I can figure out, |V| is all the vertices?
What is E'?
I have no idea why this algorithm is confusing me so much; I've picked up other ones with absolute ease.
Kruskal's algorithm adds edges to the MST in order of weight, unless they would introduce a cycle (this detection is typically done using union-find). The code starts by initialising some values:
n := |V| // the number of vertices
E' := ∅ // the edges in our MST; it starts as the empty set
Cands := E // the edges still under consideration for adding to the MST; starts as all edges
The loop condition is:
while |E'| < n - 1 and Cands != ∅ do
That is, we continue as long as we have selected fewer than n - 1 edges (because we know this is the number of edges contained in any spanning tree: if we have found them, we're done) and the set of edges we haven't considered yet is not empty.
Lines (1) and (2) find the minimum weight edge in Cands, removing it from the set. A suitable structure for Cands would be a min-heap, in which case this is just a pop-operation.
Lines (3) and (4) determine whether the edge we retrieved from Cands in (1) would introduce a cycle in E' if added. If it doesn't, we know this edge is in the MST, otherwise it's not.
The last line just checks to see whether we actually found a tree. It's possible the loop terminates without finding n - 1 edges, for instance when the graph is not one connected component.
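If it helps, here is a minimal sketch of this in Python, using a sorted edge list in place of the min-heap and a simple union-find for the cycle check (the function and variable names are my own, not from the pseudocode above):

def kruskal(n, edges):
    # n: number of vertices (labelled 0..n-1); edges: list of (weight, u, v)
    parent = list(range(n))

    def find(x):
        # walk to the representative of x's component, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []                              # E': the edges selected so far
    for w, u, v in sorted(edges):         # Cands, cheapest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                      # different components: no cycle introduced
            parent[ru] = rv               # union the two components
            mst.append((u, v, w))
            if len(mst) == n - 1:         # a spanning tree is complete
                break
    return mst if len(mst) == n - 1 else None   # None: graph wasn't connected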

Sum of Vertices in Induced Graph - Dynamic Programming

This is a homework question so I'll be glad to get a hint.
I have a graph G, where each vertex v has a weight w(v).
S(G) is the sum of the weights of all the vertices in the graph.
I need to find an algorithm that determines whether there is a set of vertices A such that G[A] (the subgraph of G induced by A) is a tree and S(G[A]) = S(G[V\A]).
I know that I should go over all the vertices, sum their weights, and then try to find a tree that reaches half of that sum, but I'm not sure how exactly. I'm pretty sure it involves dynamic programming.
Thank you very much,
Yaron.
This is not really a dynamic programming problem; it is a search problem, the key being that you are trying to find a tree.
If you think about it, you already know an algorithm or two that will tell you the minimum spanning tree. By the same logic, you can make a maximum spanning tree. For example, if you find the maximum spanning tree and the sum of its weights is less than 50% of the total (or whatever the target value is), then you know the problem is impossible.
So, following this logic, you can go along as though you were making a spanning tree and reject any path that goes over the target amount. This strategy is known as "branch and bound". Let's imagine how we could do this with Kruskal's algorithm:
(1) you will have a set of trees; start with each vertex as a separate "tree"
(2) maintain a queue of edges you have not used yet, sorted from least to greatest
(3) maintain a stack of edges that you have used
(4) look for an edge that (a) connects two different trees, and (b) keeps the combined weight of the two trees less than or equal to the target value (equal meaning you've found a solution)
(4a) if no such edge exists, then pop a value from the stack (remove the edge and separate the trees) and try the next value in the queue
(4b) if such an edge does exist, then add the edge (combine two of the trees), push it onto the stack and go back to step 4
Obviously there are different ways to do the same process. For example, you could use a variant of Prim's algorithm as well.
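As a rough illustration of the branch-and-bound idea (not the exact edge-stack formulation above), here is a sketch that grows connected vertex sets, prunes branches whose weight exceeds the target, and checks that the induced subgraph stays a tree. It assumes positive integer weights, is exponential in the worst case, and all names in it are mine:

def find_half_tree(vertices, weight, adj):
    # vertices: iterable of vertex ids; weight: dict v -> w(v) > 0
    # adj: dict v -> set of neighbours. Returns a set A with G[A] a tree
    # and S(G[A]) == S(G[V\A]), or None if no such set exists.
    total = sum(weight[v] for v in vertices)
    if total % 2:                        # with integer weights, an odd total is impossible
        return None
    target = total // 2

    def grow(A, s, frontier, banned):
        internal = sum(len(adj[v] & A) for v in A) // 2
        if internal != len(A) - 1:       # induced cycle: adding vertices can't remove it
            return None
        if s == target:
            return set(A)
        for v in list(frontier):
            if v in banned or s + weight[v] > target:   # bound: prune overweight branches
                continue
            res = grow(A | {v}, s + weight[v],
                       (frontier | adj[v]) - A - {v}, banned)
            if res is not None:
                return res
            banned = banned | {v}        # don't re-add v in sibling branches
        return None

    for start in vertices:
        if weight[start] <= target:
            res = grow({start}, weight[start], set(adj[start]), set())
            if res is not None:
                return res
    return None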

How to prove that finding a successor n-1 times in the BST from the minimum node is O(n)?

The claim is that we can produce a sorted order as follows:
1) Let the node be the minimum node of the BST.
2) From that node, repeatedly call find-successor.
I was told that the result is O(n) but I do not understand and do not know how to prove it.
Shouldn't it be O(n log n) instead? Step 1 is O(log n), and step 2 is also O(log n) but is called n-1 times, so the total should be O(n log n).
Please clarify my doubt. Thank you! :)
You are correct that any individual operation might take O(log n) time, so if you perform those operations n times, you should get a runtime of O(n log n). This bound is correct, but it's not tight. The actual runtime is Θ(n).
One way to see this is to look at any individual edge in the tree. How many times will you visit each edge if you start at the leftmost node and repeatedly perform a successor query? If you look closely at how the operations work, you'll discover that every edge is visited exactly twice: once downward and once upward. Since all the work done is done traversing up and down edges, this means that the total amount of work done is proportional to twice the number of edges. In any tree, the number of edges is the number of nodes minus one, and so the total work done is Θ(n).
To formalize this as a proof, try showing that you never descend down the same edge twice and that when you ascend up an edge, you never descend down that edge again. Once you've done this, the conclusion that the runtime is Θ(n) follows from the above logic.
Hope this helps!
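To see the two-passes-per-edge behaviour concretely, here is a small sketch of the successor walk on a parent-pointer BST (the classes and names are mine, not from the question):

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.parent = None
        for child in (left, right):
            if child:
                child.parent = self

def minimum(n):
    while n.left:                  # descend left edges to the smallest key
        n = n.left
    return n

def successor(n):
    if n.right:                    # case 1: leftmost node of the right subtree
        return minimum(n.right)
    while n.parent and n is n.parent.right:
        n = n.parent               # case 2: climb until we come up a left edge
    return n.parent                # None once we pass the maximum

# n-1 successor calls from the minimum print the keys in sorted order;
# over the whole walk each tree edge is traversed once down and once up.
root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
n = minimum(root)
while n:
    print(n.key, end=' ')
    n = successor(n)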
I wanted to post this as a comment on templatetypedef's answer, but it's too long.
His answer is right in that the easiest way to see that this is linear is because every edge is visited exactly twice, and the number of edges in a tree is always one less than the number of nodes (because every node has one parent, except the root!).
The issue is that the way he phrases the formal proof uses words that seem to imply contradiction as the way to go. In general, mathematicians frown on using contradiction because it often produces proofs with superfluous content. For instance:
Proof that 2 + 2 != 5:
Assume for contradiction that 2 + 2 = 5 (<- Remove this line)
Well 2 + 2 = 4
And 4 != 5
Contradiction! (<- Remove this line)
Contradiction tends to be verbose, and sometimes it can even obfuscate the idea behind the proof! There are times when contradiction seems pretty much necessary, but it's relatively rare and that's a separate discussion.
In this case, I don't see a proof by contradiction being any easier than a direct proof. On the other hand, regardless of proof technique, this proof is pretty ugly to do formally. Here's an attempt:
1) The succ(n) algorithm traverses one of two paths
In the first case every edge is visited on the simple path from a node to the leftmost node of its right subtree
In the other case, the node n has no right child, in which case we go up through its ancestors p_1, p_2, p_3, ..., p_k, where p_(k-1) is the first ancestor that is the left child of its parent. All of those edges are visited on that simple path.
We want to show that an arbitrary edge is traversed in precisely two succ() calls, once for the first case of succ() and once for the second case of succ(). Well, this is true for every edge other than the rightmost branch, but you can handle those edge cases separately. Alternatively we could prove the simpler argument where we return to the root after visiting the last element
This is two-fold: for a given edge e we have to find nodes n1 and n2 such that succ(n1) traverses e and succ(n2) also traverses e, as well as prove that every other succ() call generates a path that does not include e.
2) First we actually prove that for each type of path that succ() visits, no two paths overlap (i.e. if succ(n) and succ(n') both traverse paths of the same type, those paths share no edges)
In the first case, the simple path is precisely defined as follows. Start at node n and go one edge to the right to r. Then traverse the leftmost branch of the subtree rooted at r. Now consider any other such path that starts at some other node n' (note, we don't assume that n != n'). It must go right one node to r'. Then it traverses the leftmost branch of the subtree rooted at r'. If the paths overlap, then pick one of the edges that overlap. If it's (n,r) = (n',r'), then we have n = n' and so it's the same path. If it's some e = e' in both leftmost branches, then you can show, again, that n = n' (you can trace the leftmost branches and show that every edge is the same, then finally reach the conclusion that r = r' => n = n', because in a tree the parent is unique; you'll see this tracing argument below). Thus we know that for any n and n', if their paths overlap, they are actually the same node! The contrapositive says this: if they are different nodes, then their paths don't overlap. That's exactly what we want (and the contrapositive is logically equivalent to the original statement).
In the second case we define the simple path starting at node n and go up the ancestors p_1, p_2, ..., p_k = g until we reach the first node p_k such that p_(k-1) is to the left of p_k. Consider some other path of the same type that starts at node n' where n != n'. Similarly it visits p_1', p_2', ..., p_k' = g'. Because it's a tree, none of those ancestors are the same as the first set. Because none of the nodes on the two paths are the same, none of the edges can be the same and hence succ(n) and succ(n') do not traverse any of the same edges
3) Now we just need to show that at least one path of each type exists for a given edge. Well take any such edge e = (c,p) (note here I am ignoring the special edges on the rightmost branch which are technically only visited once and I am also ignoring the special edges on the leftmost branch which are technically visited once by find_min() and then once by succ() calls)
If it's from a left child c to its parent p then succ(c) will cover the second type of path. To find the other path, keep going up p's ancestors p_1, p_2, ..., p_k such that p_(k-1) is to the right of p_k. succ(p_k) will traverse a path containing e by definition (since e is on the leftmost branch of the subtree of p_(k-1) which is p_k's right child).
A similar argument holds for symmetric case when c is the right child of p
To summarize the proof we've shown that succ() generates two types of path. For each type of path, all of the paths of those types do not overlap. Furthermore, for any edge we have at least one of each of those types of paths. Since we call succ() on every node we can finally conclude that each edge is traversed twice (and hence the algorithm is Theta(n)).
Despite how long this proof was, it isn't actually complete (even ignoring the points when I explicitly said I was skipping details!). There are cases where I said something exists without proving it exists. You can figure out those details if you want and it is actually really satisfying to get it completely right (in my opinion at least. Maybe when you're a genius you'll find it tedious, heh)
Hope this helped. Let me know if you want me to clarify some steps

Path from s to e in a weighted DAG graph with limitations

Consider a directed graph with n nodes and m edges. Each edge is weighted. There is a start node s and an end node e. We want to find the path from s to e that has the maximum number of nodes such that:
the total distance is less than some constant d
starting from s, each node in the path is closer than the previous one to the node e (that is, as you traverse the path you keep getting closer to your destination e, measured by the edge weight of the remaining path).
We can assume there are no cycles in the graph. There are no negative weights. Does an efficient algorithm already exist for this problem? Is there a name for this problem?
Whatever you end up doing, do a BFS/DFS starting from s first to see if e can even be reached; this only takes you O(n+m) so it won't add to the complexity of the problem (since you need to look at all vertices and edges anyway). Also, delete all edges with weight 0 before you do anything else since those never fulfill your second criterion.
EDIT: I figured out an algorithm; it's polynomial, but depending on the size of your graphs it may still not be sufficiently efficient. See the edit further down.
Now for some complexity. The first thing to think about here is an upper bound on how many paths we can actually have, so depending on the choice of d and the weights of the edges, we also have an upper bound on the complexity of any potential algorithm.
How many edges can there be in a DAG? The answer is n(n-1)/2, which is a tight bound: take n vertices, order them from 1 to n; for two vertices i and j, add an edge i->j to the graph iff i<j. This sums to a total of n(n-1)/2, since this way, for every pair of vertices, there is exactly one directed edge between them, meaning we have as many edges in the graph as we would have in a complete undirected graph over n vertices.
How many paths can there be from one vertex to another in the graph described above? The answer is 2^(n-2). Proof by induction:
Take the graph over 2 vertices as described above; there is 1 = 2^0 = 2^(2-2) path from vertex 1 to vertex 2: (1->2).
Induction step: assuming there are 2^(n-2) paths from the vertex with number 1 of an n-vertex graph as described above to the vertex with number n, increment the number of each vertex and add a new vertex 1 along with the required n edges. It has its own edge to the vertex now labeled n+1. Additionally, it has 2^(i-2) paths to that vertex for every i in [2;n] (it has all the paths the other vertices have to the vertex n+1 collectively, each "prefixed" with the edge 1->i). This gives us 1 + Σ_{k=2}^{n} 2^(k-2) = 1 + Σ_{k=0}^{n-2} 2^k = 1 + (2^(n-1) - 1) = 2^(n-1) = 2^((n+1)-2).
So we see that there are DAGs that have 2^(n-2) distinct paths between some pairs of their vertices; this is a bit of a bleak outlook, since depending on weights and your choice of d, you may have to consider them all. This in itself doesn't mean we can't choose some form of optimum (which is what you're looking for) efficiently, though.
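A quick way to sanity-check the 2^(n-2) count is to enumerate the paths in that worst-case DAG directly; this little script (entirely my own) counts them by memoized recursion:

from functools import lru_cache

def count_paths(n):
    # paths from vertex 1 to vertex n in the DAG with an edge i->j for every i < j
    @lru_cache(maxsize=None)
    def paths_from(i):               # number of paths from i to n
        if i == n:
            return 1
        return sum(paths_from(j) for j in range(i + 1, n + 1))
    return paths_from(1)

# count_paths(n) == 2**(n - 2) for every n >= 2, e.g. count_paths(5) == 8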
EDIT: Ok so here goes what I would do:
Delete all edges with weight 0 (and smaller, but you ruled that out), since they can never fulfill your second criterion.
Do a topological sort of the graph; in the following, let's only consider the part of the topological sorting of the graph from s to e, and call that the integer interval [s;e]. Delete everything from the graph that isn't strictly in that interval, meaning all vertices outside of it along with the incident edges. During the topSort you'll also be able to see whether there is a path from s to e, so you'll know whether there are any paths s-...->e. Complexity of this part is O(n+m).
Now the actual algorithm:
Traverse the vertices of [s;e] in the order imposed by the topological sorting.
For every vertex v, store a two-dimensional array of information; let's call it prev_v[][], since it's going to store information about the predecessors of a node on the paths leading towards it.
In prev_v[i][j], store how expensive the total path of length i (counted in vertices) is as a sum of the edge weights, if j is the predecessor of the current vertex on that path. For example, prev_{s+1}[1][s] would have the weight of the edge s->s+1 in it, while all other entries in prev_{s+1} would be 0/undefined.
When calculating the array for a new vertex v, all we have to do is check its incoming edges and iterate over the arrays for the start vertices of those edges. For example, let's say vertex v has an incoming edge from vertex w, having weight c. Consider what the entry prev_v[i][w] should be. We have an edge w->v, so we need to set prev_v[i][w] to min(prev_w[i-1][k] for all k, ignoring 0/undefined entries) + c (notice the subscript of the array!); we effectively take the cost of a path of length i-1 that leads to w, and add the cost of the edge w->v. Why the minimum? The vertex w can have many predecessors for paths of length i-1; however, we want to stay below a cost limit, which greedy minimization at each vertex will do for us. We will need to do this for all i in [1;v-s].
While calculating the array for a vertex, do not set entries that would give you a path with cost above d; since all edges have positive weights, we can only get more costly paths with each edge, so just ignore those.
Once you have reached e and finished calculating prev_e, you're done with this part of the algorithm.
Iterate over prev_e, starting with prev_e[e-s]; since we have no cycles, all paths are simple paths and therefore the longest path from s to e can have e-s edges. Find the largest i such that prev_e[i] has a non-zero (meaning defined) entry; if none exists, there is no path fitting your criteria. You can reconstruct any existing path using the arrays of the other vertices.
Now that gives you a space complexity of O(n³) and a time complexity of O(n²m): the arrays have O(n²) entries, and we have to iterate over O(m) arrays, one array for each edge. I think it's very obvious, though, where the wasteful use of data structures here can be optimized using hashing structures and other things than arrays. Alternatively, you could use a one-dimensional array and only store the current minimum instead of recomputing it every time (you'll have to store the sum of edge weights of the path together with the predecessor vertex, though, since you need to know the predecessor to reconstruct the path); that changes the size of the arrays from n² to n, since you now only need one entry per number-of-nodes-on-path-to-vertex, bringing the space complexity down to O(n²) and the time complexity to O(nm). You can also try to do some form of topological sort that gets rid of the vertices from which you can't reach e, because those can be safely ignored as well.
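For what it's worth, here is a compact Python sketch of that per-length DP, using dictionaries instead of the n² arrays and storing (cost, predecessor) per path length. It assumes positive edge weights, that the vertices are already given in topological order from s to e, and that all edges stay within those vertices; all names are mine:

def longest_bounded_path(topo, edges, s, e, d):
    # topo: vertices from s to e in topological order
    # edges: dict u -> list of (v, w), w > 0, both endpoints in topo
    # dp[v][i] = (cost, pred): cheapest path s->v with i edges and cost < d
    INF = float('inf')
    dp = {v: {} for v in topo}
    dp[s][0] = (0, None)
    for u in topo:
        for v, w in edges.get(u, []):
            for i, (cost, _) in dp[u].items():
                new = cost + w
                if new < d and new < dp[v].get(i + 1, (INF, None))[0]:
                    dp[v][i + 1] = (new, u)    # keep only the cheapest per length
    if not dp[e]:
        return None                            # no path s->e under the bound
    i = max(dp[e])                             # most edges = most nodes
    path, v = [], e
    while v is not None:                       # walk the predecessors back to s
        path.append(v)
        v, i = dp[v][i][1], i - 1
    return path[::-1]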

Graph serialization

I'm looking for a simple algorithm to 'serialize' a directed graph. In particular I've got a set of files with interdependencies on their execution order, and I want to find the correct order at compile time. I know it must be a fairly common thing to do - compilers do it all the time - but my google-fu has been weak today. What's the 'go-to' algorithm for this?
Topological Sort (from Wikipedia):
In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. Every DAG has one or more topological sorts.
Pseudocode:
L ← empty list where we put the sorted elements
Q ← set of all nodes with no incoming edges
while Q is non-empty do
    remove a node n from Q
    insert n into L
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into Q
if graph has edges then
    output error message (graph has a cycle)
else
    output message (proposed topologically sorted order: L)
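A direct Python translation of that pseudocode (Kahn's algorithm) might look like this sketch; representing the graph as an adjacency dict is my own choice:

from collections import deque

def topo_sort(nodes, edges):
    # nodes: iterable of all nodes; edges: dict u -> list of v (u must come before v),
    # with every endpoint appearing in nodes
    indegree = {v: 0 for v in nodes}
    for u in edges:
        for v in edges[u]:
            indegree[v] += 1
    q = deque(v for v in indegree if indegree[v] == 0)   # Q: nodes with no incoming edges
    order = []                                           # L: the sorted output
    while q:
        n = q.popleft()
        order.append(n)
        for m in edges.get(n, []):
            indegree[m] -= 1                             # "remove edge n->m from the graph"
            if indegree[m] == 0:
                q.append(m)
    if len(order) != len(indegree):                      # edges remain: there was a cycle
        raise ValueError("graph has a cycle")
    return order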
I would expect tools that need this to simply walk the tree in a depth-first manner and, when they hit a leaf, just process it (e.g. compile it) and remove it from the graph (or mark it as processed, and treat nodes with all leaves processed as leaves).
As long as it's a DAG, this simple stack-based walk should be trivial.
I've come up with a fairly naive recursive algorithm (pseudocode):
Map<Object, List<Object>> source; // map of each object to its dependency list
List<Object> dest; // destination list

function resolve(a):
    if (dest.contains(a)) return;
    foreach (b in source[a]):
        resolve(b);
    dest.add(a);

foreach (a in source):
    resolve(a);
The biggest problem with this is that it has no ability to detect cyclic dependencies: it can go into infinite recursion (i.e. stack overflow ;-p). The only way around that I can see would be to flip the recursive algorithm into an iterative one with a manual stack, and manually check the stack for repeated elements.
Anyone have something better?
If the graph contains cycles, how can there exist allowed execution orders for your files? It seems to me that if the graph contains cycles, then you have no solution, and this is reported correctly by the above algorithm.
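For what it's worth, the recursive version can detect cycles directly, without switching to a manual stack: track the set of nodes currently being resolved and fail when a dependency loops back into it. A minimal Python sketch of that variant (all names are mine):

def resolve_all(source):
    # source: dict mapping each object to its list of dependencies
    dest, done, in_progress = [], set(), set()

    def resolve(a):
        if a in done:
            return
        if a in in_progress:                 # we looped back: cyclic dependency
            raise ValueError("cyclic dependency involving %r" % (a,))
        in_progress.add(a)
        for b in source.get(a, []):
            resolve(b)
        in_progress.discard(a)
        done.add(a)
        dest.append(a)                       # all of a's dependencies precede it

    for a in source:
        resolve(a)
    return dest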
