How to prove that finding a successor n-1 times in the BST from the minimum node is O(n)?

The question is: we can produce the sorted order of a BST by
1) letting node = the minimum node of the BST;
2) from that node, repeatedly calling find-successor.
I was told that the total cost is O(n), but I do not understand why and do not know how to prove it.
Shouldn't it be O(n log n) instead? Step 1 is O(log n), and step 2 is also O(log n) per call but is called n-1 times, so the total would be O(n log n).
Please clarify my doubt. Thank you! :)

You are correct that any individual operation might take O(log n) time, so if you perform those operations n times, you should get a runtime of O(n log n). This bound is correct, but it's not tight. The actual runtime is Θ(n).
One way to see this is to look at any individual edge in the tree. How many times will you visit each edge if you start at the leftmost node and repeatedly perform a successor query? If you look closely at how the operations work, you'll discover that every edge is visited at most twice: once downward and once upward (edges on the rightmost branch are only descended, unless you climb back to the root at the end). Since all the work done is done traversing up and down edges, this means that the total amount of work done is at most proportional to twice the number of edges. In any tree, the number of edges is the number of nodes minus one, and each of the n-1 successor calls traverses at least one edge, so the total work done is Θ(n).
To formalize this as a proof, try showing that you never descend down the same edge twice and that when you ascend up an edge, you never descend down that edge again. Once you've done this, the conclusion that the runtime is Θ(n) follows from the above logic.
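To make the edge-counting argument concrete, here is a minimal sketch in Python that counts every edge traversal during the walk. The `Node` class, the `insert` helper and the `counter` list are illustrative assumptions, not part of the question. The sketch also makes one extra successor call on the maximum node, so that the walk climbs back to the root and every edge is crossed exactly twice.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None


def insert(root, key):
    """Plain (unbalanced) BST insert; returns the root of the tree."""
    node = Node(key)
    if root is None:
        return node
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = node
                break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = node
                break
            cur = cur.right
    node.parent = cur
    return root


def successor(node, counter):
    """Return the in-order successor; counter[0] counts edges traversed."""
    if node.right is not None:
        node = node.right              # one step right ...
        counter[0] += 1
        while node.left is not None:   # ... then all the way left
            node = node.left
            counter[0] += 1
        return node
    # No right child: climb while we are a right child.
    while node.parent is not None and node is node.parent.right:
        node = node.parent
        counter[0] += 1
    if node.parent is not None:        # final step up from a left child
        counter[0] += 1
    return node.parent


def sorted_walk(root):
    """Return (keys in sorted order, total number of edge traversals)."""
    counter = [0]
    node = root
    while node.left is not None:       # step 1: find the minimum
        node = node.left
        counter[0] += 1
    keys = []
    while node is not None:            # step 2: repeated successor calls
        keys.append(node.key)
        node = successor(node, counter)
    return keys, counter[0]
```

For a tree with n nodes, `sorted_walk` reports 2 * (n - 1) edge traversals regardless of the tree's shape, which is the linear bound described above.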
Hope this helps!

I wanted to post this as a comment on templatetypedef's answer, but it's too long.
His answer is right in that the easiest way to see that this is linear is because every edge is visited exactly twice, and the number of edges in a tree is always one less than the number of nodes (because every node has one parent, except the root!).
The issue is that the way he phrases the formal proof uses words that seem to imply contradiction as the way to go. In general, mathematicians frown on using contradiction because it often produces proofs with superfluous content. For instance:
Proof that 2 + 2 != 5:
Assume for contradiction that 2 + 2 = 5 (<- Remove this line)
Well 2 + 2 = 4
And 4 != 5
Contradiction! (<- Remove this line)
Contradiction tends to be verbose, and sometimes it can even obfuscate the idea behind the proof! There are times when contradiction seems pretty much necessary, but it's relatively rare and that's a separate discussion.
In this case, I don't see a proof by contradiction being any easier than a direct proof. On the other hand, regardless of proof technique, this proof is pretty ugly to do formally. Here's an attempt:
1) The succ(n) algorithm traverses one of two paths
In the first case every edge is visited on the simple path from a node to the leftmost node of its right subtree
In the other case, the node n has no right child, in which case we go up its ancestors p_1, p_2, p_3, ..., p_k such that p_(k-1) is the first ancestor which is the left child of its parent. All of the edges on that simple path are visited
We want to show that an arbitrary edge is traversed in precisely two succ() calls, once for the first case of succ() and once for the second case of succ(). Well, this is true for every edge other than the rightmost branch, but you can handle those edge cases separately. Alternatively we could prove the simpler argument where we return to the root after visiting the last element
This is two-fold because for a given edge e we have to find the n1 and n2 such that succ(n1) traverses e and succ(n2) also traverses e, as well as prove that every other succ() generates a path which does not include e.
2) First we actually prove that for each type of path that succ() visits, no two paths overlap (i.e. if succ(n) and succ(n') both traverse paths of the same type, those paths share no edges)
In the first case, the simple path is precisely defined as follows. Start at node n and go one edge to the right to r. Then traverse the left branch of the subtree rooted at r. Now consider any other such path that starts at some other node n' (note, we don't assume that n != n'). It must go right one node to r'. Then it traverses the leftmost branch of the subtree rooted at r'. If the paths overlap then pick one of the edges that overlap. If it's (n,r) = (n',r') then we have n = n' and so it's the same path. If it's some e = e' in both leftmost branches then you can show, again, that n = n' (you can trace the leftmost branches and show that every edge is the same, then finally reach the conclusion that r = r' => n = n' because for a tree the parent is unique. You'll see this tracing argument below). Thus we know that for any n and n', if their paths overlap, they are actually the same node! The contrapositive says this: if they are different nodes, then their paths don't overlap. That's exactly what we want (and the contrapositive is always equally true to the original statement).
In the second case we define the simple path starting at node n and go up the ancestors p_1, p_2, ..., p_k = g until we reach the first node p_k such that p_(k-1) is to the left of p_k. Consider some other path of the same type that starts at node n' where n != n'. Similarly it visits p_1', p_2', ..., p_k' = g'. Because it's a tree, none of those ancestors are the same as the first set. Because none of the nodes on the two paths are the same, none of the edges can be the same and hence succ(n) and succ(n') do not traverse any of the same edges
3) Now we just need to show that at least one path of each type exists for a given edge. Well take any such edge e = (c,p) (note here I am ignoring the special edges on the rightmost branch which are technically only visited once and I am also ignoring the special edges on the leftmost branch which are technically visited once by find_min() and then once by succ() calls)
If it's from a left child c to its parent p then succ(c) will cover the second type of path. To find the other path, keep going up p's ancestors p_1, p_2, ..., p_k such that p_(k-1) is to the right of p_k. succ(p_k) will traverse a path containing e by definition (since e is on the leftmost branch of the subtree of p_(k-1) which is p_k's right child).
A similar argument holds for symmetric case when c is the right child of p
To summarize the proof we've shown that succ() generates two types of path. For each type of path, all of the paths of those types do not overlap. Furthermore, for any edge we have at least one of each of those types of paths. Since we call succ() on every node we can finally conclude that each edge is traversed twice (and hence the algorithm is Theta(n)).
Despite how long this proof was, it isn't actually complete (even ignoring the points when I explicitly said I was skipping details!). There are cases where I said something exists without proving it exists. You can figure out those details if you want and it is actually really satisfying to get it completely right (in my opinion at least. Maybe when you're a genius you'll find it tedious, heh)
Hope this helped. Let me know if you want me to clarify some steps

Related

Proof by contradiction for edge in tree

I have a problem from my textbook that goes as follows. Assume that I have a shortest path matrix S that could look like the following:
And a tree T that consists of the shortest paths constructed from the shortest path matrix S (like a minimum spanning tree).
The tree has the following properties:
n - 1 edges, and all nodes are connected with each other.
The task is then to prove by contradiction that if the entry S_{ij} has the minimum value, then that entry must be an edge in the tree T. I don't quite understand what there is to prove. The way I see it, if we assume that T does not contain the smallest element from S, then we will have a contradiction at the end, since there will be a path that is larger than the one chosen with the smallest element. This doesn't seem like much of a proof to me, and even if it is, I don't see how I could generalize the proof.
Since T is a tree, there exists only one path between every pair of nodes. If nodes i and j are not connected by an edge, then the path connecting them has to contain at least one more node; call it k. Because S_{ij} is the minimum entry, the following then holds for S_{ij} (the length of the path between i and j):
S_{ij} = S_{ik} + S_{kj} >= S_{ij} + S_{ij} = 2 * S_{ij}
This is a contradiction (assuming path lengths are positive).

Proving a recursive algorithm

I need to prove a recursive algorithm. Normally this would be done using some integer value within the code as the base case for induction like when computing a factorial but with a graph traversal I have no idea where to begin. Here is my algorithm. Subscripts didn't convert.
Algorithm
Goal: Traverse a graph creating a depth-first spanning tree, and compute the Last descendent of each vertex, that is, the descendent vk that has the highest value of k
Input:
A connected graph G with vertices ordered v1, v2, v3 … vn
Output:
A spanning tree T where each vertex in T has had its Last vertex computed
Initialization
Set each vertex to unvisited. Let ak denote a list of all vertices adjacent to vk. Let lk denote the Last descendent of vk. Let ck denote the list of all the children of vk in the spanning tree. Let dk denote the list of all vertices that are descendents of vk in the spanning tree including vk.
    dfs(vk){
        add vK to T
        set v to visited
        lk = vk
        add vk to dk
        foreach(vertex m in ak with lowest value of k){
            add m to ck
            add dfs(m) to dk
        }
        foreach(vertex vc in dk){
            if( c > k){
                lk = vc
            }
        }
        if(k = 1)
            return T
        else
            return dk
    }
This is for a group project at school so I don't want the whole proof but a starting point and some direction would be greatly appreciated.
I'm having a hard time understanding your pseudocode. It is unclear at best, and the algorithm probably doesn't even work. Some issues:
The "visited" property is set, but never used.
T is supposed to be a tree. But you only ever add vertices, no edges to it. Without edges, it's certainly not a spanning tree. If you consider all edges from the original graph between the nodes in T part of T, then it will contain cycles and won't be a tree, either.
Why do you calculate lk, if it's never used?
Is the parameter of your function dfs supposed to be k instead of vk? Otherwise k is never set, but it is being used.
I fail to see how your recursion ever terminates. I don't see any guards for the base case (except for maybe the "with lowest value of k" condition in your loop - which I don't understand because I understand from the rest of the code that k is the input parameter of the function and therefore m doesn't depend on it).
So let me tell you about proving recursive algorithms over graphs in general. Apart from the induction over natural numbers that you mentioned, there is also structural induction. It has a base case and an inductive step, just like the induction you know. But the base case is usually a trivial component of your data structure, and the inductive step proves your proposition for more complex composites, assuming that it holds for their less complex components.
For example, you can prove an algorithm over trees by proving that it works for the leaves (your base case) and by proving that it works for a whole tree, assuming that it works for the left and right subtrees of the root (inductive step).
Since your graph, unlike the tree example above, may contain cycles, it's not automatically guaranteed that the data structure passed to your recursive call is less complex than the original one. But your algorithm probably has some way to ensure that the recursive call takes into account only a part of the graph, probably via the "visited" flag. In that case the recursive call has to consider only the "not visited" subgraph, so you can do induction over those unvisited subgraphs: start from the base case with only one vertex being unvisited, and then inductively add one unvisited node (including its edges) to the unvisited subgraph.
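As an illustration of where the induction hooks into the code, here is a small Python sketch of a DFS that computes the last descendant of every vertex. It is not the asker's algorithm: the function name `dfs_last`, the adjacency-dict input and the returned `parent`/`last` maps are assumptions made for this example. The comments mark the base case and the inductive step that a proof would follow.

```python
def dfs_last(graph, start):
    """Build a DFS spanning tree and the last descendant of each vertex.

    `graph` is assumed to map each vertex index to a list of adjacent
    indices (connected, undirected). Returns (parent, last), where
    parent[v] is v's parent in the spanning tree and last[v] is the
    highest index among v and its descendants.
    """
    visited = set()
    parent = {start: None}
    last = {}

    def visit(k):
        visited.add(k)
        # Base case: with no unvisited neighbours, the subtree is {k}.
        last[k] = k
        for m in sorted(graph[k]):          # lowest index first
            if m not in visited:
                parent[m] = k               # tree edge (k, m)
                visit(m)                    # runs on a smaller unvisited set
                # Inductive step: last[m] is correct for m's subtree,
                # so the maximum over the children is correct for k.
                last[k] = max(last[k], last[m])

    visit(start)
    return parent, last
```

Each recursive call sees strictly fewer unvisited vertices than its caller, which is the quantity the induction runs over, and it is also why the recursion terminates.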

k successive calls to tree successor in bst

Prove that k successive calls to tree successor take O(k+h) time. Since each node is visited at most twice, the number of node visits should be bounded by 2k, so the time complexity should be O(k). I don't get where the O(h) term comes from. Is it because of nodes that are visited but are not one of the successors? I can't explain to myself how the O(h) term enters the process.
PS: I know this question already exists, but I was not able to understand the solution.
The plus in O(k+h) is just another way of writing O(max(k, h)).
Finding a successor once could take up to O(h) time. To see why this is true, consider looking for the successor of the rightmost node of the left subtree of the root: its successor is the root itself, so you must climb the whole height of the left subtree, and the next call then has to descend to the bottom of the right subtree. That's why you need to include h in the calculation: if k is small compared to h, then h would dominate the running time of the algorithm.
The point of the exercise is to prove that the time of calling the successor k times in a row is not O(k*h), as one could imagine after observing that a single call could take up to O(h). You prove it by showing that the cost of traversing the height of the tree is distributed among the k calls, as you did by noting that each node is visited at most twice.
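A quick way to see the height being paid only once is to record the cost of each call separately. The sketch below is an illustration in Python; `Node`, `build`, `height`, `successor_cost` and `k_successors` are names made up for this example, and the tree is built balanced from a sorted list just to have something to walk.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None


def build(keys, parent=None):
    """Build a balanced BST from a sorted list; returns its root (or None)."""
    if not keys:
        return None
    mid = len(keys) // 2
    node = Node(keys[mid], parent)
    node.left = build(keys[:mid], node)
    node.right = build(keys[mid + 1:], node)
    return node


def height(node):
    """Height in edges; an empty tree has height -1."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def successor_cost(node):
    """Return (successor, number of edges traversed to reach it)."""
    steps = 0
    if node.right is not None:
        node, steps = node.right, 1
        while node.left is not None:
            node, steps = node.left, steps + 1
        return node, steps
    while node.parent is not None and node is node.parent.right:
        node, steps = node.parent, steps + 1
    if node.parent is not None:
        steps += 1
    return node.parent, steps


def k_successors(start, k):
    """Call successor up to k times; return (keys reached, per-call costs)."""
    keys, costs, node = [], [], start
    for _ in range(k):
        node, steps = successor_cost(node)
        costs.append(steps)
        if node is None:
            break
        keys.append(node.key)
    return keys, costs
```

On the 15-node tree built from keys 1..15 (height 3), starting at key 7 (the rightmost node of the root's left subtree), three calls cost 3, 3 and 1 edges: two calls pay for the height, and the rest are cheap, so the total grows like k + h rather than k * h.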

Split a tree into equal parts by deleting an edge

I am looking for an algorithm to split a tree with N nodes (where the maximum degree of each node is 3) by removing one edge from it, so that the two resulting trees have sizes as close as possible to N/2. How do I find the edge that is "the most centered"?
The tree comes as an input from a previous stage of the algorithm and is input as a graph - so it's not balanced nor is it clear which node is the root.
My idea is to find the longest path in the tree and then select the edge in the middle of the longest path. Does it work?
Optimally, I am looking for a solution that can ensure that neither of the trees has more than 2N / 3 nodes.
Thanks for your answers.
I don't believe that your initial algorithm works for the reason I mentioned in the comments. However, I think that you can solve this in O(n) time and space using a modified DFS.
Begin by walking the graph to count how many total nodes there are; call this n. Now, choose an arbitrary node and root the tree at it. We will now recursively explore the tree starting from the root and will compute for each subtree how many nodes are in each subtree. This can be done using a simple recursion:
If the current node is null, return 0.
Otherwise:
For each child, compute the number of nodes in the subtree rooted at that child.
Return 1 + the total number of nodes in all child subtrees
At this point, we know for each edge what split we will get by removing that edge, since if the subtree below that edge has k nodes in it, the split will be (k, n - k). You can thus find the best cut by iterating across all non-root nodes (each one identifies the edge to its parent) and looking for the one that balances (k, n - k) most evenly.
Counting the nodes takes O(n) time, and running the recursion visits each node and edge at most O(1) times, so that takes O(n) time as well. Finding the best cut takes an additional O(n) time, for a net runtime of O(n). Since we need to store the subtree node counts, we need O(n) memory as well.
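The procedure described above could be sketched in Python as follows. The function name `best_edge_to_cut` and the adjacency-dict input format are assumptions for this example, and the recursion is replaced by an explicit stack so that deep trees do not hit the recursion limit. It assumes the tree has at least two nodes.

```python
def best_edge_to_cut(adj):
    """Find the edge whose removal splits a tree most evenly.

    `adj` is assumed to map each node to a list of its neighbours in an
    undirected tree with at least two nodes.
    Returns ((parent, child), (k, n - k)) for the best edge.
    """
    n = len(adj)
    root = next(iter(adj))             # root the tree at an arbitrary node
    parent = {root: None}
    order = []                         # nodes in DFS preorder
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for nb in adj[node]:
            if nb != parent[node]:
                parent[nb] = node
                stack.append(nb)

    # Subtree sizes: process children before parents.
    size = {node: 1 for node in adj}
    for node in reversed(order):
        if parent[node] is not None:
            size[parent[node]] += size[node]

    # Each non-root node stands for the edge to its parent.
    best = min((node for node in adj if parent[node] is not None),
               key=lambda node: abs(n - 2 * size[node]))
    return (parent[best], best), (size[best], n - size[best])
```

For example, on the path 1-2-3-4-5-6 this picks the middle edge (3, 4) and reports the split (3, 3).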
Hope this helps!
In my answer to "Divide-And-Conquer Algorithm for Trees" I find a node that partitions the tree into two nearly equal-sized parts (a bottom-up algorithm); you then just need to choose one of the edges at this node to get what you want.
Your current approach does not work. Take a complete binary tree and attach a path of length 3*log n to one of its leaves (call it the bad leaf). The longest path then runs from some other leaf to the end of the path attached to the bad leaf, and its middle edge lies on the attached path (in fact, beyond the bad leaf). If you split on this edge, one part has size O(log n) and the other has size O(n).

Find all subtrees of size N in an undirected graph

Given an undirected graph, I want to generate all subgraphs which are trees of size N, where size refers to the number of edges in the tree.
I am aware that there are a lot of them (exponentially many at least for graphs with constant connectivity) - but that's fine, as I believe the number of nodes and edges makes this tractable for at least smallish values of N (say 10 or less).
The algorithm should be memory-efficient - that is, it shouldn't need to have all graphs or some large subset of them in memory at once, since this is likely to exceed available memory even for relatively small graphs. So something like DFS is desirable.
Here's what I'm thinking, in pseudo-code, given the starting graph graph and desired length N:
Pick any arbitrary node, root as a starting point and call alltrees(graph, N, root)
    alltrees(graph, N, root)
        given that node root has degree M, find all M-tuples with integer, non-negative values whose values sum to N (for example, for 3 children and N=2, you have (0,0,2), (0,2,0), (2,0,0), (0,1,1), (1,0,1), (1,1,0), I think)
        for each tuple (X1, X2, ... XM) above
            create a subgraph "current" initially empty
            for each integer Xi in X1...XM (the current tuple)
                if Xi is nonzero
                    add edge i incident on root to the current tree
                    add alltrees(graph with root removed, N-1, node adjacent to root along edge i)
            add the current tree to the set of all trees
        return the set of all trees
This finds only trees containing the chosen initial root, so now remove this node and call alltrees(graph with root removed, N, new arbitrarily chosen root), and repeat until the size of the remaining graph < N (since no trees of the required size will exist).
I forgot also that each visited node (each root for some call of alltrees) needs to be marked, and the set of children considered above should only be the adjacent unmarked children. I guess we need to account for the case where no unmarked children exist, yet depth > 0, this means that this "branch" failed to reach the required depth, and cannot form part of the solution set (so the whole inner loop associated with that tuple can be aborted).
So will this work? Any major flaws? Any simpler/known/canonical way to do this?
One issue with the algorithm outlined above is that it doesn't satisfy the memory-efficient requirement, as the recursion will hold large sets of trees in memory.
This needs an amount of memory that is proportional to what is required to store the graph. It will return every subgraph that is a tree of the desired size exactly once.
Keep in mind that I just typed it into here. There could be bugs. But the idea is that you walk the nodes one at a time, for each node searching for all trees that include that node, but none of the nodes that were searched previously. (Because those have already been exhausted.) That inner search is done recursively by listing edges to nodes in the tree, and for each edge deciding whether or not to include it in your tree. (If it would make a cycle, or add an exhausted node, then you can't include that edge.) If you include it in your tree, then the used nodes grow, and you have new possible edges to add to your search.
To reduce memory use, the list of edges left to look at is manipulated in place by all of the levels of the recursive call, rather than the more obvious approach of duplicating that data at each level. If that list were copied, your total memory usage would get up to the size of the tree times the number of edges in the graph.
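    def find_all_trees(graph, tree_length):
        """Yield every subtree of `graph` with exactly `tree_length` edges.

        `graph` is assumed to be a dict mapping each node to an iterable of
        its neighbors (undirected, so each edge appears in both lists).
        Each tree is yielded once, as a frozenset of frozenset({u, v}) edges.
        """
        exhausted_node = set()   # roots whose trees were all reported already
        used_node = set()        # nodes of the tree under construction
        used_edge = set()        # edges of the tree under construction
        edge_groups = []         # one list of candidate edges per used node

        def neighbors(node):
            # Candidate edges leaving `node`, oriented away from it.
            return [(node, other) for other in graph[node]]

        def finish_all_trees(remaining_length, edge_group, edge_position):
            while edge_group < len(edge_groups):
                edges = edge_groups[edge_group]
                while edge_position < len(edges):
                    edge = edges[edge_position]
                    edge_position += 1
                    node1, node2 = edge
                    if node1 in exhausted_node or node2 in exhausted_node:
                        continue
                    node = node1
                    if node1 in used_node:
                        if node2 in used_node:
                            continue  # the edge would close a cycle
                        node = node2
                    used_node.add(node)
                    used_edge.add(edge)
                    edge_groups.append(neighbors(node))
                    if remaining_length == 1:
                        # Snapshot: used_edge keeps changing after the yield.
                        yield frozenset(frozenset(e) for e in used_edge)
                    else:
                        yield from finish_all_trees(
                            remaining_length - 1, edge_group, edge_position)
                    # Undo the choice, then continue without this edge.
                    edge_groups.pop()
                    used_edge.remove(edge)
                    used_node.remove(node)
                edge_position = 0
                edge_group += 1

        for node in graph:
            used_node.add(node)
            edge_groups.append(neighbors(node))
            yield from finish_all_trees(tree_length, 0, 0)
            edge_groups.pop()
            used_node.remove(node)
            exhausted_node.add(node)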
Assuming you can destroy the original graph or make a destroyable copy, I came up with something that could work, but it could be utter masochism because I did not work out its big-O complexity. It would probably work for small subtrees.
do it in steps, at each step:
sort the graph nodes so you get a list of nodes sorted by number of adjacent edges ASC
process all nodes with the same number of edges of the first one
remove those nodes
For an example for a graph of 6 nodes finding all size 2 subgraphs (sorry for my total lack of artistic expression):
Well the same would go for a bigger graph, but it should be done in more steps.
Assuming:
Z number of edges of most ramificated node
M desired subtree size
S number of steps
Ns number of nodes in step
assuming quicksort for sorting nodes
Worst case:
S*(Ns^2 + MNsZ)
Average case:
S*(NslogNs + MNs(Z/2))
Problem is: I cannot calculate the real big-O, because the number of nodes in each step decreases depending on the shape of the graph...
Solving the whole thing with this approach could be very time consuming on a graph with highly connected nodes. However, it could be parallelized, and you could do one or two steps to remove dislocated nodes, extract all subgraphs, and then choose another approach for the remainder; you would have removed a lot of nodes from the graph, so it could decrease the remaining run time...
Unfortunately this approach would benefit the GPU, not the CPU, since a LOT of nodes with the same number of edges would go in each step... and if parallelization is not used, this approach is probably bad...
Maybe the inverse would go better with the CPU: sort and proceed with the nodes with the maximum number of edges... there will probably be fewer of those at the start, but you will have more subgraphs to extract from each node...
Another possibility is to calculate the least-occurring edge count in the graph and start with the nodes that have it; that would alleviate the memory usage and iteration count for extracting subgraphs...
Unless I'm reading the question wrong people seem to be overcomplicating it.
This is just "all possible paths within N edges" and you're allowing cycles.
Thus, for two nodes A, B and one edge, your result would be:
AA, AB, BA, BB
For two nodes, two edges your result would be:
AAA, AAB, ABA, ABB, BAA, BAB, BBA, BBB
I would recurse into a for each and pass in a "template" tuple
    N=edge count
    TempTuple = Tuple_of_N_Items ' (01,02,03,...0n) (Could also be an ordered list!)
    ListOfTuple_of_N_Items ' Paths (could also be an ordered list!)
    edgeDepth = N
    Method (Nodes, edgeDepth, TupleTemplate, ListOfTuples, EdgeTotal)
        edgeDepth -=1
        For Each Node In Nodes
            if edgeDepth = 0 'Last Edge
                ListOfTuples.Add New Tuple from TupleTemplate + Node ' (x,y,z,...,Node)
            else
                NewTupleTemplate = TupleTemplate + Node ' (x,y,z,Node,...,0n)
                Method(Nodes, edgeDepth, NewTupleTemplate, ListOfTuples, EdgeTotal)
        next
This will create every possible combination of vertices for a given edge count
What's missing is the factory to generate tuples given an edge count.
You end up with a list of possible paths and the operation is Nodes^(N+1)
If you use ordered lists instead of tuples then you don't need to worry about a factory to create the objects.
If memory is the biggest problem you can use a NP-ish solution using tools from formal verification. I.e., guess a subset of nodes of size N and check whether it's a graph or not. To save space you can use a BDD (http://en.wikipedia.org/wiki/Binary_decision_diagram) to represent the original graph's nodes and edges. Plus you can use a symbolic algorithm to check if the graph you guessed is really a graph - so you don't need to construct the original graph (nor the N-sized graphs) at any point. Your memory consumption should be (in big-O) log(n) (where n is the size of the original graph) to store the original graph, and another log(N) to store every "small graph" you want.
Another tool (which is supposed to be even better) is to use a SAT solver. I.e., construct a SAT formula that is true iff the sub-graph is a graph and supply it to a SAT solver.
For a graph of Kn there are approximately n! paths between any two pairs of vertices. I haven't gone through your code but here is what I would do.
Select a pair of vertices.
Start from a vertex and try to reach the destination vertex recursively (something like dfs but not exactly). I think this would output all the paths between the chosen vertices.
You could do the above for all possible pairs of vertices to get all simple paths.
It seems that the following solution will work.
Go over all partitions of the set of all vertices into two parts. Then count the number of edges whose endpoints lie in different parts (k); these edges correspond to an edge of the tree that connects the subtrees for the first and the second part. Calculate the answer for both parts recursively (p1, p2). Then the answer for the entire graph can be calculated as the sum over all such partitions of k*p1*p2. But each tree will be counted N times: once for each of its edges. So the sum must be divided by N to get the answer.
Your solution as is doesn't work I think, although it can be made to work. The main problem is that the subproblems may produce overlapping trees so when you take the union of them you don't end up with a tree of size n. You can reject all solutions where there is an overlap, but you may end up doing a lot more work than needed.
Since you are OK with exponential runtime, and potentially writing 2^n trees out, having a V*2^V algorithm is not bad at all. So the simplest way of doing it would be to generate all possible subsets of n nodes, and then test whether each one forms a tree. Since testing whether a subset of nodes forms a tree can take O(E*V) time, we are potentially talking about V^2*V^n time, unless you have a graph with O(1) degree. This can be improved slightly by enumerating subsets in such a way that two successive subsets differ in exactly one node being swapped. In that case, you just have to check whether the new node is connected to any of the existing nodes, which can be done in time proportional to the number of outgoing edges of the new node by keeping a hash table of all existing nodes.
The next question is how to enumerate all the subsets of a given size such that no more than one element is swapped between successive subsets. I'll leave that as an exercise for you to figure out :)
I think there is a good algorithm (with Perl implementation) at this site (look for TGE), but if you want to use it commercially you'll need to contact the author. The algorithm is similar to yours in the question but avoids the recursion explosion by making the procedure include a current working subtree as a parameter (rather than a single node). That way each edge emanating from the subtree can be selectively included/excluded, and recurse on the expanded tree (with the new edge) and/or reduced graph (without the edge).
This sort of approach is typical of graph enumeration algorithms -- you usually need to keep track of a handful of building blocks that are themselves graphs; if you try to only deal with nodes and edges it becomes intractable.
This algorithm is big and not an easy one to post here. But here is a link to the reservation search algorithm, using which you can do what you want. This PDF file contains both algorithms. Also, if you understand Russian, you can take a look at this.
So you have a graph with edges e_1, e_2, ..., e_E.
If I understand correctly, you are looking to enumerate all subgraphs which are trees and contain N edges.
A simple solution is to generate each of the E choose N subgraphs and check if they are trees.
Have you considered this approach? Of course if E is too large then this is not viable.
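A minimal sketch of this brute-force idea in Python, assuming the graph is given as a list of `(u, v)` edge pairs. The names `is_tree` and `trees_by_brute_force` are made up for this example, and the tree check uses a small union-find, which is one possible choice rather than anything the approach requires.

```python
from itertools import combinations


def is_tree(edge_subset):
    """True if the edges form a single connected, acyclic graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edge_subset:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # this edge would close a cycle
        parent[ru] = rv
    # An acyclic graph with m edges is connected exactly when it
    # touches m + 1 nodes.
    return len(parent) == len(edge_subset) + 1


def trees_by_brute_force(edges, n):
    """Yield every n-edge subset of `edges` that forms a tree."""
    for subset in combinations(edges, n):
        if is_tree(subset):
            yield subset
```

Because `combinations` is lazy, only one candidate subset is in memory at a time, but the number of candidates is E choose N, so this is only reasonable when E is small.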
EDIT:
We can also use the fact that a tree is a combination of trees, i.e. that each tree of size N can be "grown" by adding an edge to a tree of size N-1. Let E be the set of edges in the graph. An algorithm could then go something like this.
    T = E
    n = 1
    while n<N
        newT = empty set
        for each tree t in T
            for each edge e in E
                if t+e is a tree of size n+1 which is not yet in newT
                    add t+e to newT
        T = newT
        n = n+1
At the end of this algorithm, T is the set of all subtrees of size N. If space is an issue, don't keep a full list of the trees, but use a compact representation, for instance implement T as a decision tree using ID3.
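The growing procedure above might look like this in Python. This is a sketch under assumptions: the name `grow_trees` is made up, edges are given as `(u, v)` pairs with `u != v` and each undirected edge listed once, and a tree is represented as a frozenset of edges so that duplicates collapse automatically. It keeps all trees of the current size in memory, so it does not address the space concern.

```python
def grow_trees(edges, n):
    """Return the set of all n-edge subtrees, each a frozenset of edges.

    `edges` is assumed to be an iterable of (u, v) pairs with u != v,
    each undirected edge listed once.
    """
    edges = [frozenset(e) for e in edges]
    trees = {frozenset([e]) for e in edges}       # T = E: all 1-edge trees
    for _ in range(n - 1):
        new_trees = set()
        for tree in trees:
            nodes = frozenset().union(*tree)      # nodes touched by the tree
            for e in edges:
                # Exactly one endpoint inside the tree: the result stays
                # connected and cannot contain a cycle.
                if len(e & nodes) == 1:
                    new_trees.add(tree | {e})
        trees = new_trees
    return trees
```

The test "t+e is a tree" reduces to checking that the new edge has exactly one endpoint already in the tree, and using a set for `new_trees` covers the "not yet in newT" condition.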
I think the problem is under-specified. You mentioned that the graph is undirected and that the subgraphs you are trying to find are of size N. What is missing is the number of edges, and whether the trees you are looking for are binary or you allow multi-way trees. Also, are you interested in mirrored reflections of the same tree, or in other words, does the order in which siblings are listed matter at all?
I assume a single node in the trees you are trying to find is allowed to have more than 2 children, which should be the case given that you don't specify any restriction on the initial graph and you mentioned that the resulting subgraph should contain all nodes.
You can enumerate all subgraphs that have the form of a tree by performing a depth-first traversal. You need to repeat the traversal of the graph for every sibling during the traversal. Then you'll need to repeat the operation for every node as a root.
Discarding symmetric trees, you will end up with
N^(N-2)
trees if your graph is a fully connected mesh; otherwise you need to apply Kirchhoff's matrix-tree theorem.
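For the counting part, Kirchhoff's theorem says the number of spanning trees equals any cofactor of the graph's Laplacian matrix. Here is a minimal sketch in Python that computes it exactly with `fractions.Fraction`; the function name `count_spanning_trees` and the input format (nodes numbered 0..n-1, edges as pairs, no self-loops) are assumptions for this example. Note that it counts spanning trees only, not smaller subtrees.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Count spanning trees of an undirected graph on nodes 0..n-1.

    `edges` is assumed to be a list of (u, v) pairs without self-loops.
    """
    # Laplacian: degrees on the diagonal, -1 for each edge off it.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1

    # Delete the last row and column, then take the determinant
    # by Gaussian elimination with exact rational arithmetic.
    m = [row[:-1] for row in lap[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0                     # singular: graph is disconnected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                   # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)
```

For a complete graph on N nodes this reproduces the N^(N-2) figure quoted above, for example 125 for N = 5.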
