find a tree, given the data on its leaves - algorithm

Let there be an undirected tree T, with T.leaves being the set of all leaves (every v such that d(v) = 1). We know |T.leaves| and the distance between u and v for every pair u, v in T.leaves.
In other words: we have an undirected tree, and we know how many leaves it has and the distance between every two leaves.
We need to find how many internal vertices (d(v) > 1) are in the tree.
Note: building the complete tree is infeasible, because if we have only 2 leaves but the distance between them is 2^30, it would take too long...
I tried to start from the pair with the shortest distance and count how many vertices lie between them, then repeatedly attach the next closest leaf, but for this I need some formula f(leaves_counted, next_leaf), and I could not manage to find that f...
any ideas?

Continued from discussion in comments. This is how to check a particular (compressed) edge to see if you can attach the new vertex n somewhere in the middle of it, without iterating over the distances.
OK, so you need to find three numbers: l (the distance of the attach point from the left node A of the edge in question), x (the distance of the new node from the attach point) and r (symmetrically, the distance from the right node B).
Obviously, for every node y in the set L (the left part of the tree), its distance to n must differ from its distance to A by the same number (let's call it dl, which must equal l + x). If this is not the case, there is no solution for this particular edge. The same goes for nodes in R, with dr = r + x respectively.
If the above holds, then you have three equations:
l + x = dl
r + x = dr
l + r = dist(A, B)
Three equations, three unknowns. If this system has a valid solution, then you have found the right edge.
At worst you need to iterate the above for every edge, but I think it can be optimized - the distance check on L and R might exclude one of the parts of the tree from further search. It might also be possible to somehow get the number of nodes without even constructing the tree.
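For what it's worth, the system can be solved in O(1) once dl and dr are known. A minimal Python sketch (the names are mine, purely illustrative):

def attach_point(dl, dr, dist_ab):
    # Solve l + x = dl, r + x = dr, l + r = dist(A, B).
    # Adding the first two equations gives l + r + 2x = dl + dr,
    # hence 2x = dl + dr - dist_ab.
    two_x = dl + dr - dist_ab
    if two_x <= 0 or two_x % 2 != 0:
        return None          # no integer solution with x >= 1
    x = two_x // 2
    l, r = dl - x, dr - x
    if l < 0 or r < 0:
        return None          # attach point would fall outside the edge
    return l, x, r           # distances: A..attach, attach..n, attach..B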

If your binary tree has L leaves, then it has L - 1 internal vertices, regardless of the shape of the tree.
You can easily prove this: start with the tree consisting of only one node (the root). Then take any leaf and add two descendants to it, converting that leaf into an internal vertex and adding two leaves. This removes one leaf (the old one) and adds one internal node and two leaves, i.e. the net change is +1 internal node and +1 leaf. Because you start with one leaf and 0 internal nodes, you always have |leaves| = |internal nodes| + 1, and any such tree shape can be produced by this process.
Here are examples of both shapes of trees with 4 leaves (up to trivial left-right symmetries):

    o             o
   o L          o   o
  o L          L L L L
 L L
The number of internal vertices is always 3.
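A quick simulation of the growth process from the proof (my own sketch, not part of the original argument) confirms the invariant for random shapes:

import random

def grow_random_tree(leaf_target):
    # Start from a single root leaf and repeatedly split a random leaf
    # into an internal node with two new leaf children.
    leaves, internal = [0], []
    next_id = 1
    while len(leaves) < leaf_target:
        v = leaves.pop(random.randrange(len(leaves)))
        internal.append(v)                 # the split leaf becomes internal
        leaves += [next_id, next_id + 1]   # net change: +1 leaf, +1 internal
        next_id += 2
    return len(leaves), len(internal)

# |leaves| = |internal| + 1 regardless of the random shape:
assert all(grow_random_tree(n) == (n, n - 1) for n in range(2, 12))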

Specific graph problem in need of a more creative solution

A directed graph (|V| = a, |E| = b) is given. Each vertex has a specific weight. For each vertex (1..a), we want to find a vertex with maximum weight that is reachable from that vertex.
Update 1: one nice answer was prepared by @Paul in O(b + a log a), but I am searching for an O(a + b) algorithm, if one exists.
Is there any other, more efficient way of doing it?
Yes, it's possible to modify Tarjan's SCC algorithm to solve this problem in linear time.
Tarjan's algorithm uses two node fields to drive its SCC finding logic: index, which represents the order in which the algorithm discovers the nodes; and lowlink, the minimum index reachable by a sequence of tree arcs followed by a back arc. As part of the same depth-first traversal, we can compute another field, maxweight, which has one of two meanings:
For a node not yet included in a finished SCC, it represents the maximum weight reachable by a sequence of tree arcs, optionally followed by a cross arc to another SCC and then any subsequent path.
For nodes in a finished SCC, it represents the maximum weight reachable.
The logic for computing maxweight is as follows. If we discover an arc from v to a new node w, then vw is a tree arc, so we compute w.maxweight recursively and update v.maxweight = max(v.maxweight, w.maxweight). If w is on the stack, then we do nothing, because vw is a back arc and not included in the definition of maxweight. Otherwise, vw is a cross arc, and we do the same update that we would have done for a tree arc, just without the recursive call.
When Tarjan's algorithm identifies an SCC, it's because it has a node r with r.lowlink == r.index. Since r is the depth-first search root of this SCC, its value of maxweight is correct for the whole SCC. Instead of recording each node in the SCC, we simply update its maxweight to r.maxweight.
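A sketch of how that modification might look in Python (illustrative code, not from the answer; a production version would replace the recursion with an explicit stack to survive deep graphs):

def max_reachable_weight(succ, weight):
    # succ: dict mapping every node to its list of successors.
    # weight: dict mapping every node to its weight.
    # Returns a dict mapping each node to the maximum weight reachable
    # from it (including itself), via a single modified Tarjan pass.
    index, lowlink, maxweight = {}, {}, {}
    stack, on_stack = [], set()
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        maxweight[v] = weight[v]
        for w in succ[v]:
            if w not in index:                    # tree arc: recurse
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
                maxweight[v] = max(maxweight[v], maxweight[w])
            elif w in on_stack:                   # back arc: ignored for maxweight
                lowlink[v] = min(lowlink[v], index[w])
            else:                                 # cross arc to a finished SCC
                maxweight[v] = max(maxweight[v], maxweight[w])
        if lowlink[v] == index[v]:                # v is the root of an SCC:
            while True:                           # give the whole SCC v's value
                w = stack.pop()
                on_stack.remove(w)
                maxweight[w] = maxweight[v]
                if w == v:
                    break

    for v in succ:
        if v not in index:
            strongconnect(v)
    return maxweight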
Sort all nodes by weight in decreasing order and create the graph g' with all edges of E reversed (i.e. if there is an edge a -> b in g, there is an edge b -> a in g'). In this graph you can now propagate the maximum weight by a simple DFS. Do this iteratively for all nodes, stopping each search early at nodes whose maximum weight has already been assigned.
As pseudocode:
dfs_assign_weight_reachable(node, weight):
    if node.max_weight_reachable >= weight:
        return
    node.max_weight_reachable = weight
    for each neighbor n of node:
        dfs_assign_weight_reachable(n, weight)

g' = g with all edges reversed
nodes = nodes from g' sorted descendingly by weight
assign max_weight_reachable = -inf to each node in nodes
for node in nodes:
    dfs_assign_weight_reachable(node, node.weight)
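A runnable Python version of the same idea (a sketch; again, the recursion should be made iterative for large graphs):

def assign_max_weight_reachable(reversed_adj, weight):
    # reversed_adj: adjacency dict of g' (all edges of g reversed).
    max_reach = {v: float('-inf') for v in reversed_adj}

    def dfs(v, w):
        if max_reach[v] >= w:        # already reached from a heavier node
            return
        max_reach[v] = w
        for u in reversed_adj[v]:
            dfs(u, w)

    # Heaviest nodes first, so each node is labeled at most once.
    for v in sorted(reversed_adj, key=weight.get, reverse=True):
        dfs(v, weight[v])
    return max_reach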
UPDATE:
The tight bound is O(b + a log a). The a log a term comes from the sorting step; each edge is visited once during the reversal step and once while assigning maximum weights, which gives the b term.
Acknowledgement:
I'd like to thank @SerialLazer for the time invested in a discussion about the time complexity of the above algorithm and for helping me figure out the correct bound.

ET-tree in Henzinger and King (1995)

I'm having trouble understanding the following from the beginning of section 2.5 of Henzinger and King (FOCS 1995):
We encode an arbitrary tree T with n vertices using a sequence of 2n - 1 symbols, which is generated as follows: Root the tree at an arbitrary vertex.
Then traverse T in depth-first search order traversing each edge twice (once in each direction) and visiting every degree-d vertex d times, except for the root which is visited d + 1 times. Each time any vertex u is encountered, we call this an occurrence of the vertex. Let ET(T) be the sequence of node occurrences representing an arbitrary tree T.
For each spanning tree T(B) of a block B of H_i each occurrence of ET(T(B)) is stored in a node of a balanced binary search tree, called the ET(T(B))-tree. For each vertex u in T(B), we arbitrarily choose one occurrence to be the active occurrence of u.
With the active occurrence of each vertex u, we keep the (unordered) list of nontree edges in B which are incident to u, stored as a balanced binary tree. Each node in the ET-tree contains the number of nontree edges stored in its subtree.
Using this data structure for each level, we can sample an edge of T_1 in time O(log n).
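For concreteness, here is a small sketch (mine, not from the paper) that generates such an occurrence sequence for a rooted tree:

def euler_tour(children, root):
    # children: dict mapping each vertex to its list of children.
    # Returns the ET(T) occurrence sequence: every vertex of degree d
    # occurs d times, the root d + 1 times, 2n - 1 occurrences in total.
    seq = [root]
    def visit(u):
        for c in children.get(u, []):
            seq.append(c)        # traverse edge uc downwards
            visit(c)
            seq.append(u)        # traverse edge uc back upwards
    visit(root)
    return seq

# Example: a path a - b - c rooted at a gives [a, b, c, b, a].
assert euler_tour({'a': ['b'], 'b': ['c']}, 'a') == ['a', 'b', 'c', 'b', 'a']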
My questions follow, after some explanation of terms.
From section 2.2,
A block is a maximal set of nodes that are biconnected.
H_i is a subgraph of G, whose dynamic biconnectivity (which is much of the topic of the paper) we wish to maintain. (More specifically, the edge set of G has been partitioned into l = log(|E(G)|) sets and each set induces a graph H_i for 1 <= i <= l.)
Here's my understanding. Pick some T(B) of H_i. The key of each active node in the corresponding ET tree is the number of "nontree edges in B [that] are incident to u". The value of each node is the appropriate list of edges.
Questions:
To finalise the ET tree, we have each node store the sum of the keys of all nodes in its subtree (including the node itself)?
The list for each node is not a balanced tree, right?
If only the active nodes matter, why create an ET-tree in the first place?
To sample an edge, we generate a number between 1 and |E(B)|. If the number is between 1 and |V(B)| - 1, then we're choosing some tree edge. Otherwise, we traverse the balanced tree and then select the appropriate edge from a list?

BFS and correctness on the term "VISITED"

mark x as visited
list L = x
tree T = x
while L nonempty
    choose some vertex v from front of list
    process v
    for each unmarked neighbor w
        mark w as visited
        add it to end of list
        add edge vw to T
Most code chooses to mark a node as visited before actually visiting it. Wouldn't it technically also be correct to add all neighbors to the list first and mark them only when they are visited, like this?
list L = x
tree T = x
while L nonempty
    choose some vertex v from front of list
    if v not yet visited
        mark v as visited
        for each unmarked neighbor w
            add it to end of list
            add edge vw to T
Why does every BFS implementation seem to mark nodes as visited before they have actually been visited? I am trying to find a theoretically correct formulation of BFS. Which one is correct?
Both algorithms work, but the second version may add the same node to the list L more than once. This doesn't affect correctness, thanks to the additional visited check, but it increases memory consumption and costs an extra check per dequeued node. That's why you'll typically see the first algorithm in textbooks.
Both are correct, but they use different definitions of the word visited. It is common for algorithms to have many variations and have many different implementations that are all correct, and BFS is one example.
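For reference, here is how the two variants might look in Python (a sketch; the graph is assumed to be an adjacency dict):

from collections import deque

def bfs_mark_on_enqueue(graph, start):
    # Variant 1: mark a node the moment it is enqueued.
    # Every node enters the queue at most once.
    visited = {start}
    queue = deque([start])
    tree = []                          # edges of the BFS tree
    while queue:
        v = queue.popleft()
        # process v here
        for w in graph[v]:
            if w not in visited:
                visited.add(w)         # marked before being processed
                queue.append(w)
                tree.append((v, w))
    return tree

def bfs_mark_on_dequeue(graph, start):
    # Variant 2: mark a node only when it is dequeued ("actually visited").
    # Also correct, but the same node may sit in the queue several times.
    visited = set()
    queue = deque([(None, start)])     # (parent, node) pairs
    tree = []
    while queue:
        parent, v = queue.popleft()
        if v in visited:
            continue                   # duplicate entry, skip it
        visited.add(v)
        # process v here
        if parent is not None:
            tree.append((parent, v))
        for w in graph[v]:
            if w not in visited:
                queue.append((v, w))
    return tree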

Detecting connectedness of nodes over a long time in a graph

I start out with a graph of N nodes with no edges.
Then I proceed to take M predetermined steps.
At each step, I must either create an edge between two nodes, or delete an edge.
After each step, I must print out how many connected components there are in my graph.
Is there an algorithm for solving this problem in time linear with respect to M? If not, is there one better than O(min(M,N) * M) in the worst case?
EDIT:
The program does not get to decide what the M steps are.
I have to read from the input whether I am supposed to create an edge or delete one, and also which edge I am supposed to create/delete.
So example input might be
N = 4
M = 4
JOIN 1 2
JOIN 2 3
DELETE 2 3
DELETE 1 2
Then my output should be
3 # (1 2) 3 4
2 # (1 2 3) 4
3 # (1 2) 3 4
4 # 1 2 3 4
There are ways to solve this problem fully online, but they're more complicated than this answer. The algorithm that I'm proposing is to maintain a spanning forest of the available edges, together with the number of components of the spanning forest (and hence the graph). If we were attacking this problem fully online, then this would be problematic, since a spanning forest edge might get deleted, leaving us to paw through the unused edges for a replacement. We know, however, how soon each edge currently in the graph will be deleted.
The particular spanning forest that we maintain is a maximum-weight spanning forest, where the weight of each edge is its deletion time. If an edge belonging to this spanning forest is deleted, then there is no replacement, since every other edge connecting the components represented by its endpoints either hasn't been inserted yet or, having lesser weight, has already been deleted.
There's a dynamic tree data structure, also referred to as a link/cut tree, due to Sleator and Tarjan, that can be made to provide the following operations in logarithmic time.
Link(u, v, w) - inserts an edge between u and v with weight w;
    u and v must not be connected already.
Cut(u, v) - cuts the edge between u and v, if it exists;
    returns a boolean indicating whether an edge was removed.
FindMin(u, v) - finds the minimum-weight edge on the path from u to v
    and returns its endpoints and weight;
    returns null if either u = v or u and v are not connected.
To maintain the forest, when an edge from u to v is inserted, compare its removal time to the minimum on the path from u to v. If the minimum does not exist, then insert the edge. If the minimum is less than the new edge, delete the minimum and replace it with the new edge. Otherwise, do nothing. When an edge from u to v is deleted, attempt to delete it from the forest.
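Here is a sketch of that maintenance logic in Python, assuming a hypothetical LinkCutTree class with exactly the three operations above (a simple graph is also assumed, so an edge can be identified by its endpoints):

class OfflineConnectivity:
    def __init__(self, n):
        self.forest = LinkCutTree(n)   # hypothetical dynamic-tree structure
        self.in_forest = set()         # endpoint pairs of current forest edges
        self.components = n

    def insert(self, u, v, deletion_time):
        # The edge's weight is its (known-in-advance) deletion time.
        found = self.forest.find_min(u, v)
        if found is None:                        # u and v not yet connected
            self.forest.link(u, v, deletion_time)
            self.in_forest.add(frozenset((u, v)))
            self.components -= 1
        else:
            a, b, w = found
            if w < deletion_time:                # new edge outlives the path minimum
                self.forest.cut(a, b)
                self.in_forest.remove(frozenset((a, b)))
                self.forest.link(u, v, deletion_time)
                self.in_forest.add(frozenset((u, v)))
            # otherwise the new edge can never be useful: drop it

    def delete(self, u, v):
        if frozenset((u, v)) in self.in_forest:  # only forest edges matter
            self.forest.cut(u, v)
            self.in_forest.remove(frozenset((u, v)))
            self.components += 1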
The running time of this approach is O(m log n). If you don't have a dynamic tree handy, then it admittedly will take quite a while to implement. Instead of using a proper dynamic tree, I've had success with a much simpler data structure that just stores the forest as a bunch of nodes with weights and parent pointers. The running time then is O(m d) where d is the maximum diameter of the graph, and if you're lucky, d is quite a lot less than n.

Chaitin-Briggs Algorithm explanation

I googled it but haven't found any good material on this topic.
Where can I find more information about Chaitin-Briggs graph-coloring algorithm? Or can somebody explain how it works?
The key insight of Chaitin's algorithm is called the degree < R rule, which is as follows.
Given a graph G which contains a node N with degree less than R, G is R-colorable iff the graph G', where G' is G with node N removed, is R-colorable. The proof is obvious in one direction: if the graph G can be colored with R colors, then a coloring of G' is obtained by simply dropping N. In the other direction, suppose we have an R-coloring of G'. Since N has degree less than R, there must be at least one color that is not in use by a node adjacent to N, and we can color N with this color.
The algorithm is as follows:
While G cannot be R-colored
    While graph G has a node N with degree less than R
        Remove N and its associated edges from G and push N on a stack S
    End While
    If the entire graph has been removed then the graph is R-colorable
        While stack S contains a node N
            Add N to graph G and assign it a color from the R colors
        End While
    Else graph G cannot be colored with R colors
        Simplify the graph G by choosing an object to spill and remove its node N from G
        (spill nodes are chosen based on the object's number of definitions and references)
End While
The complexity of the Chaitin-Briggs algorithm is O(n^2) because of the problem of spilling. A graph G will only fail to be R-colorable if at some point the reduced graph G' only has nodes of degree R or greater. When a graph is easily R-colorable, the cost of a single iteration is O(n), because we make two trips through the graph and either remove or add one node each time. But spilling brings in additional complexity, because we may need to spill an arbitrary number of nodes before G becomes R-colorable, and for every node we spill we make another trip through the linear algorithm.
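A compact Python sketch of the simplify/select loop (illustrative; the spill heuristic below is a simplistic stand-in for the definition/reference counts mentioned above):

def chaitin_color(graph, R):
    # graph: dict mapping each node to the set of its neighbors.
    # Returns (coloring, spilled): colors 0..R-1 for colored nodes,
    # plus the list of nodes chosen for spilling.
    g = {n: set(nbrs) for n, nbrs in graph.items()}
    stack, spilled = [], []

    def remove(n):
        for m in g[n]:
            g[m].discard(n)
        del g[n]

    while g:
        # Simplify: remove some node of degree < R, if one exists.
        n = next((m for m in g if len(g[m]) < R), None)
        if n is not None:
            stack.append(n)
            remove(n)
        else:
            # Everything left has degree >= R: pick a spill candidate.
            n = max(g, key=lambda m: len(g[m]))
            spilled.append(n)
            remove(n)

    # Select: pop nodes and give each the lowest color unused by neighbors.
    # By the degree < R rule, a free color always exists for stack nodes.
    coloring = {}
    for n in reversed(stack):
        used = {coloring[m] for m in graph[n] if m in coloring}
        coloring[n] = next(c for c in range(R) if c not in used)
    return coloring, spilled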
You can also go through this Register allocation algorithm
