Algorithm for "balanced" breadth-first search

I'm looking for references for an algorithm that conducts a breadth-first tree search in a balanced manner that is resilient in a situation where
we expect most nodes to have few children, but
a few nodes may have many (possibly infinitely many) children.
Consider this simple tree (modified from this post):
        A
       / \
      B   C
     /   / \
    D   E   F
    |  /|\   \
    G H I J... K
Depth-first visits nodes in this order:
A B D G C E H I J ... F K
Breadth-first visits nodes in this order:
A B C D E F G H I J ... K
The balanced breadth-first algorithm that I have in mind would visit nodes in this order:
A B C D E G F H K I J ...
Note how
we visit G before F, and
we visit K after H but before I.
G is deeper than F, but it is an only child of B whereas F is a second child of C and must share its search priority with E. Similarly between K and the many children H, I, J, ... of E.
I call this "balanced" because a node with lots of children cannot choke the algorithm. Concretely, if E has 𝜔 (infinitely) many children then a pure breadth-first strategy would never reach K, whereas the "balanced" algorithm would still reach K after H but before the other children of E.
(The reader who does not like 𝜔 can attain a similar effect with a large but still finite number such as "the greatest number of steps any practical search algorithm will ever make, plus 1".)
I can only imagine that this style of search or something like it must have been the subject of much research and practical application. I would be grateful to be pointed in the right direction. Thank you.

Transform your tree to a different kind of representation. In this new representation, each node has at most two links: one to its leftmost child, and one to its right sibling.
        A
       / \
      B   C
     /   / \
    D   E   F
    |  /|\   \
    G H I J... K
  ⇓
A
|
B --> C
|     |
D     E --> F
|     |     |
G     |     K
      |
      H --> I --> J --> ...
Then treat this representation as a normal binary tree, and traverse it breadth-first. It may have infinite height, but as with any binary tree, the width at any particular level is finite, so every node sits at a finite depth and is reached after finitely many steps.
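As a concrete illustration, here is a minimal Python sketch of this idea (the Node class and all names are illustrative, not from the answer). Rather than building the second tree explicitly, it traverses the left-child/right-sibling links on the fly, so each dequeued node enqueues at most two successors:

    from collections import deque

    class Node:
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

    def balanced_bfs(root):
        """Breadth-first traversal of the left-child/right-sibling view.

        Each queue entry carries a node plus an iterator over its remaining
        siblings, so a node with very many children never floods the queue:
        only its first child and its next sibling are enqueued.
        """
        queue = deque([(root, iter(()))])
        while queue:
            node, siblings = queue.popleft()
            yield node.label
            children = iter(node.children)
            first = next(children, None)
            if first is not None:           # "left child" link
                queue.append((first, children))
            nxt = next(siblings, None)
            if nxt is not None:             # "right sibling" link
                queue.append((nxt, siblings))

    # The tree from the question:
    G = Node("G"); D = Node("D", [G])
    H, I, J, K = Node("H"), Node("I"), Node("J"), Node("K")
    E = Node("E", [H, I, J]); F = Node("F", [K])
    A = Node("A", [Node("B", [D]), Node("C", [E, F])])

    print(" ".join(balanced_bfs(A)))  # A B D C G E H F I K J

The resulting order, A B D C G E H F I K J, is not exactly the one in the question, but it has the same balancing property: G is visited before F, and K is reached after only finitely many of E's children.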

Depth-first search, breadth-first search, A* search (and others) differ only in how you handle the list of "nodes still to visit".
(I assume you always process the node at the front of the list next.)
Depth-first search: prepend new nodes to the front of the list.
Breadth-first search: append new nodes to the end of the list.
A* search: insert new nodes into the list ordered by cost + heuristic.
So you need to formalize how to insert new nodes into the list to fulfill your requirements.
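For instance, here is a Python sketch of that recipe with one possible insertion rule, my own choice rather than anything prescribed above: a child's priority is its parent's priority plus one plus its sibling index, so later siblings are penalized the same way deeper nodes are.

    import heapq
    from itertools import count

    def balanced_search(root, children_of):
        tie = count()                       # tie-breaker; also avoids comparing nodes
        heap = [(0, next(tie), root)]       # entries: (priority, tie, node)
        while heap:
            prio, _, node = heapq.heappop(heap)
            yield node
            # Going deeper costs +1; each later sibling costs +1 more.
            # (With infinitely many children you would enqueue them lazily.)
            for i, child in enumerate(children_of(node)):
                heapq.heappush(heap, (prio + 1 + i, next(tie), child))

    tree = {"A": ["B", "C"], "B": ["D"], "C": ["E", "F"],
            "D": ["G"], "E": ["H", "I", "J"], "F": ["K"]}
    print(list(balanced_search("A", lambda n: tree.get(n, []))))
    # ['A', 'B', 'C', 'D', 'E', 'G', 'F', 'H', 'I', 'K', 'J']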

Related

Can one extract the in-order given pre- and post-order for a binary search tree over < with only O(n) <-comparisons?

Assume one has a binary search tree B, which is not necessarily balanced, over some domain D with the strict order relation < and n elements.
Given B's extracted post-order R and pre-order T:
Is it possible to compute B's in-order S in O(n) without access to <?
Is it possible to compute B's in-order S using only O(n) comparisons with <?
Furthermore, is it possible to compute S in O(n) total?
Note: This is a re-post of a now-deleted unanswered question.
1. In-order without relation <
This is impossible as explained in Why it is impossible to construct Binary Tree with Pre-Order, Post Order and Level Order traversals given?. In short, the input R = cba, T = abc is ambiguous and could stem from both these trees:
  a          a
 /            \
b              b
 \            /
  c          c
S = bca      S = acb
2. In-order in O(n) comparisons
Using the < relation, one can suddenly differentiate trees like the above although they produce the same post-order R and pre-order T. Given:
R = Ca
with C some arbitrary non-empty range of a's children nodes, C = u...v (i.e. the range starts with u and ends with v), one can infer the following:
(1) a < u -> a has 1 direct child (to its right) -> all children are greater than a
(2) v < a -> a has 1 direct child (to its left) -> all children are less than a
(3) otherwise -> a has 2 direct children
Recursion in (1) and (2) is trivial, and we have spent only O(1) comparisons with <. In case of (3) we have something of the form:
R = XbYca
T = abXcY
where X and Y are arbitrary sequences. We can split this into the recursion steps:
R = XbYca
T = abXcY
    /    \
R = Xb    R = Yc
T = bX    T = cY
Note that this requires no comparisons with <, but it requires splitting both ranges. Since X and Y need not be the same length, finding the splitting point requires us to find b in R, which can be done in O(n) for each level of the recursion tree. So in total we need O(n*d) equality comparisons, where d is the depth of the original binary search tree B (and of the recursion tree, which mirrors B).
In each recursion step we use at most 2 < comparisons and peel one element (the root of the current range) off the range, hence we cannot use more than 2*n < comparisons (which is in O(n)).
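Here is a small Python sketch of cases (1)-(3), with R the post-order and T the pre-order as above, assuming distinct keys. For simplicity it locates the split point with list.index, the linear scan mentioned above, rather than the lookup table of part 3, so the equality work is O(n*d) while the < comparisons stay O(n):

    def in_order(R, T):
        """Reconstruct the in-order S of a BST from post-order R and pre-order T."""
        if not R:
            return []
        a = R[-1]                    # the root is last in the post-order
        C = R[:-1]                   # post-order of a's children, C = u...v
        if not C:
            return [a]
        u, v = C[0], C[-1]
        if a < u:                    # case (1): one direct child, to the right
            return [a] + in_order(C, T[1:])
        if v < a:                    # case (2): one direct child, to the left
            return in_order(C, T[1:]) + [a]
        # case (3): two children; R = XbYca, T = abXcY with b the left root.
        b = T[1]
        i = C.index(b)               # linear scan for the split point
        return (in_order(C[:i + 1], T[1 : i + 2])
                + [a]
                + in_order(C[i + 1:], T[i + 2:]))

    # The balanced BST over 1..7: post-order R, pre-order T
    print(in_order([1, 3, 2, 5, 7, 6, 4], [4, 2, 1, 3, 6, 5, 7]))
    # [1, 2, 3, 4, 5, 6, 7]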
3. In-order in O(n) total
In the algorithm given above the problem is that finding the point where to split the range containing all children cannot be done better than in linear time if one cannot afford a lookup table for all elements.
But, if the universe over which B is defined is small enough to be able to create an index table for all entries, one can pre-parse R (e.g. R = xbyca) in O(n) and create a table like this:
a -> 4
b -> 1
c -> 3
d -> N/A
e -> N/A
....
x -> 0
y -> 2
z -> N/A
Only if this is feasible can one achieve overall O(n) with the algorithm described in 2. The table consumes O(2^D) space (with D the number of bits needed to represent an element of the universe).
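In Python the table is a one-pass dict comprehension (a hash map, so expected-time bounds; the answer's array over the whole universe gives worst-case O(1) lookups at the cost of the O(2^D) space):

    def build_index(R):
        # position of every element of R, built in one O(n) pass
        return {key: i for i, key in enumerate(R)}

    print(build_index("xbyca"))  # {'x': 0, 'b': 1, 'y': 2, 'c': 3, 'a': 4}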
Is it otherwise possible to produce the in-order S in O(n)?
I don't have a proof for this, but I do not think it is possible. The rationale is that the problem is too similar to comparison sorting, which cannot be done faster than linearithmic time.
In "2." we get away with linear comparisons because we can exploit the structure of the input in conjunction with a lot of equality checks to partially reconstruct the original structure of the binary search tree at least locally. However, I don't see how the size of each sub-tree can be extracted in less than linear time.

Why, when we change the cost of every edge in G to c' = log_17(c), is every MST in G still an MST in G' (and vice versa)?

Remarks: c' is log c with base 17; MST means minimum spanning tree.
It's easy to prove that the conclusion is correct when we use a linear function to transform the cost of every edge. But the log function is not a linear function, so I could not understand why this conclusion is correct.
Supplementary notes:
I did not consider specific algorithms, such as the greedy algorithm. I simply considered the relationship between the sums of the weights of the two trees after the transformation.
Numerically, if (a + b) > (c + d), then (log a + log b) may not be > (log c + log d).
If one spanning tree of G has two edges a and b, another spanning tree of G has edges c and d, a + b < c + d, and the first tree is an MST, then in the transformed graph G' the sum of the weights of the edges of the second tree may be smaller.
Because of this, I wanted to construct a counterexample based on "if (a + b) > (c + d), (log a + log b) may not be > (log c + log d)", but I failed.
One way to characterize when a spanning tree T is a minimum spanning tree is that, for every edge e not in T, the cycle formed by e and edges of T (the fundamental cycle of e with respect to T) has no edge more expensive than e. Using this characterization, I hope you see how to prove that transforming the costs with any increasing function preserves minimum spanning trees.
There's a one line proof that this condition is necessary. If the fundamental cycle contained a more expensive edge, we could replace it with e and get a spanning tree that costs less than T.
It's less obvious that this condition is sufficient, since at first glance it looks like we're trying to prove global optimality from a local optimality condition. To prove this statement, let T be a spanning tree that satisfies the condition, let T' be a minimum spanning tree, and let G' be the graph whose edges are the union of the edges of T and T'. Run Kruskal's algorithm on G', breaking ties by favoring edges in T over edges not in T. Let T'' be the resulting minimum spanning tree of G'. Since T' is a spanning tree in G', the cost of T'' is not greater than that of T', hence T'' is a minimum spanning tree in G as well as in G'.
Suppose to the contrary that T'' ≠ T. Then there exists an edge in T but not in T''. Let e be the first such edge considered by Kruskal's algorithm. At the time e was considered, it formed a cycle C with the edges that had already been selected (all of which belong to T''). Since T is acyclic, C \ T is nonempty. By the tie-breaking criterion, we know that every edge in C \ T costs less than e. Observing that some edge e' in C \ T must have one endpoint in each of the two connected components of T \ {e}, we infer that the fundamental cycle of e' with respect to T contains e, which violates the local optimality condition. In conclusion, T = T'', hence T is a minimum spanning tree in G.
If you want a deeper dive, this logic gets abstracted out in the theory of matroids.
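To see the invariance concretely, here is a toy Kruskal check in Python (an illustrative graph with distinct costs, so the MST is unique): the chosen edge set is unchanged when every cost is replaced by log_17(c), or by any other increasing function.

    import math

    def kruskal(n, edges):
        """edges: list of (cost, u, v); returns the set of chosen edges."""
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        chosen = set()
        for cost, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:                        # no cycle: take the edge
                parent[ru] = rv
                chosen.add((u, v))
        return chosen

    edges = [(4, 0, 1), (8, 0, 2), (2, 1, 2), (7, 1, 3), (9, 2, 3)]
    log_edges = [(math.log(c, 17), u, v) for c, u, v in edges]
    assert kruskal(4, edges) == kruskal(4, log_edges)   # same MST edge set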
Well, it's pretty easy to understand... let's see if I can break it down for you:
c' = log_17(c) // here 17 is the base
log may not be a linear function... but we can say that:
log_b(x) > log_b(y) if x > y and b > 1 (and of course x > 0 and y > 0)
I hope you get the equation I've written... In words it means: consider a base b such that b > 1; then log_b(x) is greater than log_b(y) if x > y.
So, if we apply this rule to the costs of the MST of G, then we see that the edges that were selected for the MST of G would still be the cheapest choices for constructing the MST of G' when c' = log_17(c) // here 17 is the base.
UPDATE: As I can see you have a problem understanding the proof, I'm elaborating a bit:
I guess you know MST construction is greedy. We're going to use Kruskal's algorithm to see why it is correct. (In case you don't know how Kruskal's algorithm works, you can read about it somewhere, or just google it; you'll find millions of resources.) Now, let me write some steps of Kruskal's edge selection for the MST of G:
// the following edges are sorted by cost, i.e. c_0 <= c_1 <= c_2 ....
c_0: A, F // here, edge c_0 connects A, F; we have to take the edge into the MST
c_1: A, B // it is also taken to construct the MST
c_2: B, R // it is also taken to construct the MST
c_3: A, R // we won't take it to construct the MST, because (A, R) are already connected through A -> B -> R
c_4: F, X // it is also taken to construct the MST
...
...
and so on...
Now, when constructing the MST of G', we have to select edges whose costs have the form c' = log_17(c) // where 17 is the base
Now, if we convert the edge costs using log base 17, then c_0 becomes c_0', c_1 becomes c_1', and so on...
But we know that:
log_b(x) > log_b(y) if x > y and b > 1 (and of course x > 0 and y > 0)
So, we may say that
log_17(c_0) <= log_17(c_1), because c_0 <= c_1
and in general,
log_17(c_i) <= log_17(c_j), where i <= j
And now, we may say:
c_0' <= c_1' <= c_2' <= c_3' <= ....
So, the edge selection process to construct the MST of G' would be:
// the following edges are sorted by cost, i.e. c_0' <= c_1' <= c_2' ....
c_0': A, F // here, edge c_0' connects A, F; we have to take the edge into the MST
c_1': A, B // it is also taken to construct the MST
c_2': B, R // it is also taken to construct the MST
c_3': A, R // we won't take it to construct the MST, because (A, R) are already connected through A -> B -> R
c_4': F, X // it is also taken to construct the MST
...
...
and so on...
Which is the same as the MST of G...
That ultimately proves the theorem...
I hope you get it... if not, ask me in the comments what is not clear to you...
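A quick numeric sanity check of the order-preservation step, with illustrative values: sorting by c and by log_17(c) yields the same permutation, since log with base > 1 is strictly increasing.

    import math

    costs = [4, 8, 2, 7, 9]
    by_c   = sorted(range(len(costs)), key=lambda i: costs[i])
    by_log = sorted(range(len(costs)), key=lambda i: math.log(costs[i], 17))
    assert by_c == by_log   # [2, 0, 3, 1, 4] in both cases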

How to represent a dependency graph with alternative paths

I'm having some trouble trying to represent and manipulate dependency graphs in this scenario:
a node has some dependencies that have to be solved
every path must not have dependency loops (as in a DAG)
every dependency could be solved by more than one other node
I start from the target node and recursively look for its dependencies, but I have to maintain the above properties, in particular the third one.
Just a little example here:
I would like to have a graph like the following one
        (A)
       /   \
      /     \
     /       \
[(B),(C),(D)] (E)
  / \     \
 /   \    (H)
(F)  (G)
which means:
F, G, C, H, E have no dependencies
D depends on H
B depends on F and G
A depends on E, and on B or C or D
So, if I write down all the possible topologically-sorted paths to A, I should have:
E -> F -> G -> B -> A
E -> C -> A
E -> H -> D -> A
How can I model a graph with these properties? Which kind of data structure is the more suitable to do that?
You should use a normal adjacency list, with an additional property: a node knows the other nodes that would also satisfy the same dependency. This means that B, C, D should all know that they belong to the same equivalence class. You can achieve this by inserting them all into a set.
Node:
    List<Node> adjacencyList
    Set<Node> equivalentDependencies
To use this data-structure in a topo-sort, whenever you remove a source, and remove all its outgoing edges, also remove the nodes in its equivalency class, their outgoing edges, and recursively remove the nodes that point to them.
From Wikipedia:
L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edges
while S is non-empty do
    remove a node n from S
    add n to tail of L
    for each node j in the equivalency class of n do    <=== removing equivalent dependencies
        remove j from S
        for each node k with an edge e from j to k do
            remove edge e from the graph
            if k has no other incoming edges then
                insert k into S
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    return error (graph has at least one cycle)
else
    return L (a topologically sorted order)
This algorithm will give you one of the modified topologically-sorted paths.
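Here is a literal Python sketch of that modified pseudocode (the function and the equiv_class mapping are my own names). Like the pseudocode, it discards a node's alternatives when the node is emitted, but it does not do the recursive backward removal of their now-useless dependencies mentioned in the prose above:

    def kahn_with_equivalents(nodes, edges, equiv_class):
        """Kahn's algorithm where emitting a node discards its alternatives.

        edges: (u, v) pairs meaning u must come before v.
        equiv_class[n]: set of nodes satisfying the same dependency as n
        (n itself included); plain nodes may be absent from the mapping.
        """
        out = {n: set() for n in nodes}
        indeg = {n: 0 for n in nodes}
        for u, v in edges:
            out[u].add(v)
            indeg[v] += 1
        S = {n for n in nodes if indeg[n] == 0}
        done = set()                       # emitted or discarded nodes
        L = []
        def remove_out_edges(j):
            for k in out.pop(j, set()):
                indeg[k] -= 1
                if indeg[k] == 0 and k not in done:
                    S.add(k)
        while S:
            n = S.pop()
            L.append(n)
            done.add(n)
            remove_out_edges(n)
            for j in equiv_class.get(n, {n}) - {n}:   # drop the alternatives
                S.discard(j)
                done.add(j)
                remove_out_edges(j)
        if any(out.values()):
            raise ValueError("graph has at least one cycle")
        return L

    nodes = list("ABCDEFGH")
    edges = [("E", "A"), ("B", "A"), ("C", "A"), ("D", "A"),
             ("F", "B"), ("G", "B"), ("H", "D")]
    alt = {x: set("BCD") for x in "BCD"}
    print(kahn_with_equivalents(nodes, edges, alt))  # e.g. ['E', 'H', 'F', 'G', 'C', 'A']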

Transitive set merging

Is there a well-known algorithm that, given a collection of sets, would merge together every two sets that have at least one common element? So for example, given the following input:
A B C
B D
A E
F G H
I J
K F
L M N
E O
It would produce:
A B C D E O
F G H K
I J
L M N
I already have a working implementation, but it seems to be common enough that there has to be a name for what I am doing.
You can model this as a simple graph problem: Introduce a node for every distinct element. Introduce a node for every set. Connect every set to the elements it contains. You get an (undirected) bipartite graph, in which the connected components are the solution of your problem. You can use depth-first search to find the CCs.
The runtime should be linear (with hash tables, so only expected runtime unless your numbers are bounded).
I don't think it deserves a special name, it's just an application of well-known concepts.
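A Python sketch of exactly that reduction, with expected linear time via hashing: build the element-to-set adjacency and collect connected components with an iterative depth-first search. (The same merging is also often done with a disjoint-set/union-find structure.)

    from collections import defaultdict

    def merge_overlapping(sets):
        holder = defaultdict(list)          # element -> indices of sets containing it
        for i, s in enumerate(sets):
            for x in s:
                holder[x].append(i)
        seen_sets, seen_elems, result = set(), set(), []
        for start in range(len(sets)):
            if start in seen_sets:
                continue
            seen_sets.add(start)
            component, stack = set(), [("set", start)]
            while stack:                    # iterative DFS over the bipartite graph
                kind, v = stack.pop()
                if kind == "set":
                    for x in sets[v] - seen_elems:
                        seen_elems.add(x)
                        component.add(x)
                        stack.append(("elem", x))
                else:
                    for i in holder[v]:
                        if i not in seen_sets:
                            seen_sets.add(i)
                            stack.append(("set", i))
            result.append(component)
        return result

    print(merge_overlapping([{"A","B","C"}, {"B","D"}, {"A","E"}, {"F","G","H"},
                             {"I","J"}, {"K","F"}, {"L","M","N"}, {"E","O"}]))
    # components (element order may vary):
    # [{'A','B','C','D','E','O'}, {'F','G','H','K'}, {'I','J'}, {'L','M','N'}]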

Select the subtrees containing exactly K leaves

I'm given a tree T which has n nodes and l leaves.
I have to select some subtrees which together contain exactly k (<= l) leaves. If I select the subtree of one of node t's ancestors, I cannot also select t's subtree.
For example:
This is the tree T, which has 13 nodes (7 leaves).
If I want to select k = 4 leaves, I can select nodes 4 and 6 (or nodes 2 and 5). This is the minimum number of selections. (We could select nodes 6, 7, 8, 9 as well, but that is not the minimum.)
If I want to select k = 5 leaves, I can select node 3, and this is the minimum number of selections.
I want to select the minimum number of subtrees. I can only find O(nk^2) and O(nk) algorithms, which use BFS and dynamic programming. Is there any better solution for this?
Thanks :)
Actually, to know the number of leaves of each subtree you just need to go through each node once, so the complexity should be O(nm), where m is the mean number of children of each node; in most cases this evaluates to O(n) because m is just a constant. To do this, you should:
Find which nodes of your tree are leaves
Go up the tree, saving for each node the number of leaves in its subtree
You can do this by starting with the leaves and putting their parents inside a queue. When you pop a node n_i out of the queue, sum the number of leaves contained in the subtrees rooted at each of n_i's children. Once you're done, mark n_i as visited (so you don't process it multiple times, since it can be added once per child).
This gives something like this:
^            f (3)          This node last
|           /     \
|          /       \
|       d (2)     e (1)     These nodes second
|       /   \      /
|    a (1) b (1) c (1)      These nodes first
The steps would be:
Find leaves `a`, `b` and `c`.
For each leaf, add its parent to the queue # queue q = (d, d, e)
Pop d # queue q = (d, e)
Count leaves in subtree: d.leaves = a.leaves + b.leaves
Mark d as visited
Add parent to queue # queue q = (d, e, f)
Pop d # queue q = (e, f)
d is visited, do nothing
Pop e # queue q = (f)
Count leaves in subtree: e.leaves = c.leaves
Mark e as visited
Add parent to queue # queue q = (f, f)
Pop f # queue q = (f)
Count leaves in subtree: f.leaves = d.leaves + e.leaves
Mark f as visited
Add parent to queue (none)
Pop f # queue q = ()
f is visited, do nothing
You can also use a smart data structure that will ignore nodes added twice. Note that you can't use an ordered set because it is very important that you explore "lower" nodes before "higher" nodes.
In your case, you can eliminate nodes from your queue if they have more than k leaves, and return each node you find that has exactly k leaves, which will give an even faster algorithm.
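A Python sketch of this bottom-up pass (the children mapping is just an illustrative representation). Instead of the visited set with repeated pops, it keeps a counter of each node's not-yet-counted children and enqueues a parent only once, when its counter reaches zero, which enforces the same lower-before-higher order:

    from collections import deque

    def count_leaves(children):
        """children: dict mapping each node to the list of its children.
        Returns {node: number of leaves in that node's subtree}."""
        parent = {c: u for u, ch in children.items() for c in ch}
        pending = {u: len(ch) for u, ch in children.items()}  # children not yet counted
        leaves = {}
        q = deque()
        for u, ch in children.items():
            if not ch:                      # a leaf counts itself
                leaves[u] = 1
                q.append(u)
        while q:
            n = q.popleft()
            p = parent.get(n)
            if p is None:                   # n is the root
                continue
            leaves[p] = leaves.get(p, 0) + leaves[n]
            pending[p] -= 1
            if pending[p] == 0:             # all of p's children are counted
                q.append(p)
        return leaves

    tree = {"f": ["d", "e"], "d": ["a", "b"], "e": ["c"],
            "a": [], "b": [], "c": []}
    print(count_leaves(tree))  # {'a': 1, 'b': 1, 'c': 1, 'd': 2, 'e': 1, 'f': 3}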
