Finding number of nodes within a certain distance in a rooted tree - algorithm

In a rooted and weighted tree, how can you find the number of nodes within a certain distance from each node? Only down edges need to be considered, i.e. edges going down from the root. Keep in mind each edge has a weight.
I can do this in O(N^2) time using a DFS from each node and keeping track of the distance traveled, but with N >= 100000 it's a bit slow. I'm pretty sure you could easily solve it with unweighted edges using DP, but does anyone know how to solve this one quickly? (Less than N^2.)

It's possible to improve my previous answer to O(n log d) time and O(n) space by making use of the following observation:
The number of sufficiently-close nodes at a given node v is the sum of the numbers of sufficiently-close nodes of each of its children, less the number of nodes that have just become insufficiently-close.
Let's call the distance threshold m, and the distance on the edge between two adjacent nodes u and v d(u, v).
Every node has a single ancestor that is the first one too far away to count it
For each node v, we will maintain a count, c(v), that is initially 0.
For any node v, consider the chain of ancestors from v's parent up to the root. Call the ith node in this chain a(v, i). Notice that v needs to be counted as sufficiently close in some number i >= 0 of the first nodes in this chain, and in no other nodes. If we are able to quickly find i, then we can simply decrement c(a(v, i+1)) (bringing it (possibly further) below 0), so that when the counts of a(v, i+1)'s children are added to it in a later pass, v is correctly excluded from being counted. Provided we calculate fully accurate counts for all children of a node v before adding them to c(v), any such exclusions are correctly "propagated" to parent counts.
The tricky part is finding i efficiently. Call the sum of the distances of the first j >= 0 edges on the path from v to the root s(v, j), and call the list of all depth(v)+1 of these path lengths, listed in increasing order, s(v). What we want to do is binary-search the list of path lengths s(v) for the first entry greater than the threshold m: this would find i+1 in log(d) time. The problem is constructing s(v). We could easily build it using a running total from v up to the root -- but that would require O(d) time per node, nullifying any time improvement. We need a way to construct s(v) from s(parent(v)) in constant time, but the problem is that as we recurse from a node v to its child u, the path lengths grow "the wrong way": every path length x needs to become x + d(u, v), and a new path length of 0 needs to be added at the beginning. This appears to require O(d) updates, but a trick gets around the problem...
Finding i quickly
The solution is to calculate, at each node v, the total path length t(v) of all edges on the path from v to the root. This is easily done in constant time per node: t(v) = t(parent(v)) + d(v, parent(v)). We can then form s(v) by prepending -t(v) to the beginning of s(parent(v)), and when performing the binary search, consider each element s(v, j) to represent s(v, j) + t(v) (or equivalently, binary search for m - t(v) instead of m). The insertion of -t(v) at the start can be achieved in O(1) time by having a child u of a node v share v's path length array, with s(u) considered to begin one memory location before s(v). All path length arrays are "right-justified" inside a single memory buffer of size d+1 -- specifically, nodes at depth k will have their path length array begin at offset d-k inside the buffer to allow room for their descendant nodes to prepend entries. The array sharing means that sibling nodes will overwrite each other's path lengths, but this is not a problem: we only need the values in s(v) to remain valid while v and v's descendants are processed in the preorder DFS.
In this way we gain the effect of O(d) path length increases in O(1) time. Thus the total time required to find i at a given node is O(1) (to build s(v)) plus O(log d) (to find i using the modified binary search) = O(log d). A single preorder DFS pass is used to find and decrement the appropriate ancestor's count for each node; a postorder DFS pass then sums child counts into parent counts. These two passes can be combined into a single pass over the nodes that performs operations both before and after recursing.
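To make the bookkeeping concrete, here is a minimal C++ sketch of this scheme (my illustration, not code from the answer): a preorder pass maintains the shared right-justified buffer of -t values and the node occupying each depth on the current root-to-node path, binary-searches for the first too-distant ancestor, and a reverse-preorder pass then sums child counts into parents. The input format (node 0 is the root; one "parent weight" line per other node) is an assumption made for the example.

    #include <algorithm>
    #include <iostream>
    #include <vector>
    using namespace std;

    int main() {
        int n; long long m;                       // number of nodes, distance threshold
        cin >> n >> m;
        vector<int> parent(n, -1), depth(n, 0);
        vector<long long> w(n, 0), t(n, 0);       // edge weight to parent, distance to root
        vector<vector<int>> kids(n);
        for (int v = 1; v < n; ++v) {
            cin >> parent[v] >> w[v];
            kids[parent[v]].push_back(v);
        }

        // Preorder over the tree (iterative, to avoid deep recursion), computing
        // depth and t on the way down.
        vector<int> order; order.reserve(n);
        vector<int> stk = {0};
        while (!stk.empty()) {
            int v = stk.back(); stk.pop_back();
            order.push_back(v);
            for (int u : kids[v]) {
                depth[u] = depth[v] + 1;
                t[u] = t[v] + w[u];
                stk.push_back(u);
            }
        }
        int d = *max_element(depth.begin(), depth.end());

        // Preorder pass: shared right-justified buffer of -t values, plus the node
        // currently occupying each depth on the root-to-v path.
        vector<long long> buf(d + 1, 0);
        vector<int> pathNode(d + 1, 0);
        vector<long long> c(n, 0);                // the counts c(v), initially 0
        for (int v : order) {
            buf[d - depth[v]] = -t[v];
            pathNode[depth[v]] = v;
            if (depth[v] == 0) continue;
            // First ancestor entry with stored value > m - t(v) is the first
            // ancestor that is too far from v, i.e. a(v, i+1).
            auto it = upper_bound(buf.begin() + (d - depth[v] + 1),
                                  buf.begin() + d + 1, m - t[v]);
            if (it != buf.begin() + d + 1) {
                int idx = int(it - buf.begin());
                c[pathNode[d - idx]] -= 1;        // stop counting v from here upward
            }
        }

        // Postorder pass (reverse preorder): sum child counts into parent counts.
        vector<long long> ans(n);
        for (int v = 0; v < n; ++v) ans[v] = 1 + c[v];
        for (int i = n - 1; i >= 1; --i) ans[parent[order[i]]] += ans[order[i]];
        // ans[v] counts v itself; subtract 1 if only proper descendants are wanted.
        for (int v = 0; v < n; ++v) cout << ans[v] << "\n";
    }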

[EDIT: Please see my other answer for an even more efficient O(n log d) solution :) ]
Here's a simple O(nd)-time, O(n)-space algorithm, where d is the maximum depth of any node in the tree. A complete tree (a tree in which every node has the same number of children) with n nodes has depth d = O(log n), so this should be much faster than your O(n^2) DFS-based approach in most cases, though if the number of sufficiently-close descendants per node is small (i.e. if DFS only traverses a small number of levels) then your algorithm should not be too bad either.
For any node v, consider the chain of ancestors from v's parent up to the root. Notice that v needs to be counted as sufficiently close in some number i >= 0 of the first nodes in this chain, and in no other nodes. So all we need to do is for each node, climb upwards towards the root until such time as the total path length exceeds the threshold distance m, incrementing the count at each ancestor as we go. There are n nodes, and for each node there are at most d ancestors, so this algorithm is trivially O(nd).
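A short sketch of this climb-to-the-root pass, using the same assumed input format as the sketch under the other answer (node 0 is the root, one "parent weight" line per other node); here each node's count covers its proper descendants only.

    #include <iostream>
    #include <vector>
    using namespace std;

    int main() {
        int n; long long m;
        cin >> n >> m;
        vector<int> parent(n, -1);
        vector<long long> w(n, 0);                // weight of the edge up to the parent
        for (int v = 1; v < n; ++v) cin >> parent[v] >> w[v];

        vector<long long> cnt(n, 0);              // sufficiently-close proper descendants
        for (int v = 1; v < n; ++v) {
            long long dist = 0;
            for (int a = v; parent[a] != -1; a = parent[a]) {
                dist += w[a];                     // walk one edge up the ancestor chain
                if (dist > m) break;              // weights are non-negative, so stop here
                cnt[parent[a]] += 1;              // v is within m of this ancestor
            }
        }
        for (int v = 0; v < n; ++v) cout << cnt[v] << "\n";
    }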

Related

Updating a tree and keeping track of the change in the nodes of some subtree

Problem:
You are given a rooted tree where each node is numbered from 1 to N. Initially each node contains some positive value, say X. Now we are to perform two types of operations on the tree, 100000 operations in total.
First Type:
Given a node nd and a positive integer V, you need to decrease the value of all the nodes by some amount. If a node is at a distance of d from the given node then decrease its value by floor[V/(2^d)]. Do this for all the nodes.
That means value of node nd will be decreased by V (i.e, floor[V/2^0]). Values of its nearest neighbours will be decreased by floor[V/2] . And so on.
Second Type:
You are given a node nd. You have to tell the number of nodes in the subtree rooted at nd whose value is positive.
Note: The number of nodes in the tree may be up to 100000 and the initial values, X, in the nodes may be up to 1000000000. But the value V by which the decrement operation is to be performed will be at most 100000.
How can this be done efficiently? I am stuck with this problem for many days. Any help is appreciated.
My idea: I am thinking of solving this problem offline. I will store all the queries first. Then, if I can somehow find the time (i.e. after which operation) at which each node nd's value becomes less than or equal to zero (call it the death time), we can do some kind of binary search (probably using Binary Indexed Trees / Segment Trees) to answer all the queries of the second type. But the problem is that I am unable to find the death time for each node.
I have also tried to solve it online using Heavy-Light Decomposition, but I am unable to solve it with that either.
Thanks!
Given a tree with vertex weights, there exists a vertex that, when chosen as the root, has subtrees whose weights are at most half of the total. This vertex is a "balanced separator".
Here's an O((n + k) polylog(n, k, D))-time algorithm, where n is the number of vertices and k is the number of operations and D is the maximum decrease. In the first phase, we compute the "death time" of each vertex. In the second, we count the live vertices.
To compute the death times, first split each decrease operation into O(log(D)) decrease operations whose arguments are powers of two between 1 and 2^floor(lg(D)) inclusive. Do the following recursively. Let v be a balanced separator, where the weight of a vertex is one plus the number of decrease operations on it. Compute distances from v, then determine, for each time and each power of two, the cumulative number of operations on v with that effective argument (i.e., if a vertex at distance 2 from v is decreased by 2^i, then record a -1 change in the 2^(i - 2) coefficient for v). Partition the operations and vertices by subtree. For each subtree, repeat this cumulative summary for operations originating in the subtree, but make the coefficients positive instead of negative. By putting the summary for a subtree together with v's summary, we determine the influence of decrease operations originating outside of the subtree. Finally, we recurse on each subtree.
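For the separator step, here is one hedged sketch (with made-up names, not code from the answer) of finding such a balanced separator on the set of still-"alive" vertices; in the recursion above, each vertex's weight would be one plus the number of decrease operations on it.

    #include <algorithm>
    #include <iostream>
    #include <vector>
    using namespace std;

    // Returns a vertex of the tree (restricted to vertices marked alive) whose
    // removal leaves components of weight at most half of the total weight.
    int balancedSeparator(const vector<vector<int>>& adj, const vector<long long>& wt,
                          const vector<char>& alive, int start) {
        // Iterative DFS producing an order with parents before children.
        vector<int> order, par(adj.size(), -1), stk = {start};
        vector<char> seen(adj.size(), 0);
        seen[start] = 1;
        while (!stk.empty()) {
            int v = stk.back(); stk.pop_back();
            order.push_back(v);
            for (int u : adj[v])
                if (alive[u] && !seen[u]) { seen[u] = 1; par[u] = v; stk.push_back(u); }
        }
        // Subtree weights, children before parents.
        vector<long long> sub(adj.size(), 0);
        for (int i = (int)order.size() - 1; i >= 0; --i) {
            int v = order[i];
            sub[v] += wt[v];
            if (par[v] != -1) sub[par[v]] += sub[v];
        }
        long long total = sub[start];
        for (int v : order) {
            long long heaviest = total - sub[v];          // the component through the parent
            for (int u : adj[v])
                if (alive[u] && par[u] == v) heaviest = max(heaviest, sub[u]);
            if (2 * heaviest <= total) return v;          // every component <= total / 2
        }
        return start;                                     // not reached for a valid tree
    }

    int main() {
        // Tiny example: a path 0-1-2-3-4 with unit vertex weights; vertex 2 qualifies.
        vector<vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
        vector<long long> wt(5, 1);
        vector<char> alive(5, 1);
        cout << balancedSeparator(adj, wt, alive, 0) << "\n";
    }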
Now, for each vertex w, we compute the death time using binary search. The decrease operations affecting w are given in a logarithmic number of summaries computed in the manner previously described, so the total cost for one vertex is O(log^2).
It sounds as though you, the question asker, know how the next part goes, but for the sake of completeness, I'll describe it. Do a preorder traversal to assign new labels to vertices and also compute for each vertex the interval of labels that comprises its subtree. Initialize a Fenwick tree mapping each vertex to one (live) or zero (dead), initially one. Put the death times and queries in a priority queue. To process a death, decrease the value of that vertex by one. To process a query, sum the values of vertices in the subtree interval.
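A minimal sketch of this second phase (my illustration, assuming the death times have already been produced by the first phase; I sort the events by time instead of using a priority queue, which plays the same role offline):

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <utility>
    #include <vector>
    using namespace std;

    struct Fenwick {
        vector<long long> bit;
        Fenwick(int n) : bit(n + 1, 0) {}
        void add(int i, long long d) { for (++i; i < (int)bit.size(); i += i & -i) bit[i] += d; }
        long long prefix(int i) {                 // sum over positions 0..i
            long long s = 0;
            for (++i; i > 0; i -= i & -i) s += bit[i];
            return s;
        }
    };

    // kids: children of each vertex (vertex 0 is the root); death[v]: operation index
    // after which v is no longer positive (or past the last operation if it never dies);
    // queries: (time, subtree root) pairs. Returns the live count for each query.
    vector<long long> countLive(const vector<vector<int>>& kids,
                                const vector<long long>& death,
                                const vector<pair<long long, int>>& queries) {
        int n = kids.size();
        // Preorder labels; each subtree occupies the label interval [in[v], out[v]].
        vector<int> in(n), out(n);
        vector<pair<int, int>> st = {{0, 0}};     // (vertex, index of next child)
        int timer = 0;
        in[0] = timer++;
        while (!st.empty()) {
            auto& [v, i] = st.back();
            if (i < (int)kids[v].size()) {
                int u = kids[v][i++];
                in[u] = timer++;
                st.push_back({u, 0});
            } else {
                out[v] = timer - 1;
                st.pop_back();
            }
        }
        // Process deaths and queries together in time order (a death at time T is
        // visible to every query at time >= T).
        vector<int> byDeath(n), byQuery(queries.size());
        iota(byDeath.begin(), byDeath.end(), 0);
        iota(byQuery.begin(), byQuery.end(), 0);
        sort(byDeath.begin(), byDeath.end(), [&](int a, int b) { return death[a] < death[b]; });
        sort(byQuery.begin(), byQuery.end(),
             [&](int a, int b) { return queries[a].first < queries[b].first; });

        Fenwick fw(n);
        for (int v = 0; v < n; ++v) fw.add(in[v], 1);      // every vertex starts alive
        vector<long long> ans(queries.size());
        int di = 0;
        for (int qi : byQuery) {
            while (di < n && death[byDeath[di]] <= queries[qi].first)
                fw.add(in[byDeath[di++]], -1);             // this vertex has died by now
            int v = queries[qi].second;
            ans[qi] = fw.prefix(out[v]) - (in[v] ? fw.prefix(in[v] - 1) : 0);
        }
        return ans;
    }

    int main() {
        // Tiny example: root 0 with children 1 and 2; vertex 1 dies after operation 1.
        vector<vector<int>> kids = {{1, 2}, {}, {}};
        vector<long long> death = {5, 1, 10};
        vector<pair<long long, int>> queries = {{2, 0}, {0, 0}};
        for (long long a : countLive(kids, death, queries)) cout << a << "\n";   // 2, then 3
    }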

Optimal way to calculate all nodes at distance less than k from m given nodes

A graph of size n is given, and a subset of size m of its nodes is given. Find all nodes which are at a distance <= k from ALL nodes of the subset.
E.g. A->B->C->D->E is the graph, subset = {A,C}, k = 2.
Now, E is at distance <= 2 from C, but not from A, so it should not be counted.
I thought of running Breadth First Search from each node in the subset, and taking the intersection of the respective answers.
Can it be further optimized?
I went through many posts on SO, but they all point to k-d trees, which I don't understand, so is there any other way?
I can think of two non-asymptotic (I believe) optimizations:
If you're done with BFS from one of the subset nodes, delete all nodes that have distance > k from it
Start with the two nodes in the subset whose distance is largest to get the smallest possible leftover graph
Of course this doesn't help if k is large (close to n); I have no idea in that case. I am positive however that k-d trees are not applicable to general graphs :)
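For reference, a small sketch of the baseline being discussed, with the pruning idea folded in (each depth-limited BFS result is intersected with what has survived so far, and we can stop as soon as the intersection is empty); the graph representation and names are illustrative.

    #include <algorithm>
    #include <iostream>
    #include <queue>
    #include <unordered_set>
    #include <utility>
    #include <vector>
    using namespace std;

    // Nodes within k hops of src in an unweighted, adjacency-list graph.
    unordered_set<int> withinK(const vector<vector<int>>& adj, int src, int k) {
        vector<int> dist(adj.size(), -1);
        queue<int> q;
        dist[src] = 0; q.push(src);
        unordered_set<int> result = {src};
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (dist[v] == k) continue;           // don't expand past distance k
            for (int u : adj[v])
                if (dist[u] == -1) {
                    dist[u] = dist[v] + 1;
                    result.insert(u);
                    q.push(u);
                }
        }
        return result;
    }

    // Intersect the per-source result sets, giving up early if nothing is left.
    unordered_set<int> withinKOfAll(const vector<vector<int>>& adj,
                                    const vector<int>& subset, int k) {
        unordered_set<int> cur;
        bool first = true;
        for (int s : subset) {
            unordered_set<int> reach = withinK(adj, s, k);
            if (first) { cur = move(reach); first = false; continue; }
            unordered_set<int> kept;
            for (int v : cur) if (reach.count(v)) kept.insert(v);
            cur = move(kept);
            if (cur.empty()) break;               // nothing can be within k of all nodes
        }
        return cur;
    }

    int main() {
        // The example from the question: the path A-B-C-D-E (as 0..4), subset {A, C}, k = 2.
        vector<vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
        auto res = withinKOfAll(adj, {0, 2}, 2);
        vector<int> sorted(res.begin(), res.end());
        sort(sorted.begin(), sorted.end());
        for (int v : sorted) cout << v << " ";    // prints 0 1 2, i.e. A, B and C
        cout << "\n";
    }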
Niklas B.'s optimizations can be applied to both of the following optimizations.
Optimization #1: Modify BFS to do the intersection as it runs rather than afterwards.
The BFS and intersection seems to be the way to go. However, there is redundant work being done by the BFS. Specifically, it is expanding nodes that it doesn't need to expand (after the first BFS). This can be resolved by merging the intersection aspect into the BFS.
The solution seems to be to keep two sets of nodes, call them "ToVisit" and "Visited", rather than label nodes visited or not.
The new rules of the BFS are as follows:
Only nodes in ToVisit are expanded upon by the BFS. They are then moved from ToVisit to Visited to prevent being expanded twice.
The algorithm returns the Visited set as its result, and any nodes left in ToVisit are discarded. The result is then used as the ToVisit set for the next node.
The first node either uses a standard BFS algorithm or ToVisit is the list of all nodes. Either way, the result becomes the second ToVisit set for the second node.
This works better if the ToVisit set is small on average, which tends to be the case when m and k are much less than N.
Optimization #2: Pre-compute the distances if there are enough queries so queries just do intersections.
Note that this is incompatible with the first optimization. If there is a sufficient number of queries on differing subsets and k values, then it is better to find the distances between every pair of nodes ahead of time at a cost of O(VE).
This way you only need to do the intersections, which is O(V*M*Q), where Q is the number of queries, M is the average size of the subset over the queries and V is the number of nodes. If it is expected to be the case that O(M*Q) > O(E), then this approach should be less work. Noting the distance between the two most distant nodes is useful, as any k equal to or greater than it will always return the set of all vertices, resulting in just O(V) query cost in that case.
The distance data should then be stored in four forms.
The first is "kCount[A][k] = number of nodes with distance k or less from A". This provides an alternative to Niklas B.'s suggestion of "Start with the two nodes in the subset whose distance is largest to get the smallest possible leftover graph" in the case that O(m) > O(sqrt(V)), since finding the most distant pair is O(m^2) and it may be better to avoid looking for the best starting pair and just pick a good choice. You can start with the two nodes in the subset with the smallest value for the given k in this data structure. You could also just sort the nodes in the subset by this metric and do the intersections in that order.
The second is "kMax[A] = max k for A", which can be done using a hashmap/dictionary. If k >= this value, then this node can be skipped, unless kCount[A][kMax[A]] < (number of vertices), meaning not all nodes are reachable from A.
The third is "kFrom[A][k] = set of nodes at distance k from A". Since k is valid from 0 to the max distance, a hashmap/dictionary to an array/list could be used here rather than a nested hashmap/dictionary. This allows for space- and time-efficient* creation of the set of nodes with distance <= k from A.
The fourth is "dist[A][B] = distance from A to B", this can be done using a nested hashmap/dictionary. This allows for handling the intersection checks fairly quickly.
* If space isn't an issue, then this structure can store all the nodes at distance k or less from A, but that requires O(V^3) space and thus time. The main benefit, however, is that it also allows storing a separate list of the nodes at distance greater than k. This lets the algorithm use the smaller of the two sets, dist > k or dist <= k: an intersection in the case of dist <= k, set subtraction in the case of dist > k, or intersection followed by set subtraction if the main set has the smallest size.
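A minimal sketch of the precompute-then-intersect idea (only dist[A][B] from the fourth structure; kCount, kMax and kFrom above are refinements on top of this), with illustrative names:

    #include <iostream>
    #include <queue>
    #include <vector>
    using namespace std;

    // All-pairs hop distances via one BFS per node (-1 means unreachable); O(VE) total.
    vector<vector<int>> allPairs(const vector<vector<int>>& adj) {
        int n = adj.size();
        vector<vector<int>> dist(n, vector<int>(n, -1));
        for (int s = 0; s < n; ++s) {
            queue<int> q; q.push(s); dist[s][s] = 0;
            while (!q.empty()) {
                int v = q.front(); q.pop();
                for (int u : adj[v])
                    if (dist[s][u] == -1) { dist[s][u] = dist[s][v] + 1; q.push(u); }
            }
        }
        return dist;
    }

    // A query is then a pure intersection: keep the nodes within k of every subset node.
    vector<int> query(const vector<vector<int>>& dist, const vector<int>& subset, int k) {
        vector<int> out;
        for (int v = 0; v < (int)dist.size(); ++v) {
            bool ok = true;
            for (int s : subset)
                if (dist[s][v] == -1 || dist[s][v] > k) { ok = false; break; }
            if (ok) out.push_back(v);
        }
        return out;
    }

    int main() {
        vector<vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};   // the A-B-C-D-E path again
        auto dist = allPairs(adj);
        for (int v : query(dist, {0, 2}, 2)) cout << v << " ";          // prints 0 1 2
        cout << "\n";
    }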
Add a new node (let's say s) and connect it to all the m given nodes.
Then, find all the nodes which are at a distance less than or equal to k+1 from s and subtract m from it. T(n)=O(V+E)

Given a node network, how to find the highest scoring loop with finite number of moves?

For a project of mine, I'm attempting to create a solver that, given a random set of weighted nodes with weighted paths, will find the highest scoring path with a finite number of moves. I've created a visual to help describe the problem.
This example has all the connection edges shown for completeness. The numbers on edges are traversal costs and the numbers inside nodes are scores. A node's score is only counted when the node is traversed to, and a node cannot traverse to itself from itself.
As you can see from the description in the image, there is a start/finish node with randomly placed nodes that each have an arbitrary score. Every node is connected to all other nodes and every connection has an arbitrary weight that subtracts from the total number of move units remaining. For simplicity, you could assume the weight of a connection is a function of distance. Nodes can be traveled to more than once and their score is applied again. The goal is to find a loop path that has the highest score for the given move limit.
The solver will never be dealing with more than 30 nodes, usually dealing with 10-15 nodes. I still need to try and make it as fast as possible.
Any ideas on algorithms or methods that would help me solve this problem other than pure brute force methods?
Here's an O(m n^2)-time algorithm, where m is the number of moves and n is the number of nodes.
For every time t in {0, 1, ..., m} and every node v, compute the maximum score of a t-step walk that begins at the start node and ends at v as follows. If t = 0, then there's only one walk, namely, doing nothing at the start node, so the maximum for (0, v) is 0 if v is the start node and -infinity (i.e., impossible) otherwise.
For t > 0, we use the entries for t - 1 to compute the entries for t. To compute the (t, v) entry, we add the score for v to the difference of the maximum over all nodes w of the (t - 1, w) entry minus the transition penalty from w to v. In other words, an optimal t-step walk to v consists of a step from some node w to v preceded by a (t - 1)-step walk to w, and this (t - 1)-step walk must be optimal because history does not influence future scoring.
At the end, we look at the (m, start node) entry. Recovering the actual walk involves working backward and repeatedly determining which w was the best node to have come from.
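A direct sketch of this table-filling DP on a tiny made-up instance (the node scores, costs and move count below are illustrative, not from the question):

    #include <algorithm>
    #include <climits>
    #include <iostream>
    #include <utility>
    #include <vector>
    using namespace std;

    int main() {
        // Tiny hypothetical instance: 3 nodes, start node 0, complete graph.
        int n = 3, m = 4, start = 0;              // m = number of moves allowed
        vector<long long> score = {0, 5, 3};      // score gained each time a node is entered
        vector<vector<long long>> cost = {        // cost[w][v] = penalty for moving w -> v
            {0, 2, 1},
            {2, 0, 1},
            {1, 1, 0}};
        const long long NEG = LLONG_MIN / 4;      // "impossible"

        vector<long long> prevRow(n, NEG), curRow(n);
        prevRow[start] = 0;                       // t = 0: only the empty walk, at the start
        for (int t = 1; t <= m; ++t) {
            for (int v = 0; v < n; ++v) {
                long long best = NEG;
                for (int w = 0; w < n; ++w) {
                    if (w == v || prevRow[w] == NEG) continue;   // no moves from a node to itself
                    best = max(best, prevRow[w] - cost[w][v]);
                }
                curRow[v] = (best == NEG) ? NEG : best + score[v];
            }
            swap(prevRow, curRow);
        }
        // The best m-move loop ends back at the start node. To recover the walk itself,
        // keep the whole (t, v) table and work backwards through it as described above.
        cout << prevRow[start] << "\n";           // 7 for this instance (e.g. 0 -> 1 -> 2 -> 1 -> 0)
    }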

Split a tree into equal parts by deleting an edge

I am looking for an algorithm to split a tree with N nodes (where the maximum degree of each node is 3) by removing one edge from it, so that the two trees that result have sizes as close as possible to N/2. How do I find the edge that is "the most centered"?
The tree comes as an input from a previous stage of the algorithm and is input as a graph - so it's not balanced nor is it clear which node is the root.
My idea is to find the longest path in the tree and then select the edge in the middle of the longest path. Does it work?
Ideally, I am looking for a solution that can ensure that neither of the trees has more than 2N/3 nodes.
Thanks for your answers.
I don't believe that your initial algorithm works for the reason I mentioned in the comments. However, I think that you can solve this in O(n) time and space using a modified DFS.
Begin by walking the graph to count how many total nodes there are; call this n. Now, choose an arbitrary node and root the tree at it. We will now recursively explore the tree starting from the root and will compute for each subtree how many nodes are in each subtree. This can be done using a simple recursion:
If the current node is null, return 0.
Otherwise:
For each child, compute the number of nodes in the subtree rooted at that child.
Return 1 + the total number of nodes in all child subtrees
At this point, we know for each edge what split we will get by removing that edge, since if the subtree below that edge has k nodes in it, the split will be (k, n - k). You can thus find the best cut to make by iterating across all nodes and looking for the one that balances (k, n - k) most evenly.
Counting the nodes takes O(n) time, and running the recursion visits each node and edge at most O(1) times, so that takes O(n) time as well. Finding the best cut takes an additional O(n) time, for a net runtime of O(n). Since we need to store the subtree node counts, we need O(n) memory as well.
Hope this helps!
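Here is a compact sketch of that procedure (iterative rather than recursive, to avoid deep call stacks on long paths; the input format is assumed for illustration): count subtree sizes, then pick the edge whose removal gives the most even (k, n - k) split.

    #include <climits>
    #include <iostream>
    #include <vector>
    using namespace std;

    int main() {
        // Assumed input: n, followed by the n-1 undirected edges "a b" (0-based).
        int n; cin >> n;
        vector<vector<int>> adj(n);
        for (int i = 0; i + 1 < n; ++i) {
            int a, b; cin >> a >> b;
            adj[a].push_back(b);
            adj[b].push_back(a);
        }
        if (n < 2) { cout << "nothing to cut\n"; return 0; }

        // Root arbitrarily at 0; produce an order with parents before children.
        vector<int> par(n, -1), order; order.reserve(n);
        vector<int> stk = {0};
        vector<char> seen(n, 0); seen[0] = 1;
        while (!stk.empty()) {
            int v = stk.back(); stk.pop_back();
            order.push_back(v);
            for (int u : adj[v]) if (!seen[u]) { seen[u] = 1; par[u] = v; stk.push_back(u); }
        }
        // Subtree sizes, children before parents.
        vector<int> sz(n, 1);
        for (int i = n - 1; i >= 1; --i) sz[par[order[i]]] += sz[order[i]];

        // Removing the edge (v, par[v]) splits the tree into sz[v] and n - sz[v] nodes.
        int bestV = order[1], bestDiff = INT_MAX;
        for (int v = 0; v < n; ++v) {
            if (par[v] == -1) continue;           // the root has no parent edge
            int diff = n - 2 * sz[v];             // sz[v] - (n - sz[v]), up to sign
            if (diff < 0) diff = -diff;
            if (diff < bestDiff) { bestDiff = diff; bestV = v; }
        }
        cout << "cut edge (" << bestV << ", " << par[bestV] << "), parts of size "
             << sz[bestV] << " and " << n - sz[bestV] << "\n";
    }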
If you look at my answer to Divide-And-Conquer Algorithm for Trees, you can see that I find a node that partitions the tree into two nearly equal-sized trees (a bottom-up algorithm); now you just need to choose one of that node's edges to do what you want.
Your current approach does not work: assume you have a complete binary tree, then add a path of length 3*log n to one of its leaves (call it the bad leaf). Your longest path will run from one of the other leaves to the end of the path attached to the bad leaf, and its middle edge will lie within this appended path (in fact past the bad leaf), so if you partition based on this edge you get one part of size O(log n) and another part of size O(n).

How to find the maximum-weight path between two vertices in a DAG?

In a DAG G with non-negative edge weights, how do you find the maximum-weight path between two vertices in G?
Thank you guys!
You can solve this in O(n + m) time (where n is the number of nodes and m the number of edges) using a topological sort. Begin by doing a topological sort on the reverse graph, so that you have all the nodes ordered in a way such that no node is visited before all its children are visited.
Now, we're going to label all the nodes with the weight of the highest-weight path starting with that node. This is done based on the following recursive observation:
The weight of the highest-weight path starting from a sink node (any node with no outgoing edges) is zero, since the only path starting from that node is the length-zero path of just that node.
The weight of the highest-weight path starting from any other node is given by the maximum weight of any path formed by following an outgoing edge to a node, then taking the maximum-weight path from that node.
Because we have the nodes reverse-topologically sorted, we can visit all of the nodes in an order that guarantees that if we ever try following an edge and looking up the cost of the heaviest path at the endpoint of that edge, we will have already computed the maximum-weight path starting at that node. This means that once we have the reverse topological sorted order, we can apply the following algorithm to all the nodes in that order:
If the node has no outgoing edges, record the weight of the heaviest path starting at that node (denoted d(u)) as zero.
Otherwise, for each edge (u, v) leaving the current node u, compute l(u, v) + d(v), and set d(u) to be the largest value attained this way.
Once we've done this step, we can make one last pass over all the nodes and return the highest value of d attained by any node.
The runtime of this algorithm can be analyzed as follows. Computing a topological sort can be done in O(n + m) time using many different methods. When we then scan over each node and each outgoing edge from each node, we visit each node and edge exactly once. This means that we spend O(n) time on the nodes and O(m) time on the edges. Finally, we spend O(n) time on one final pass over the elements to find the highest weight path, which takes O(n). This gives a grand total of O(n + m) time, which is linear in the size of the input.
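A compact sketch of this DP on a small made-up DAG (the edge list and weights are only for illustration; Kahn's algorithm stands in for the "many different methods" of topological sorting, and sorting the original graph then walking it backwards is equivalent to sorting the reverse graph):

    #include <algorithm>
    #include <array>
    #include <iostream>
    #include <queue>
    #include <utility>
    #include <vector>
    using namespace std;

    int main() {
        // Hypothetical DAG given as (u, v, weight) edges.
        int n = 5;
        vector<array<int, 3>> edges = {{0, 1, 3}, {0, 2, 1}, {1, 3, 4}, {2, 3, 8}, {3, 4, 2}};
        vector<vector<pair<int, int>>> out(n);
        vector<int> indeg(n, 0);
        for (auto& e : edges) { out[e[0]].push_back({e[1], e[2]}); indeg[e[1]]++; }

        // Kahn's algorithm for a topological order of the graph.
        vector<int> topo; topo.reserve(n);
        queue<int> q;
        for (int v = 0; v < n; ++v) if (indeg[v] == 0) q.push(v);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            topo.push_back(v);
            for (auto& e : out[v]) if (--indeg[e.first] == 0) q.push(e.first);
        }

        // Process nodes in reverse topological order: every successor of a node is
        // finished before the node itself, so d[v] is final when we read it here.
        vector<long long> d(n, 0);                // sinks keep d = 0
        for (int i = n - 1; i >= 0; --i) {
            int u = topo[i];
            for (auto [v, w] : out[u]) d[u] = max(d[u], w + d[v]);
        }
        cout << *max_element(d.begin(), d.end()) << "\n";    // 11 here (path 0 -> 2 -> 3 -> 4)
    }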
A simple brute-force algorithm can be written using recursive functions.
Start with an empty vector (in C++: std::vector) and insert the first node.
Then call your recursive function with the vector as argument that does the following:
loop over all neighbours and for each neighbour
copy the vector
add the neighbour
call ourself
Also add the total weight as argument to the recursive function and add the weight in every recursive call.
The function should stop whenever it reaches the end node. Then compare the total weight with the maximum weight you have so far (use a global variable) and if the new total weight is bigger, set the maximum weight and store the vector.
The rest is up to you.
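For completeness, here is one way that brute-force recursion could look in C++ (my own fleshed-out sketch of the steps above, assuming the graph is a DAG so the recursion terminates; the small example graph is made up):

    #include <climits>
    #include <iostream>
    #include <utility>
    #include <vector>
    using namespace std;

    vector<vector<pair<int, int>>> adj;           // adj[u] = list of (neighbour, edge weight)
    long long maxWeight = LLONG_MIN;              // the global maximum seen so far
    vector<int> bestPath;

    void explore(const vector<int>& path, long long total, int goal) {
        int u = path.back();
        if (u == goal) {                          // stop whenever the end node is reached
            if (total > maxWeight) { maxWeight = total; bestPath = path; }
            return;
        }
        for (auto [v, w] : adj[u]) {              // loop over all neighbours
            vector<int> extended = path;          // copy the vector
            extended.push_back(v);                // add the neighbour
            explore(extended, total + w, goal);   // call ourselves, adding the weight
        }
    }

    int main() {
        // The same small hypothetical DAG as in the sketch above; find the heaviest
        // path from node 0 to node 4.
        adj = {{{1, 3}, {2, 1}}, {{3, 4}}, {{3, 8}}, {{4, 2}}, {}};
        explore({0}, 0, 4);                       // start with just the first node
        cout << maxWeight << "\n";                // 11
        for (int v : bestPath) cout << v << " ";  // 0 2 3 4
        cout << "\n";
    }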

Resources