Nearest node in a tree - algorithm

Recently I encountered this problem on trees, for which I found an O(n*q) solution. I am wondering whether there is a better way to deal with it in lower complexity.
The problem is as follows:
Given an unweighted tree of n nodes (n >= 1, and n can go up to 10^5), each node can be special or non-special. Node 1 is always special and the rest are initially non-special. There are two operations:
1. We can update any non-special node to a special node with the operation "U Node_Number"
OR
2. At any time, we can ask "Q Node_Number", which should return the special node in the tree closest to Node_Number.
The number of operations can also go up to 10^5.
My Solution :
I thought of creating an adjacency list. For operation 1, I can keep track of special vs. non-special with a boolean flag. But for operation 2, my solution consists of doing a BFS whenever "Q Node_Number" is asked, taking Node_Number as the root of the BFS.
But the complexity is quadratic. Is this the optimal way of going about this problem?

Here's an O(n^1.5 + n^0.5 q)-time algorithm via sqrt decomposition. We need a constant-time distance oracle (this is basically least common ancestors). The idea is: every n^0.5 times a node is made special, perform a breadth-first search from all special nodes, which yields for each node in the tree the closest node that is currently special. On each query, take the closest of (i) the nodes that were special as of the last breadth-first search and (ii) the at most n^0.5 newly special nodes.
As I mentioned in the comments, I expect that there's a very complicated O((n + q) log n)-time algorithm via top trees.
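A minimal sketch of that sqrt-decomposition idea in Python (the class and method names are my own; for brevity the distance helper below just climbs parent pointers instead of using a real constant-time LCA oracle, so it illustrates the structure rather than the exact stated bound):

from collections import deque
from math import isqrt

class NearestSpecial:
    # Nodes are 1..n, adj[v] is the list of neighbours of v, node 1 starts special.
    def __init__(self, n, adj):
        self.n, self.adj = n, adj
        self.block = max(1, isqrt(n))        # rebuild after ~sqrt(n) updates
        self.parent = [0] * (n + 1)
        self.depth = [0] * (n + 1)
        self._root_bfs()                     # parents/depths for distance queries
        self.special = {1}
        self.pending = []                    # specials added since the last rebuild
        self._rebuild()

    def _root_bfs(self):
        seen = [False] * (self.n + 1)
        seen[1] = True
        q = deque([1])
        while q:
            v = q.popleft()
            for w in self.adj[v]:
                if not seen[w]:
                    seen[w] = True
                    self.parent[w] = v
                    self.depth[w] = self.depth[v] + 1
                    q.append(w)

    def _rebuild(self):
        # multi-source BFS from every special node: for each node, record the
        # closest currently special node and the distance to it
        self.dist = [-1] * (self.n + 1)
        self.nearest = [0] * (self.n + 1)
        q = deque()
        for s in self.special:
            self.dist[s] = 0
            self.nearest[s] = s
            q.append(s)
        while q:
            v = q.popleft()
            for w in self.adj[v]:
                if self.dist[w] == -1:
                    self.dist[w] = self.dist[v] + 1
                    self.nearest[w] = self.nearest[v]
                    q.append(w)
        self.pending = []

    def _tree_dist(self, u, v):
        # naive distance by climbing parent pointers; a real solution would use
        # an LCA structure here to get O(1) or O(log n) per distance query
        d = 0
        while u != v:
            if self.depth[u] < self.depth[v]:
                u, v = v, u
            u = self.parent[u]
            d += 1
        return d

    def update(self, v):                     # "U v"
        if v not in self.special:
            self.special.add(v)
            self.pending.append(v)
            if len(self.pending) >= self.block:
                self._rebuild()

    def query(self, v):                      # "Q v": closest special node to v
        best, best_d = self.nearest[v], self.dist[v]
        for s in self.pending:               # at most ~sqrt(n) extra candidates
            d = self._tree_dist(v, s)
            if d < best_d:
                best, best_d = s, d
        return best

Usage would be ns = NearestSpecial(n, adj), then ns.update(v) for "U v" and ns.query(v) for "Q v".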

Related

Why is the time complexity O(lgN) for an operation in weighted union find algorithm?

So everywhere I see the weighted union-find algorithm, they use this approach:
Maintain an array pointing to the parent node of a particular node
Maintain an array denoting the size of the tree a node is in
For union(p, q), merge the smaller tree into the larger one
The time complexity here is O(lg N)
Now an optimization on this is to flatten the trees, i.e., whenever I am calculating the root of a particular node, set all nodes on that path to point to that root.
Time complexity of this is O(lg*N)
This I can understand, but what I don't get is why don't they start off with an array/hashset wherein nodes point to the root (instead of the immediate parent node)? That would bring the time complexity down to O(1).
I am going to assume that the time complexity you are asking for is the time to check whether 2 nodes belong to the same set.
The key is in how the sets are joined: you take the root of one set (the smaller one) and have it point to the root of the other set. Let the two sets have p and q as roots respectively, and let |p| denote the size of the set p if p is a root; in general it is the number of items whose set path goes through p (which is 1 plus all its descendants).
We can assume without loss of generality that |p| <= |q| (otherwise we just exchange their names). We then have |p ∪ q| = |p| + |q| >= 2|p|. This shows that each subtree in the data structure can be at most half as big as its parent, so given N items the tree can have depth at most 1 + lg N = O(lg N).
If the two chosen items are as far as possible from the root, it will take O(lg N) operations to find the root for each of their sets, since you only need O(1) operations to move up one layer in the set, and then O(1) operations to compare those roots.
This cost also applies to each union operation itself, since you need to figure out which two roots to merge. The reason we don't just have all nodes point directly to the root is several-fold. First, we would need to change all the nodes in the set every time we perform a union; second, we only have edges pointing from the nodes toward the root and not the other way, so we would have to look through all nodes to find the ones that need to change. Another reason is that the good optimizations we have (such as the path compression you mention) already help with this kind of cost while keeping the structure cheap to update. Finally, you could do such a flattening step at the end if you really need to, but it would cost O(N lg N) time, which is comparable to how long it takes to run the entire algorithm without the shortcut optimization.
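For concreteness, here is a small Python sketch of the weighted (union-by-size) structure being discussed, with the path-compression flattening from the question added in find; the class and method names are my own, so treat it as an illustration rather than a canonical implementation.

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))       # parent[i] == i means i is a root
        self.size = [1] * n                # size of the tree rooted at i

    def find(self, x):
        root = x
        while self.parent[root] != root:   # climb at most O(lg N) layers
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: point the whole
            self.parent[x], x = root, self.parent[x]   # path directly at the root
        return root

    def union(self, p, q):
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return
        if self.size[rp] < self.size[rq]:  # always attach the smaller tree
            rp, rq = rq, rp                # under the larger one
        self.parent[rq] = rp
        self.size[rp] += self.size[rq]

    def connected(self, p, q):             # same-set check: two find operations
        return self.find(p) == self.find(q)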
You are correct that the solution you suggest will bring down the time complexity of a Find operation to O(1). However, doing so will make a Union operation slower.
Imagine that you use an array/hashtable to remember the representative (or the root, as you call it) of each node. When you perform a union operation between two nodes x and y, you would either need to update all the nodes with the same representative as x to have y's representative, or vice versa. This way, union runs in O(min{|Sx|, |Sy|}), where Sx is the set of nodes with the same representative as x. These sets can be significantly larger than log n.
The weighted union algorithm, on the other hand, takes O(log n) for both Find and Union.
So it's a trade-off. If you expect to do many Find operations, but few Union operations, you should use the solution you suggest. If you expect to do many of each, you can use the weighted union algorithm to avoid excessively slow operations.
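For comparison, here is a sketch of the alternative the question proposes (often called quick-find): every node stores its representative directly, so Find is O(1), but Union has to relabel an entire set; relabelling the smaller set gives exactly the O(min{|Sx|, |Sy|}) bound mentioned above. Again, the names are my own.

class QuickFind:
    def __init__(self, n):
        self.rep = list(range(n))                # representative of each node
        self.members = [{i} for i in range(n)]   # nodes sharing each representative

    def find(self, x):                           # O(1)
        return self.rep[x]

    def union(self, p, q):                       # O(min{|Sp|, |Sq|}) relabelling
        rp, rq = self.rep[p], self.rep[q]
        if rp == rq:
            return
        if len(self.members[rp]) < len(self.members[rq]):
            rp, rq = rq, rp                      # relabel the smaller set
        for x in self.members[rq]:
            self.rep[x] = rp
        self.members[rp] |= self.members[rq]
        self.members[rq] = set()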

How is worst case time complexity of constructing suffix tree linear?

I have trouble understanding how the worst-case time complexity of constructing a suffix tree is linear - particularly when we need to build a suffix tree for a string composed of a single repeating character, such as "aaaaa".
Even if I were to construct a compressed suffix tree for "aaaaa", I won't really be able to compress any nodes, since no two edges starting out of a node can have string labels beginning with the same character.
This would result in a suffix tree of height 5, and at each insertion of a suffix, I would need to keep traversing from the root to the leaf.
Here is how I approached it:
suffixes: a, aa, aaa, aaaa, aaaaa
Create the root node, create an edge bearing 'a' and connect it to a new node, which to its left bears a "$" edge, and repeat this process until we have inserted "aaaaa".
This would result in O(n^2) instead of O(n).
What am I missing here?
I agree with the comments, but here are some more details:
The procedure you describe for forming the aaaaa tree is O(n), not O(n^2). Node and leaf creation are constant-time operations, and you perform them n==5 times. Your assumption of O(n^2) seems to be based on the idea that you'd traverse from root to leaf in each step, but there is no need to do this; e.g. in Ukkonen's algorithm:
You keep a pointer to the node where you left off before inserting the next suffix.
And in the case of repetitions you don't perform any work until the repetitions end; then you make the insertions of the final $ mark one by one, following down the characters on the edge you have created, as well as the chain of suffix links in the case of more complex repetitions.
The key to why the Ukkonen algorithm (details here) is O(n) is that it maintains a "memory" of where to make inserts, in the form of (a) a pointer to where the previous insert was made, and (b) a network of suffix links. That network can be vast, but there is only one suffix link per internal node, so it's still O(n) in size.
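To make the repeated-character case concrete, here is a small Python sketch (not Ukkonen's algorithm; the names are my own) that builds the suffix tree of a single repeated character followed by '$' while keeping a pointer to the node created in the previous step, so it never re-traverses from the root:

class Node:
    def __init__(self):
        self.children = {}            # first char of edge label -> (label, child)

def build_repeat_tree(c, n):
    """Suffix tree of c*n + '$', e.g. build_repeat_tree('a', 5) for 'aaaaa$'."""
    root = Node()
    root.children['$'] = ('$', Node())           # leaf for the suffix "$"
    current = root                                # pointer: never restart at the root
    for depth in range(1, n + 1):
        child = Node()
        # one more copy of c; the deepest edge also absorbs the final '$'
        label = c if depth < n else c + '$'
        current.children[c] = (label, child)
        if depth < n:
            child.children['$'] = ('$', Node())   # leaf for the suffix c*(n-depth) + '$'
        current = child                           # O(1) step, no root-to-leaf walk
    return root

Each iteration creates a constant number of nodes and edges, and the tree for "aaaaa$" ends up with 2n + 1 = 11 nodes, so both the tree and its construction are linear for this input.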

What does the beam size represent in the beam search algorithm?

I have a question about the beam search algorithm.
Let's say that n = 2 (the number of nodes we are going to expand from every node). So, at the beginning, we only have the root, from which we expand 2 nodes. Now, from each of those two nodes, we expand two more. So, at the moment, we have 4 leaves. We will continue like this till we find the answer.
Is this how beam search works? Does it expand only n = 2 children of every node, or does it keep 2 leaf nodes at all times?
I used to think that n = 2 means we should have at most 2 active nodes from each node, not two for the whole tree.
In the "standard" beam search algorithm, at every step, the total number of the nodes you currently "know about" is limited - and NOT the number of nodes you will follow from each node.
Concretely, if n = 2, it means that the "beam" will be of size at most 2, at all times. So, initially, you start from one node, then you discover all nodes that are reachable from it, but discard all of them but two, and finish step 1 with 2 nodes. At step 2, you have two nodes, and you will expand both, and discard all nodes again, except exactly 2 nodes (total, not from each!). In the next steps, similarly, you will keep 2 nodes after each step.
Choosing which node to keep is usually done by some heuristic function that evaluates which node is closest to the target.
Note that the beam search algorithm is not complete (i.e., it may not find a solution even if one exists) nor optimal (i.e., it may not find the best solution). The best way to see this is to observe that when n = 1, it basically reduces to greedy best-first search.
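A minimal sketch of that behaviour in Python, where successors(node), heuristic(node) and is_goal(node) are hypothetical problem-specific functions; the important point is that the frontier is pruned to k nodes in total after every step, not k per node.

def beam_search(start, successors, heuristic, is_goal, k=2, max_steps=1000):
    frontier = [start]                            # the "beam": at most k nodes in total
    for _ in range(max_steps):
        candidates = []
        for node in frontier:
            if is_goal(node):
                return node
            candidates.extend(successors(node))   # expand every node in the beam
        if not candidates:
            return None                           # nothing left to explore
        candidates.sort(key=heuristic)            # lower heuristic = closer to target
        frontier = candidates[:k]                 # keep only the k best (total, not per node)
    return None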
In beam search, instead of choosing the best token to generate at each timestep, we keep k possible tokens at each step. This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
Thus at the first step of decoding, we compute a softmax over the entire vocabulary, assigning a probability to each word. We then select the k-best options from this softmax output. These initial k outputs are the search frontier and these k initial words are called hypotheses. A hypothesis is an output sequence, a translation-so-far, together with its probability.
At subsequent steps, each of the k best hypotheses is extended incrementally by being passed to distinct decoders, which each generate a softmax over the entire vocabulary to extend the hypothesis to every possible next token. Each of these k∗V hypotheses is scored by P(yi|x,y<i): the product of the probability of current word choice multiplied by the probability of the path that led to it. We then prune the k∗V hypotheses down to the k best hypotheses, so there are never more than k hypotheses at the frontier of the search, and never more than k decoders.
The beam size (or beam width) is the k mentioned above.
Source: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
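In the decoding setting just described, the same idea looks roughly like the following sketch. Here next_token_probs(prefix) stands in for the model's softmax over the vocabulary (a hypothetical function assumed to return strictly positive probabilities), and log-probabilities are summed so the path score does not underflow.

import math

def beam_decode(next_token_probs, vocab, k=4, max_len=20, eos="</s>"):
    # each hypothesis is (sequence so far, total log-probability)
    beam = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            if seq and seq[-1] == eos:            # finished hypotheses survive as-is
                candidates.append((seq, score))
                continue
            probs = next_token_probs(seq)         # softmax over the whole vocabulary
            for token in vocab:
                candidates.append((seq + [token], score + math.log(probs[token])))
        # prune the (at most) k*V candidates back down to the k best hypotheses
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:k]
    return max(beam, key=lambda h: h[1])[0]       # best hypothesis at the end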

Checking if A is a part of binary tree B

Let's say I have binary trees A and B and I want to know if A is a "part" of B. I am not only talking about subtrees. What I want to know is if B has all the nodes and edges that A does.
My thought was that, since a tree is essentially a graph, I could view this question as a subgraph isomorphism problem (i.e., checking whether A is a subgraph of B). But according to Wikipedia this is an NP-complete problem.
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
I know that you can check if A is a subtree of B or not with O(n) algorithms (e.g. using preorder and inorder traversals to flatten the trees to strings and checking for substrings). I was trying to modify this a little to see if I can also test for just "parts" as well, but to no avail. This is where I'm stuck.
Are there any other ways to view this problem other than using subgraph isomorphism? I'm thinking there must be faster methods since binary trees are much more restricted and simpler versions of graphs.
Thanks in advance!
EDIT: I realized that the worst case for even a brute-force method for my question would only take O(m * n), which is polynomial. So I guess this isn't an NP-complete problem after all. Then my next question is: is there an algorithm that is faster than O(m*n)?
I would approach this problem in two steps:
Find the root of A in B (either BFS or DFS)
Verify that A is contained in B (given that starting node), using a recursive algorithm, as below (I concocted some crazy pseudo-language, because you didn't specify the language; I think this should be understandable, no matter your background). Note that a is a node from A (initially the root) and b is a node from B (initially the node found in step 1)
function checkTrees(node a, node b) returns boolean
    if a does not exist or b does not exist then
        // base of the recursion
        return false
    else if a is different from b then
        // compare the current nodes
        return false
    else
        // check the children of a
        boolean leftFound = true
        boolean rightFound = true
        if a.left exists then
            // try to match the left child of a with
            // every possible neighbor of b
            leftFound = checkTrees(a.left, b.left)
                or checkTrees(a.left, b.right)
                or checkTrees(a.left, b.parent)
        if a.right exists then
            // try to match the right child of a with
            // every possible neighbor of b
            rightFound = checkTrees(a.right, b.left)
                or checkTrees(a.right, b.right)
                or checkTrees(a.right, b.parent)
        return leftFound and rightFound
About the running time: let m be the number of nodes in A and n be the number of nodes in B. The search in the first step takes O(n) time. The running time of the second step depends on one crucial assumption I made, but that might be wrong: I assumed that every node of A is equal to at most one node of B. If that is the case, the running time of the second step is O(m) (because you can never search too far in the wrong direction). So the total running time would be O(m + n).
While writing down my assumption, I start to wonder whether that's not oversimplifying your case...
You could compare the trees bottom-up, as follows:
For each leaf in tree A, identify the corresponding node in tree B.
Start a parallel traversal towards the root in both trees from the nodes just matched.
Specifically, move to the parent of a node in A and subsequently move towards the root in B until you either encounter the corresponding node in B (proceed), a marked node in A (see below; if a match in B is found, proceed, else fail), or the root of B (fail).
Mark all nodes visited in A.
You succeed if you haven't failed ;-).
The main part of the algorithm runs in O(e_B): in the worst case, all edges in B are visited a constant number of times. The leaf-node matching runs in O(n_A * log n_B) if the B vertices are sorted, and otherwise in O(n_A * log n_A + n_B * log n_B + n) = O(n_B * log n_B) (sort each node set, then scan the results linearly).
EDIT:
Re-reading your question, the above-mentioned step 2 is even easier: for matching nodes in A and B, their parents must match too (otherwise there would be a mismatch between the edge sets). No effect on the worst-case run time, of course.
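Under the simplifying assumption that both answers lean on (each node value occurs at most once in B), the edge-set observation in the EDIT can be coded very directly. This hypothetical Python sketch collects the parent-child edges of both trees as unordered value pairs and checks that every node and edge of A also appears in B, in O(m + n) expected time.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def values(root):
    """Set of all node values in the tree."""
    out, stack = set(), [root] if root else []
    while stack:
        node = stack.pop()
        out.add(node.val)
        stack.extend(c for c in (node.left, node.right) if c)
    return out

def edges(root):
    """All parent-child edges, as unordered pairs of node values."""
    out, stack = set(), [root] if root else []
    while stack:
        node = stack.pop()
        for child in (node.left, node.right):
            if child:
                out.add(frozenset((node.val, child.val)))
                stack.append(child)
    return out

def is_part_of(a, b):
    """True if B contains every node value and every edge that A has."""
    return values(a) <= values(b) and edges(a) <= edges(b)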

Finding closest number in a range

I thought of a problem, which is as follows:
We have an array A of n integers and t test cases. In every test case we are given a number m and a range [s, e], i.e., we are given s and e and we have to find the number closest to m in that range of the array (A[s]..A[e]).
You may assume the array is indexed from 1 to n.
For example:
A = {5, 12, 9, 18, 19}
m = 13
s = 4 and e = 5
So the answer should be 18.
Constraints:
n<=10^5
t<=n
All I could come up with is an O(n) solution for every test case, and I think a better solution exists.
This is a rough sketch:
Create a segment tree from the data. At each node, besides the usual data like the left and right indices, you also store the numbers found in the subtree rooted at that node, in sorted order. You can achieve this when you construct the segment tree bottom-up. In the node just above a leaf, you store the two leaf values in sorted order. In an intermediate node, you keep the numbers of the left child and the right child, which you can combine using standard merging. There are O(n) nodes in the tree, and keeping this data takes O(n log n) overall.
Once you have this tree, for every query, walk down the tree to the appropriate node(s) for the given range [s, e]. As the tutorial shows, one or more different nodes combine to form the given range, and since the tree depth is O(log n), there are O(log n) such nodes and reaching them takes O(log n) time. For each node that lies completely inside the range, find the closest number using binary search in the sorted array stored at that node, which is another O(log n) per node. Take the closest among all of these, and that is the answer. Thus, you can answer each query in O(log^2 n) time.
The tutorial I link to contains other data structures, such as sparse table, which are easier to implement, and should give O(sqrt(n)) per query. But I haven't thought much about this.
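A minimal Python sketch of that segment tree with sorted lists (often called a merge sort tree); the class and method names are my own, and the array is 0-indexed here. query decomposes [s, e] into O(log n) canonical nodes and binary-searches each node's sorted list with bisect.

from bisect import bisect_left
from heapq import merge

class MergeSortTree:
    def __init__(self, a):
        self.n = len(a)
        self.lists = [None] * (4 * self.n)      # sorted values per segment-tree node
        self._build(1, 0, self.n - 1, a)

    def _build(self, node, lo, hi, a):
        if lo == hi:
            self.lists[node] = [a[lo]]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, a)
        self._build(2 * node + 1, mid + 1, hi, a)
        # standard merge of the two sorted children
        self.lists[node] = list(merge(self.lists[2 * node], self.lists[2 * node + 1]))

    def _closest_in(self, lst, m):
        i = bisect_left(lst, m)                 # binary search for m
        cands = [lst[j] for j in (i - 1, i) if 0 <= j < len(lst)]
        return min(cands, key=lambda x: abs(x - m))

    def query(self, s, e, m, node=1, lo=0, hi=None):
        """Closest value to m among a[s..e] (0-indexed, inclusive)."""
        if hi is None:
            hi = self.n - 1
        if e < lo or hi < s:
            return None                         # node is disjoint from the range
        if s <= lo and hi <= e:                 # node lies completely inside the range
            return self._closest_in(self.lists[node], m)
        mid = (lo + hi) // 2
        left = self.query(s, e, m, 2 * node, lo, mid)
        right = self.query(s, e, m, 2 * node + 1, mid + 1, hi)
        cands = [x for x in (left, right) if x is not None]
        return min(cands, key=lambda x: abs(x - m)) if cands else None

On the example above (A = {5, 12, 9, 18, 19}, m = 13, s = 4, e = 5), MergeSortTree([5, 12, 9, 18, 19]).query(3, 4, 13) returns 18, with the indices shifted to 0-based.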
Sort the array and do binary search. Complexity: O(n log n + t log n).
I'm fairly sure no faster solution exists. A slight variation of your problem is:
There is no array A, but each test case contains an unsorted array of numbers to search. (The array slice of A from s to e).
In that case, there is clearly no better way than a linear search for each test case.
Now, in what way is your original problem more specific than the variation above? The only added information is that all the slices come from the same array. I don't think that this additional constraint can be used for an algorithmic speedup.
EDIT: I stand corrected. The segment tree data structure should work.
