Missing number in binary search tree - algorithm

I have a balanced order-statistic binary search tree whose keys are n distinct integers, and I want to write a function find(x) that returns the smallest integer greater than x that is not in the tree, in O(log(n)) time.
For example, if the keys in the tree are 6,7,8,10,11,13,14 then find(6)=9, find(8)=9, find(10)=12, find(13)=15.
My idea is to find the max m in O(log(n)) and the index of x (call it i_x) in O(log(n)); then if i_x = n-1-(m-x), every integer from x to m is in the tree and I can simply return m+1.
By index I mean the 0-based position in sorted order: in 6,7,8,10,11,13,14 the index of 6 is 0 and the index of 10 is 3, for example.
But I'm having trouble with the other cases...

According to Wikipedia, an order statistic tree supports these two operations in O(log(n)) time:
Select(i): find the i'th smallest element stored in the tree.
Rank(x): find the rank of element x in the tree, i.e. its index in the sorted list of elements of the tree.
Start by getting the rank of x, then select the elements of successively higher rank until you find a gap where the missing element would be inserted. But this has worst-case n*log(n).
So instead, once you have the rank of x, you do a kind of binary search. The basic idea is to test whether there is a gap between two numbers x and y that are in the tree. There is a gap if rank(x) - rank(y) != x - y.
The general case: when searching for the number in the interval [lo,hi] (lo and hi are ranks in the tree, mid is the middle rank), if there is a gap between lo and mid then search inside [lo,mid], else search inside [mid,hi].
You will end up finding the number you seek.
However, this solution does not run in log(n) time, but in log^2(n). This is the best I can think of for a general solution.
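The binary search over ranks described above can be sketched as follows. This is only an illustration under assumptions: a sorted Python list stands in for the order statistic tree (indexing plays the role of Select and bisect_right the role of Rank), and the name find_next_missing is hypothetical.

```python
from bisect import bisect_right

def find_next_missing(keys, x):
    """Smallest integer > x that is not in keys (a sorted list of distinct ints)."""
    first = x + 1                      # smallest candidate answer
    start = bisect_right(keys, x)      # rank of the first key greater than x
    lo, hi = start, len(keys)
    # Invariant: keys[start:lo] is the gap-free run first, first+1, ...
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] - first == mid - start:
            lo = mid + 1               # no gap up to keys[mid]: look further right
        else:
            hi = mid                   # a gap lies at or before keys[mid]
    return first + (lo - start)

keys = [6, 7, 8, 10, 11, 13, 14]
print([find_next_missing(keys, x) for x in (6, 8, 10, 13)])  # [9, 9, 12, 15]
```

With a real order statistic tree each probe of keys[mid] is a Select call costing O(log(n)), which is where the log^2(n) total comes from.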
EDIT:
Well, it's a tough question, I changed my mind several times. Here is what I came up with:
I assume that the left child holds smaller values and the right child holds larger values.
Intuition of find(x): start at the root and go down the tree almost like in a standard binary search tree lookup. If the branch we would descend into cannot contain the solution of find(x), skip it.
We'll go through the basic cases first:
If the node I found is null, then I am done, and I return the value I was looking for.
If the current value is less than the one I am looking for, I search for x in the right subtree
If I found the node containing x, then I search for x+1 on the right subtree.
The case where x belongs in the left subtree is trickier, because the left subtree may contain x, x+1, x+2, x+3, etc. up to y-1, where y is the value stored in the current node. In this case, we want to search for y+1 in the right subtree.
However, if not all the numbers from x to y-1 are in the left subtree (that is, there is a gap), then the answer lies there, so we search the left subtree for x.
The question is: how do we find out whether the whole sequence from x to y-1 is present in the left subtree?
The algorithm in Python looks like this:
    def find(node, x):
        if node is None:
            # fell off the tree: x itself is not present
            return x
        if node.data < x:
            return find(node.right, x)
        if node.data == x:
            return find(node.right, x+1)
        # here node.data > x; the arguments of is_full are left open on purpose
        if is_full(...):
            return find(node.right, node.data+1)
        return find(node.left, x)
To get the smallest value strictly greater than x which is not in the tree, the first call is find(root, x+1). If you want the smallest value greater than or equal to x that is not in the tree, the first call is find(root, x).
The is_full method checks whether the left subtree contains all numbers from x to node.data-1.
Now, using this as a starting point, I believe you can find a suitable solution by yourself, using the fact that the number of nodes contained in each subtree is stored at the subtree's root.
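One possible way to use the stored subtree sizes is sketched below. It is not necessarily what the answer had in mind: instead of calling is_full at every node, it first counts the keys smaller than x+1, then does a single descent that compares each node's rank with its value. The Node class, the size field and the helpers build and count_less are assumptions made for this illustration.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
        # number of nodes in the subtree rooted here
        self.size = 1 + size(left) + size(right)

def size(node):
    return node.size if node else 0

def build(keys):
    """Build a balanced BST from a sorted list of distinct keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], build(keys[:mid]), build(keys[mid + 1:]))

def count_less(root, x):
    """Number of keys strictly smaller than x, in O(h)."""
    count, node = 0, root
    while node:
        if node.data < x:
            count += size(node.left) + 1
            node = node.right
        else:
            node = node.left
    return count

def find(root, x):
    """Smallest integer greater than x that is not a key, in O(h)."""
    first = x + 1
    base = count_less(root, first)   # rank the key `first` would have
    run = 0       # length of the gap-free run first, first+1, ... seen so far
    offset = 0    # number of keys smaller than everything in the current subtree
    node = root
    while node:
        rank = offset + size(node.left)
        if node.data >= first and node.data - first != rank - base:
            node = node.left         # a gap exists before node.data
        else:
            if node.data >= first:
                run = rank - base + 1   # first .. node.data are all present
            offset = rank + 1
            node = node.right
    return first + run

root = build([6, 7, 8, 10, 11, 13, 14])
print([find(root, x) for x in (6, 8, 10, 13)])  # [9, 9, 12, 15]
```

Both passes follow a single root-to-leaf path, so on a balanced tree the total work is O(log(n)).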

I faced a similar question.
There were no restrictions about finding greater than some x, simply find the missing element in the BST.
Below is my answer. It is perfectly possible to do so in O(lg(n)) time, assuming the tree is almost balanced. You might want to consider the proof that the expected height of a randomly built BST is O(lg(n)) for n elements. I use a simpler notation, O(h) where h is the height of the tree, so the two concerns are kept separate.
assumptions and/or requirements:
I augment the data structure: store the count (left_subtree + right_subtree + 1) at each node.
Obviously, count of a single node is 1
This count is pre-computed and stored at each node
Kindly pardon my multiple notations for not equal to (=/= and !=)
Also note that the code could be structured a little better if one were to write working code on a machine.
Moreover, I think, at this point in time, that this is correct. I tried as many corner cases as I could think of, and in general it works. Even if there is a counterexample, I don't think it will be that difficult to modify the code to fit that particular case; but please post the counterexample in a comment, I am interested.


Why is the number of sub-trees gained from a range tree query is O(log(n))?

I'm trying to figure out this data structure, but I don't understand how we can
tell that there are O(log(n)) subtrees that represent the answer to a query.
Here is a picture for illustration:
Thanks!
If we assume that the above is a purely functional binary tree, so one where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we do not have any bounds; then we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. Without any bounds we can thus return a tree in O(1), given that the tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
the cut version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node has a value that is smaller than or equal to the threshold, we return a cut version of the right subtree; and
if the node has a value greater than the threshold, we return a node that has the same right subtree, and as left child the cut version of the left child.
So in pseudo-code it looks like:
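    def treelarger(some_node, min):
        if some_node is None:
            return None
        if some_node.value > min:
            # keep this node and its whole right subtree; only the left side needs cutting
            return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
        else:
            # this node and its left subtree are too small
            return treelarger(some_node.right, min)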
This algorithm thus runs in O(h), with h the height of the tree, since in each case (except the first one) we recurse on one (not both) of the children, and it ends when we reach a node without children (or at least one that does not have a subtree in the direction we need to cut).
We thus do not make a complete copy of the tree. We reuse a lot of nodes of the old tree. We only construct a new "surface", but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct at most O(h) new nodes. We can optimize the above such that, if the cut version of one of the subtrees is identical to the original subtree, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:
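    def treesmaller(some_node, max):
        if some_node is None:
            return None
        if some_node.value < max:
            # keep this node and its whole left subtree; only the right side needs cutting
            return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
        else:
            # this node and its right subtree are too large
            return treesmaller(some_node.left, max)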
so roughly speaking there are two differences:
we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.
Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given that the original tree was a complete tree, we thus create O(log n) new nodes.
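A runnable version of the two cutting functions, composed into a range query, might look like the sketch below. The immutable Node type (a NamedTuple with fields in the order left, value, right), the parameter names lo and hi, and the helpers build, tree_between and inorder are assumptions made for this illustration.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    left: Optional["Node"]
    value: int
    right: Optional["Node"]

def treelarger(node, lo):
    """Tree of all values > lo; shares untouched subtrees with the input."""
    if node is None:
        return None
    if node.value > lo:
        return Node(treelarger(node.left, lo), node.value, node.right)
    return treelarger(node.right, lo)

def treesmaller(node, hi):
    """Tree of all values < hi; shares untouched subtrees with the input."""
    if node is None:
        return None
    if node.value < hi:
        return Node(node.left, node.value, treesmaller(node.right, hi))
    return treesmaller(node.left, hi)

def tree_between(node, lo, hi):
    """Tree of all values strictly between lo and hi."""
    return treesmaller(treelarger(node, lo), hi)

def build(values):
    """Balanced tree from a sorted list."""
    if not values:
        return None
    mid = len(values) // 2
    return Node(build(values[:mid]), values[mid], build(values[mid + 1:]))

def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

tree = build(list(range(1, 16)))
print(inorder(tree_between(tree, 4, 11)))     # [5, 6, 7, 8, 9, 10]
print(treelarger(tree, 0).right is tree.right)  # True: the subtree is reused, not copied
```

The identity check on the last line shows the sharing: only the nodes along the cut path are newly allocated.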
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range lies purely in the range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
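The counting argument can be checked on a simplified model. The sketch below is an assumption-laden illustration, not a range tree implementation: it treats the x-tree as a balanced tree over leaf positions 0..n-1, where each node covers a half-open interval [node_lo, node_hi), and it lists the maximal nodes fully covered by a query [lo, hi). Those are the nodes whose y-subtrees would be searched. The function name canonical_ranges is hypothetical.

```python
def canonical_ranges(lo, hi, node_lo, node_hi):
    """Intervals of the maximal tree nodes that lie entirely inside [lo, hi)."""
    if hi <= node_lo or node_hi <= lo:
        return []                          # node is disjoint from the query
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]        # query subsumes the node: take it whole
    mid = (node_lo + node_hi) // 2
    return (canonical_ranges(lo, hi, node_lo, mid)
            + canonical_ranges(lo, hi, mid, node_hi))

# 16 leaves, query covering positions 1..14
print(canonical_ranges(1, 15, 0, 16))
# [(1, 2), (2, 4), (4, 8), (8, 12), (12, 14), (14, 15)]
```

Here the query is answered by 6 subtrees, three hanging off the left search path and three off the right one, which stays within the 2 * log2(16) = 8 bound.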
Hope this helps!

Create a binary search tree with a better complexity

You are given a number which is the root of a binary search tree. Then you are given an array of N elements which you have to insert into the binary search tree. The time complexity is N^2 if the array is in the sorted order. I need to get the same tree structure in a much better complexity (say NlogN). I tried it a lot but wasn't able to solve it. Can somebody help?
I assume that all numbers are distinct (if it's not the case, you can use a pair (number, index) instead).
Let's assume that we want to insert an element X. If it's the smallest or the largest element so far, it's clear where it goes.
Let a = max y: y in tree and y < X and b = min y: y in tree and y > X. I claim that:
One of them is an ancestor of the other.
Either a doesn't have the right child or b doesn't have the left child.
Proof:
Suppose it is not the case. Let l = lca(a, b). As a is in its left subtree and b is in its right subtree, a < l < b. Contradiction, since a and b are adjacent in sorted order.
Let a be an ancestor of b. If b has a left child c, then a < c < b. Contradiction (the other case is handled similarly).
So the solution goes like this:
Let's keep a set of the elements that are already in the tree (I mean an efficient set with a lower_bound operation, like std::set in C++ or TreeSet in Java).
Let's find a and b as described above upon every insertion (in O(log N) time using the set's lower_bound operation). Exactly one of them doesn't have the appropriate child. That's where the new element goes.
The total time complexity is clearly O(N log N).
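A sketch of this procedure in Python follows. Python's standard library has no balanced-tree set, so a sorted list with bisect stands in for std::set here; the lookup is O(log N) but list.insert is O(N), so this only illustrates the logic, and the stated O(N log N) bound needs a real ordered set. The function name build_bst and the representation of the tree as two child dictionaries are assumptions.

```python
from bisect import bisect_left

def build_bst(root, values):
    """Return (left, right) child maps of the BST obtained by inserting
    the distinct values in order into a tree that starts with `root`."""
    left, right = {root: None}, {root: None}
    keys = [root]                    # sorted keys already in the tree
    for x in values:
        i = bisect_left(keys, x)
        a = keys[i - 1] if i > 0 else None          # predecessor of x
        b = keys[i] if i < len(keys) else None      # successor of x
        if a is not None and right[a] is None:
            right[a] = x             # x becomes the right child of a
        else:
            left[b] = x              # otherwise b has no left child
        left[x] = right[x] = None
        keys.insert(i, x)
    return left, right

left, right = build_bst(5, [3, 8, 1, 4, 7, 9, 2, 6])
print(left[5], right[5], left[8], right[1])  # 3 8 7 2
```

The else branch relies on the claim above: if a is missing or already has a right child, then b exists and has no left child.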
If you look up a word in a dictionary, you open the dictionary about halfway and look at the page. That then tells you if the search word is in the first or second half of the dictionary. Repeat, eliminating half the remaining words on each pass, and you soon narrow it down to a single word. 4 billion word dictionaries will take about 32 passes.
A binary search tree uses the same principle. Except as well as looking up, you can also insert. Insertion is O(log N), unless the tree becomes degenerate.
To prevent the tree going degenerate, you use a system of "red" and "black" nodes (the colours are just conventional), and you don't allow long runs of
either colour. The full explanation is in my book, Basic Algorithms
http://www.lulu.com/spotlight/bgy1mm
An implementation is here
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.c
https://github.com/MalcolmMcLean/babyxrc/blob/master/src/rbtree.h
But you will need some explanation if you want to learn about red black
trees from it.

RB tree with sum

I have some questions about augmenting data structures:
Let S = {k1, . . . , kn} be a set of numbers. Design an efficient data structure for S that supports the following two operations:
Insert(S, k), which inserts the number k into S (you can assume that k is not contained in S yet), and
TotalGreater(S, a), which returns the sum of all keys ki ∈ S which are larger than a, that is, $\sum_{k_i \in S,\, k_i > a} k_i$.
Argue the running time of both operations and give pseudo-code for TotalGreater(S, a) (do not give pseudo-code for Insert(S, k)).
I don't understand how to do this, I was thinking of adding an extra field to the RB-tree called sum, but then it doesn't work because sometimes I need only the sum of the left nodes and sometimes I need the sum of the right nodes too.
So I was thinking of adding 2 fields called leftSum and rightSum and if the current node is > GivenValue then add the cached value of the sum of the sub nodes to the current sum value.
Can someone please help me with this?
You can just add a variable size to each node, which is the number of nodes in the subtree rooted at that node. When searching for the node with the smallest value that is larger than a, two things can happen on the path to that node: you can go left or right. Every time you go left, you add the size of the right child + 1 to the running total. Every time you go right, you do nothing.
There are two conditions for termination: 1) we find a node containing the exact value a, in which case we add the size of its right child to the total; 2) we reach a leaf, in which case we add 1 if it is larger than a, or nothing if it is smaller.
This counts the keys larger than a. To get their sum, as TotalGreater requires, store the sum of the keys of each subtree instead of its size, and add the node's own key instead of 1.
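A minimal sketch of that walk, storing subtree sums so that it returns the total asked for in the question, is shown here. It uses a plain unbalanced BST for brevity, so the O(log n) bound only holds once the same field is maintained in a red-black tree (rotations must then update it too). The names Node, total, insert and total_greater are assumptions.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.total = key          # sum of all keys in the subtree rooted here

def insert(node, key):
    """Plain BST insert that keeps the subtree sums up to date."""
    if node is None:
        return Node(key)
    node.total += key
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def total_greater(node, a):
    """Sum of all keys strictly larger than a, following one root-to-leaf path."""
    result = 0
    while node is not None:
        if node.key > a:
            # this key and everything to its right are larger than a
            result += node.key + (node.right.total if node.right else 0)
            node = node.left
        else:
            node = node.right
    return result

root = None
for k in [8, 3, 10, 1, 6, 14, 4, 7, 13]:
    root = insert(root, k)
print(total_greater(root, 6))   # 52 = 7 + 8 + 10 + 13 + 14
```

A single total field is enough: the walk only ever needs the sum of a right subtree, which is read directly from the right child.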
As Jordi describes: The key-word could be augmented red-black tree.

What is the best way to implement this type of binary tree?

Let's say I have a binary tree of height h that has elements x1, x2, ... xn.
Element xi is initially at the ith leftmost leaf. The tree should support the following methods in O(h) time:
add(i, j, k) where 1 <= i <= j <= n. This operation adds the value k to the values of all leaves between the ith and jth leftmost leaves. For example, add(2,5,3) increments the 2nd through 5th leftmost leaves by 3.
get(i): return the value of the ith leftmost leaf.
What should be stored at the internal nodes to support these operations?
Note: I am not looking for an exact answer but any hint on how to approach the problem would be great.
As I understand the question, the position of the element xi never changes, and the tree isn't a search tree; the search is based solely on the position of the node.
You can store an offset in the non-leaf vertices, indicating the value change of their descendants.
add(i,j,k) will start from the root and descend into the tree, increasing a node's offset by k if and only if all its descendants are within the range [i,j]. If it has increased the offset, it does not descend further.
Note1: In a single add() operation, you might need to update more than one node.
Note2: You actually need to update at most O(logn) = O(h) nodes [convince yourself why. Hint: the binary representation of a number up to n requires O(logn) bits], which later gives you [again, make sure you understand why] the needed O(h) complexity.
get(i) is then trivial: sum the values from the root to the i'th leaf, and return this sum.
Since it seems to be homework, I will not post pseudo-code; these guidelines should get you started with this assignment.
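For readers who want to check their understanding afterwards, the offset idea can be sketched as below. It assumes a complete binary tree stored implicitly in an array (node 1 is the root, node v has children 2v and 2v+1), with leaves numbered from 1 as in the question; the class name OffsetTree and its layout are choices made for this illustration.

```python
class OffsetTree:
    def __init__(self, values):
        self.size = 1
        while self.size < len(values):
            self.size *= 2                    # number of leaves, padded to a power of two
        self.offset = [0] * (2 * self.size)
        for i, v in enumerate(values):
            self.offset[self.size + i] = v    # a leaf's offset starts as its own value

    def add(self, i, j, k):
        """Add k to the i-th through j-th leftmost leaves (1-based, inclusive)."""
        self._add(1, 1, self.size, i, j, k)

    def _add(self, node, lo, hi, i, j, k):
        if j < lo or hi < i:
            return                            # no leaf of this node is in range
        if i <= lo and hi <= j:
            self.offset[node] += k            # all leaves below are in range: stop here
            return
        mid = (lo + hi) // 2
        self._add(2 * node, lo, mid, i, j, k)
        self._add(2 * node + 1, mid + 1, hi, i, j, k)

    def get(self, i):
        """Value of the i-th leftmost leaf: sum of offsets on its path to the root."""
        node = self.size + i - 1
        total = 0
        while node >= 1:
            total += self.offset[node]
            node //= 2
        return total

t = OffsetTree([1, 2, 3, 4, 5, 6])
t.add(2, 5, 3)
print([t.get(i) for i in range(1, 7)])   # [1, 5, 6, 7, 8, 6]
```

In this example add(2,5,3) touches three offsets: the leaf for position 2, the internal node covering positions 3..4, and the leaf for position 5.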

Shortest branch in a binary tree?

A binary tree can be encoded using two functions l and r
such that for a node n, l(n) gives the left child of n and r(n)
gives the right child of n.
A branch of a tree is a path from the root to a leaf, the
length of a branch to a particular leaf is the number of
arcs on the path from the root to that leaf.
Let MinBranch(l,r,x) be a simple recursive algorithm for
taking a binary tree encoded by the l and r functions
together with the root node x for the binary tree and
returns the length of the shortest branch of the binary
tree.
Give the pseudocode for this algorithm.
OK, so basically this is what I've come up with so far:
MinBranch(l, r, x)
{
if x is None return 0
left_one = MinBranch(l, r, l(x))
right_one = MinBranch(l, r, r(x))
return {min (left_one),(right_one)}
}
Obviously this isn't great or perfect. I'd be grateful if
people can help me get this perfect and working - any help
will be appreciated.
I doubt anyone will solve homework for you straight-up. A clue: the return value must surely grow higher as the tree gets bigger, right? However I don't see any numeric literals in your function except 0, and no addition operators either. How will you ever return larger numbers?
Another angle on the same issue: any time you write a recursive function, it helps to enumerate "what are all the conditions where I should stop calling myself? What do I return in each circumstance?"
You're on the right approach, but you're not quite there; your recursive algorithm will always return 0. (the logic is almost right, though...)
Note that the length of the sub-branches is one less than the length of the branch; so left_one and right_one should be 1 + MinBranch....
Stepping through the algorithm with some sample trees will help uncover off-by-one errors like this one...
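One way the recursion could end up after those fixes is sketched below, assuming l and r return None for a missing child (the question does not say how an absent child is encoded). It treats a node with exactly one child separately, because a missing child is not a leaf and must not count as a branch of length 0. The name min_branch is hypothetical.

```python
def min_branch(l, r, x):
    """Length (in arcs) of the shortest root-to-leaf path of the tree rooted at x."""
    left, right = l(x), r(x)
    if left is None and right is None:
        return 0                               # x is a leaf
    if left is None:
        return 1 + min_branch(l, r, right)     # only a right subtree exists
    if right is None:
        return 1 + min_branch(l, r, left)      # only a left subtree exists
    return 1 + min(min_branch(l, r, left), min_branch(l, r, right))

# Tree: 4 has children 3 and 5; 3 has left child 2; 2 has left child 1.
l = {4: 3, 3: 2, 2: 1}.get
r = {4: 5}.get
print(min_branch(l, r, 4))   # 1 (the branch 4 -> 5)
```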
It looks like you almost have it, but consider this example:
  4
 3 5
When you trace through MinBranch, you'll see that in your
MinBranch(l,r,4) call:
left_one = MinBranch(l, r, l(x))
= MinBranch(l, r, l(4))
= MinBranch(l, r, 3)
= 0
That makes sense, after all, 3 is a leaf node, so of course the distance
to the closest leaf node is 0. The same happens for right_one.
But you then wind up here:
return {min (left_one),(right_one)}
= {min (0), (0) }
= 0
but that's clearly wrong, because this node (4) is not a leaf node. Your
code forgot to count the current node (oops!). I'm sure you can manage
to fix that.
Now, actually, the way you're doing this isn't the fastest, but I'm not
sure if that's relevant for this exercise. Consider this tree:
      4
    3   5
  2
1
Your algorithm will count up the left branch recursively, even though it
could, hypothetically, bail out if you first counted the right branch
and noted that 3 has a left, so it's clearly longer than 5 (which is a
leaf). But, of course, counting the right branch first doesn't always
work!
Instead, with more complicated code, and probably a tradeoff of greater
memory usage, you can check nodes left-to-right, top-to-bottom (just
like English reading order) and stop at the first leaf you find.
What you've created can be thought of as a depth-first search. However, given what you're after (shortest branch), this may not be the most efficient approach. Think about how your algorithm would perform on a tree that was very heavy on the left side (of the root node), but had only one node on the right side.
Hint: consider a breadth-first search approach.
What you have there looks like a depth-first search algorithm, which will have to search the entire tree before you come up with a solution. What you need is a breadth-first search algorithm, which can return as soon as it finds the solution without doing a complete search.
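The breadth-first variant suggested here can be sketched as follows, again assuming that l and r return None for a missing child and that the root itself is not None. The function name min_branch_bfs is hypothetical.

```python
from collections import deque

def min_branch_bfs(l, r, root):
    """Shortest root-to-leaf length, visiting nodes level by level."""
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        children = [c for c in (l(node), r(node)) if c is not None]
        if not children:
            return depth          # first leaf reached is the shallowest one
        for child in children:
            queue.append((child, depth + 1))

# Left-heavy tree: 4 -> (3, 5), 3 -> 2, 2 -> 1
l = {4: 3, 3: 2, 2: 1}.get
r = {4: 5}.get
print(min_branch_bfs(l, r, 4))   # 1
```

On this left-heavy tree the search stops at node 5 and never looks at nodes 2 and 1. The price is the queue, which can hold a whole level of the tree at once.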
