Duplicates in Binary Search Tree? - data-structures

I am currently working on an Assignment in Data Structure topic Binary Trees. On of the question is: a) Draw a binary search tree by inserting the below numbers from left to right.
[15,7,8,20,13,10,5,17,40, 60, 30,70,6,14,4]. I drew my graph already. Now the confusing part is inserting a new node 7 and 17 into the tree.
I did a few research and found out that you cannot have duplicates in a Binary Tree. Or is there another way?

A Binary Search Tree(BST) can store only distinct values and are ordered.
But, a Binary Tree's values are not said to be ordered and distinct. So, there can be a possibility of finding duplicates.
While dealing with BST, a hash table can be used to store duplicate node values.

There're several strategies to handle duplicates (equal keys) in binary search tree, as propose in CLRS chapter 12 Problem 12-1:
-Keep a boolean flag x.b at node x, and set x to either x.left or x.right based
on the value of x.b, which alternates between False and True each time we
visit x while inserting a node with the same key as x.
-Keep a list of nodes with equal keys at x, and insert z into the list.
-Randomly set x to either x.left or x.right.
Another strategy is to store a counter to every tree node, and increment this counter for each duplicate. You can find code implementation for this strategy here:
https://www.geeksforgeeks.org/how-to-handle-duplicates-in-binary-search-tree/

Related

when exactly should root split in a B Tree

I learned B trees recently and from what I understand a node can have minimum t-1 keys and maximum 2t-1 keys given minimum degree t. Exception being root can have even 1 key.
Here is the example from CLRS 3rd edition Fig 18.7 (Page 498) where t=3
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In the d) example when L is inserted why is the root splitted when it doesn't violate the B tree properties at the moment (It has 5 keys which is maximum allowed).
Why isn't inserting L into [J K L] without splitting [G M P T X] considered.
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node 𝑦 (having 2𝑡 − 1 keys) around its median key 𝑦:key𝑡 into two nodes having only 𝑡 − 1 keys each. The median key moves up into 𝑦’s parent to identify the dividing point between the two new trees. But if 𝑦’s parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node 𝑦, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid to have to split nodes while backtracking out of recursion.
This algorithm is further described on page 495 with pseudo code.
This explains why at the insertion of L the root node is split immediately before any recursive call is made.
Alternative algorithms would not do this, and would delay the split up to the point when it is inevitable.

Search for node in unorganized binary tree?

This is a conceptual question. I have a tree where the data is stored with strings but not stored alphabetically. How do I search through the entire tree to find the node with string I'm looking for. So far I can only search through one side of the tree.
Here are the thing you can:
1. traverse the tree in any manner, say `DFS` or `BFS`
2. while travering nodes, keep checking the the current node is equivalent to the key string you are searching for.
2.1. compare each character of your search string with each character of current node's value.
2.2. if match found, process your result.
2.3. if not, continue with point 2.
3. if all the nodes exhausted, you don't have any match. stop the algorithm.
The complexity of above mentioned algorithm will be:
O(N)* O(M) => O(NM)
N - nodes of your tree.
M - length of your node's value + length of your search key's value.
You may iterate throught all tree levels and on each of level check all nodes. Depth of the tree is equivalent to numbers of itetations.
You may recursively go down to each branches and stop all itetations when node is found (by using external variable or flag) or if there is no child nodes.

Missing number in binary search tree

If I have order statistic binary balanced tree that has n different integers as its keys and I want to write function find(x) that returns the minimal integer that is not in the tree, and is greater than x. in O(log(n)) time.
For example, if the keys in the tree are 6,7,8,10,11,13,14 then find(6)=9, find(8)=9, find(10)=12, find(13)=15.
I think about finding the max in O(log(n)) and the index of x (mark i_x) in O(log(n)) then if i_x=n-(m-x) then I can simply return max+1.
By index I mean in 6,7,8,10,11,13,14 that index of 6 is 0 and index of 10 is 3 for example...
But I'm having trouble with the other cases...
According to wikipedia, an order statistic tree supports those two operations in log(n) time:
Select(i) — find the i'th smallest element stored in the tree in O(log(n))
Rank(x) – find the rank of element x in the tree, i.e. its index in the sorted list of elements of the tree in O(log(n))
Start by getting the rank of x, and select the superior ranks of x until you find a place to insert your missing element. But this has worst-case n*log(n).
So instead, once you have the rank of x, you do a kind of binary search. The basic idea is whether there is a space between number x and y which are in the tree. There is a space if rank(x) - rank(y) != x - y.
General case is: when searching for the number in the interval [lo,hi] (lo and hi are ranks in the tree, mid is the middle rank), if there is a space between lo and mid then search inside [lo,mid], else search inside [mid, hi].
You will end up finding the number you seek.
However, this solution does not run in log(n) time, but in log^2(n). This is the best I can think of for a general solution.
EDIT:
Well, it's a tough question, I changed my mind several times. Here is what I came up with:
I assume that the left node holds inferior value and the right node holds superior value
Intuition of find(x): Start at the root and go down the tree almost like in a standard binary tree. If the branch we want to go does not contain the solution of find(x) then cut it.
We'll go through the basic cases first:
If the node I found is null, then I am done, and I return the value I was looking for.
If the current value is less than the one I am looking for, I search for x in the right subtree
If I found the node containing x, then I search for x+1 on the right subtree.
The case where x is in the left subtree is more tricky, because it may contain x, x+1, x+2, x+3, etc up to y-1 where y is the value stored in the current node. In this case, we want to search for y+1 in the right subtree.
However, if all the numbers from x to y are not in the left subtree (that is, there is a gap), then we will find a value in it, so we look into the left subtree for x.
Question is: How to find if the sequence from x to y is present in the subtree ?
The algorithm in python looks like this:
def find(node, x):
if node == null:
return x
if node.data < x:
return find(node.right, x)
if node.data == x:
return find(node.right, x+1)
if is_full(...):
return find(node.right, node.data+1)
return find(node.left, x)
To get the smallest value strictly greater than x which is not in the tree, the first call is find(root, x+1). If you want the smallest value greater than or equals to x that is not in the tree, the first call is find(root, x).
The is_full method checks if the left subtree contains all number from x to node.data-1.
Now, using this as a starting point, I believe you can find a suitable solution by yourself, using the fact that the number of nodes contained in each subtree is stored at the subtree's root.
I faced a similar question.
There were no restrictions about finding greater than some x, simply find the missing element in the BST.
Below is my answer, it is perfectly possible to do so in O(lg(n)) time, with the assumption that, tree is almost balanced. You might want to consider the proof that expected height of the randomly built BST is lg(n) given n elements. I use a simpler notation, O(h) where h = height of the tree, so two things are now separate.
assumptions and/or requirements:
I enhance the data structure. store the count of (left_subtree + right_subtree + 1) at each node.
Obviously, count of a single node is 1
This count is pre-computed and stored at each node
Kindly pardon my multiple notations for not equal to (=/= and !=)
Also note that code might be structured in little better way if one is to write a working code on a machine.
Moreover, I think, at this point in time, that this is correct. I tried as many corner cases as I could think of, and in general it works. Even if there is a counter example, I don;t think it will be that difficult to modify the code to fit that particular case; but please comment the counter example, I am interested.

Binary Search Tree - Sorted?

I don't understand how binary search trees are always defined as "sorted". I get in an array representation of a binary heap you have a fully sorted array. I haven't seen array representations of binary search trees so hard for me to see them as sorted like an array eg [0,1,2,3,4,5] but rather sorted with respect to each node. What is the right way to think about a BST being "sorted" conceptually?
There are many types of binary search trees. All of them have one thing in common: they satisfy an invariant which enables binary search, namely an order relation by which every element in the tree can be compared to any other element in the tree, in a total preorder.
What does that mean?
Let's consider the typical statement of a BST invariant in a textbook, which states that every node's key is greater than all keys in its left sub-tree, and less than all keys in its right sub-tree. We omit conflict resolution details for keys which compare equal.
What does that BST look like? Here's an example:
The way I would explain it to a class of three-year-olds, is try to collapse all the nodes to the bottom level of the leaves, just let them fall down. Or, for high-schoolers, draw a line from each node/key projecting them on the x-axis. Once you did that, it's obvious the keys are already in (ascending) order.
Is this imaginary and casual observation analogous to our definition of a sorted sequence? Yes, it is. Since the elements of the BST satisfy a total preorder, an in-order traversal of the BST must produce those elements in order (Ex: prove it).
It is equivalent to state that if we had stored a BST's keys, by an in-order traversal, in an array, the array would be sorted.
Therefore, by way of our initial definition of a BST, the in-order traversal is the intuitive way of thinking of one as "sorted".
Does this help? It's a binary heap shown as an array
as far as data structures are concerned (arrays, trees, linked lists, etc), "sorted" means that sequentially going through all it's elements you'll find that their values are ordered according to some rule ( >, <, <=, etc).
For arrays, this is easy to picture because it's a linear data structure.
But trees are not, however, iterating through a BST you will notice that all the element are ordered accoring to the rule left value <= node value < right value ( or something similar); the very definition of a sorted data structure.
It is not "sorted" in the same sense an array might be sorted (and trees, except for heaps, are rarely represented as arrays anyway), but they have a structure that allows you to easily traverse the elements in sorted order: simply traverse the nodes of the BST with a depth-first search, and output each node's value after you've looked at its left child (if any) but before you look at its right child (if any).
By the way, the array in which a heap is stored is almost always not sorted. The heap itself can also not be said to be "sorted", because it does not have enough structure to be able to readily produce the elements in sorted order without destroying the heap by successively removing the top element. For example, although you do know that the top element is larger than both of its children (or smaller, depending on the heap type), you cannot tell in advance which child is smaller than the other.

Special Augmented Red-Black Tree

I'm looking for some help on a specific augmented Red Black Binary Tree. My goal is to make every single operation run in O(log(n)) in the worst case. The nodes of the tree will have an integer as there key. This integer can not be negative, and the tree should be sorted by a simple compare function off of this integer. Additionally, each node will also store another value: its power. (Note that this has nothing to do with mathematical exponents). Power is a floating point value. Both power and key are always non-negative. The tree must be able to provide these operations in O(log(n)) runtime.:
insert(key, power): Insert into the tree. The node in the tree should also store the power, and any other variables needed to augment the tree in such a way that all other operations are also O(log(n)). You can assume that there is no node in the tree which already has the same key.
get(key): Return the power of the node identified by the key.
delete(key): Delete the node with key (assume that the key does exist in the tree prior to the delete.
update(key,power): Update the power at the node given by key.
Here is where it gets interesting:
highestPower(key1, key2): Return the maximum power of all nodes with key k in the range key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
powerSum(key1, key2): Return the sum of the powers of all nodes with key k in the ragne key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
The main thing I would like to know is what extra variables should I store at each node. Then I need to work out how to use each one of these in each of the above functions so that the tree stays balanced and all operations can run in O(log(n)) My original thought was to store the following:
highestPowerLeft: The highest power of all child nodes to the right of this node.
highestPowerRight: The highest power of all child nodes to the right of this node.
powerSumLeft: The sum of the powers of all child nodes to the left of this node.
powerSumRight: The sum of the powers of all child nodes to the right of this node.
Would just this extra information work? If so, I'm not sure how to deal with it in the functions that are required. Frankly my knowledge of Red Black Tree's isn't great because I feel like every explanation of them gets convoluted really fast, and all the rotations and things confuse the hell out of me. Thanks to anyone willing to attempt helping here, I know what I'm asking is far from simple.
A very interesting problem! For the sum, your proposed method should work (it should be enough to only store the sum of the powers to the left of the current node, though; this technique is called prefix sum). For the max, it doesn't work, since if both max values are equal, that value is outside of your interval, so you have no idea what the max value in your interval is. My only idea is to use a segment tree (in which the leaves are the nodes of your red-black tree), which lets you answer the question "what is the maximal value within the given range?" in logarithmic time, and also lets you update individual values in logarithmic time. However, since you need to insert new values into it, you need to keep it balanced as well.

Resources