Generating a Huffman code that never produces the string "00" upon encoding - algorithm

I am trying to use Huffman coding to create an optimal coding for a set of symbols. However, a constraint is placed on the encodings such that no encoding contains the string, "00".
For example, the encoding 'A' = '0' and 'B' = '10' would not satisfy the constraint because the string 'BA' encodes to '100', which contains the "00" substring.
This means that the code words also cannot contain the string "00". For example, the encoding 'A' = '1', 'B' = '00', and 'C' = '01' would not satisfy the constraint because encoding 'B' would always result in '00' appearing in the encoding.
I have tried modifying the Huffman coding algorithm found on Wikipedia:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   a. Remove the two nodes of highest priority (lowest probability) from the queue.
   b. If both nodes are not leaf nodes, instead select the highest-priority node and the highest-priority leaf node. This ensures that at least one of the selected nodes is a leaf node.
   c. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   d. If one node is not a leaf node, make that node the right child of the new internal node (making it a '1' when encoding). This avoids creating the "00" substring.
   e. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
4. Add a '1' to the beginning of all codes to avoid the "00" substring when two adjacent symbols are encoded.
There is also the case where the only two nodes left in the queue are both non-leaf nodes. I am not sure how to solve this problem. Otherwise, I believe this creates a coding that satisfies the constraint, but I am unsure if it is optimal, and I would like to be certain that it is.

I think I'd start with the rule that any "0" in a code must be followed by a "1". That satisfies the constraint that codes are not allowed to contain "00". It also avoids the problem of a "00" substring being produced when two adjacent symbols are encoded.
The resulting code tree is shown below, where:
- the nodes in the red shaded areas are codes that contain "00"
- the nodes containing a red X are codes that end with a "0"
- the green nodes are the available valid codes
Note that because a Huffman code is a prefix-free code, choosing one of the valid codes eliminates all of the descendants of that node. For example, choosing to use the code "01" eliminates all of the other nodes on the left side of the tree. To put it another way, choosing "01" makes "01" a leaf, and breaks the two connections below "01".
Also note that the left child of an interior node will have a longer code than the right child, so the child with lower probability must be connected on the left. That's certainly necessary. It's left as an exercise to prove that it's sufficient. (If not sufficient, then the exercise is to come up with the optimal assignment algorithm.)
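For concreteness, here's a small Python sketch (my own, not part of the answer) that enumerates the "green" codewords described above: binary strings in which every 0 is immediately followed by a 1. Such codewords contain no "00" and never end in a 0, so no concatenation of them can ever produce "00".

def valid_codewords(max_len):
    """All nonempty codewords up to max_len in which every 0 is followed by a 1."""
    out = []
    def extend(prefix):
        if prefix:
            out.append(prefix)
        if len(prefix) + 1 <= max_len:
            extend(prefix + "1")    # appending a 1 is always safe
        if len(prefix) + 2 <= max_len:
            extend(prefix + "01")   # a 0 may only appear as part of the pair "01"
    extend("")
    return out

print(valid_codewords(3))  # ['1', '11', '111', '101', '01', '011']

Assigning symbols to these codewords (shortest codes to the most probable symbols, respecting the prefix-free pruning described above) is then the remaining optimization problem.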

The easiest way is to not mess with the Huffman code at all. Instead, post-process the output. When encoding, take your coded bit stream and whenever there is a 0, insert a 1 after it. On the decoding end, whenever you see a 0, remove the next bit (which will be the 1 that was inserted). Then do the normal Huffman decoding.
This is not optimal, but the departure from optimality is bounded. You can reduce the impact of the bit stuffing by swapping the branches at every node, as needed, to put the lower probabilities or weights on the 0 sides.
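Here's a rough sketch of that post-processing in Python, operating on bit strings for readability (a real implementation would work on a packed bit stream):

def stuff(bits):
    # after Huffman encoding: insert a 1 after every 0
    return bits.replace("0", "01")

def unstuff(bits):
    # before Huffman decoding: drop the bit that follows every 0
    out = []
    i = 0
    while i < len(bits):
        out.append(bits[i])
        i += 2 if bits[i] == "0" else 1   # skip the stuffed 1 after a 0
    return "".join(out)

coded = "100110"              # some Huffman output
sent = stuff(coded)           # '101011101' -- never contains "00"
assert "00" not in sent and unstuff(sent) == coded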

How to find longest accepted word by automata?

I need to write code in Java that will find the longest word that a DFA accepts. Firstly, if there is a transition to one of the previous states (or a self-transition) on a path that leads to a final state, that means there are infinitely many words and the longest one doesn't exist (that means a Kleene star is applied to some word). I was thinking of forming a queue by BFS, where each level is separated by null, so that when I'm iterating through the queue and come across null, the length of the word would be increased by one, but it would be hard to track the set of previous states, so I'm kind of idea-less. If you can't code in Java I would appreciate pseudocode or an algorithm.
I don't think this is strictly necessary, but it would not hurt the performance too terribly much in practice and might be sufficient for your needs. I would suggest, as a first pass, minimizing the DFA. This can be done in O(n log n) in terms of the number of states, using e.g. Hopcroft's algorithm. This is probably conceptually similar to what Christian Sloper suggests in the comments regarding reversing the transitions to find unproductive states; indeed, there is a minimization algorithm that does this as well, but you might be able to get away with just removing unproductive states and not minimizing here (though minimizing does make the reasoning a little easier).
Doing that is nice because it will remove all unproductive loops and combine them into a single dead state, if indeed there are any unproductive prefixes. It is easy to find the one dead state, if there is one, and remove it from the directed graph formed by the DFA's states and transitions. To do this, do either DFS or BFS, and for each state you come to, check whether (1) all of its transitions are self-loops and (2) it is not accepting.
With the one dead state removed (if any) any loops or cycles we detect in the remaining directed graph imply there are infinitely many strings in the language, since by definition any remaining states have a path to acceptance. If we find a loop or cycle, we know the language is infinite, and can respond accordingly.
If there are no loops or cycles remaining after removing the dead state from the minimal DFA, what remains is a tree rooted at the start state and whose leaves are accepting states (think about this for a moment and you will see it must be true). Therefore, the length of the longest string accepted is the length (in edges) of the longest path from the root to a leaf; so basically the height of the tree or something close to it (depending on how you define depth/height, whether edges or nodes). You can take any old algorithm for finding the depth and modify it so that in addition to returning the depth, it returns the string corresponding to the deepest subtree, so you can get the string without having to go back through the tree. Something like this:
GetLongestStringInTree(root)
    if root is null return ""
    result = ""
    maxlen = 0
    for each transition out of root
        child = transition.target
        symbol = transition.symbol
        str = symbol + GetLongestStringInTree(child)
        if str.length > maxlen then
            maxlen = str.length
            result = str
    return result
This could be pretty easily modified to find all words of maximum length by adding str to a collection if its length is equal to the max length so far, and emptying that collection when a new longer string is found, and returning the collection (and using the length of the first thing in the collection for checking). That can be left as an exercise; as written, this will just find some arbitrary longest string accepted by the DFA.
This problem becomes a lot simpler if you split it in two. (Sorry, no Java.)
Step 1: Determine if there is a loop.
If there is a loop, there exists an infinitely long input. Detecting a loop in a directed graph can be done with DFS.
Step 2 (no loop): You now have a directed acyclic graph (DAG) and you can find the longest path using this algorithm: Longest path in Directed acyclic graph
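For illustration, here's a Python sketch that combines both steps in one DFS, under assumed conventions (states are integers, dfa[s] is a list of (symbol, target) pairs, unproductive states have already been removed, and accepting is the set of final states):

def longest_word(dfa, start, accepting):
    # Longest accepted word; raises if a cycle is reachable (infinite language).
    ON_STACK, DONE = 1, 2
    color, best = {}, {}

    def dfs(s):
        if color.get(s) == ON_STACK:
            raise ValueError("cycle reachable: longest word does not exist")
        if color.get(s) == DONE:
            return best[s]
        color[s] = ON_STACK
        longest = "" if s in accepting else None
        for symbol, target in dfa.get(s, []):
            suffix = dfs(target)
            if suffix is not None:
                cand = symbol + suffix
                if longest is None or len(cand) > len(longest):
                    longest = cand
        color[s] = DONE
        best[s] = longest
        return longest

    return dfs(start)

# tiny example: the DFA accepts "a", "ab" and "abb"
dfa = {0: [("a", 1)], 1: [("b", 2)], 2: [("b", 3)], 3: []}
print(longest_word(dfa, 0, {1, 2, 3}))  # abb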

Uniqueness of B-tree

Say that I have a sequence of key values to be inserted into a B-tree of any given order. After insertion of all the elements, I am performing a deletion operation on some of those elements. Does it always give a unique result (in the form of a B-tree), or can it differ according to the deletion operation?
Quoted from the wiki (https://en.wikipedia.org/wiki/B-tree):
Deletion from an internal node

Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:

1. Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
2. The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of nodes), then rebalance the tree starting from the leaf node.
I think the result may vary according to the deletion operation, because of the choice in the quote above (either the largest element in the left subtree or the smallest element in the right subtree can become the new separator). Am I right? Help :)
If your question is whether two B-trees that contain the exact same collection of key values will always have identical nodes, then the answer is No.
Note that this is also true for e.g. simple binary trees.
However, in the case of B-trees this can be more pronounced because B-trees are optimized for minimizing page changes and thus the need to write back to slow secondary storage.
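A tiny Python demonstration of the order-dependence, using a plain unbalanced BST (names are my own): the same key set produces different shapes for different insertion orders, and the same holds for B-trees and for deletion orders.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def shape(root):
    # nested tuples make the structure easy to compare
    return None if root is None else (root.key, shape(root.left), shape(root.right))

a = b = None
for k in [2, 1, 3]: a = insert(a, k)
for k in [1, 2, 3]: b = insert(b, k)
print(shape(a))  # (2, (1, None, None), (3, None, None))
print(shape(b))  # (1, None, (2, None, (3, None, None)))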

Random variable-length encoded numbers with uniform distribution

Suppose I have data presented with a variable-length encoding, where I can retrieve an item by parsing some virtual b-tree and stopping when I reach the item (similar to Huffman encoding). There is an unknown number of items (in the best case only the upper limit is known). Is there an algorithm to generate uniformly distributed numbers? The problem is that a coin-based algorithm will give a non-uniform result in this case; for example, if there's a number encoded as 101 and a number encoded as 10010101, the latter will appear very rarely compared to the former.
UPDATE: In other words, I have a set of at most N elements (but maybe fewer), where every element can be addressed with an arbitrary number of bits (prefix-free, in accordance with information theory, so if one element is encoded 101 then no other element can be encoded with the same prefix). So it's more like a B-tree where I go left or right depending on a bit and at some moment I reach the data item. I want to get a sequence of random numbers addressed with this technique, but their distribution should be uniform (the example of why choosing left/right randomly won't work is above: the numbers 101 and 10010101).
Thanks
Max
I can think of three basic methods: one involves keeping extra information in the tree, one involves frequent reguessing, and one involves precomputing a list of all the nodes. I think that doing one of these things is unavoidable. I'm going to begin with the extra-information one:
In each node, store a number count which represents the number of leaves (data items) at or below it. To select a node, you'll draw a number between 1 and the root's count and use it to decide whether to go left or right at each step, by comparing it to the left child's count. Here's the algorithm:
n := random integer between 1 and root.count
node := root
while node.count != 1
    if n <= node.left.count
        node = node.left
    else
        n = n - node.left.count
        node = node.right
So, essentially, we're imposing a left-to-right ordering on all leaves and selecting the nth one from the left. This is fairly quick, running in O(depth of tree), which is likely the best we can do without doing something like also building a vector which contains all the node labels. This also adds an overhead of O(depth of tree) to any changes to the tree, since counts must be corrected. If you're going the other way and never changing the tree at all but going to be selecting random nodes a lot, just bite the bullet and put all of the node labels in a vector. That way you can select a random one in O(1) after O(N) initial set-up time.
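Here's a runnable Python version of the count-based selection (field names are my own; leaves carry the labels and a count of 1, interior nodes the sum of their children's counts):

import random

class Node:
    def __init__(self, label=None, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.count = 1 if label is not None else left.count + right.count

def random_leaf(root):
    n = random.randint(1, root.count)   # pick the n-th leaf from the left
    node = root
    while node.count != 1:
        if n <= node.left.count:
            node = node.left
        else:
            n -= node.left.count        # adjust before descending right
            node = node.right
    return node.label

# the codes 101 and 10010101 are now selected with equal probability:
tree = Node(left=Node(label="101"), right=Node(label="10010101"))
print(random_leaf(tree))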
If, however, you don't want to use up any storage space, here's an alternative with a lot of reguessing. First find a bound (which I'll label B) for the depth of the tree (we can use N-1 if needed, but obviously that's a very loose bound; the tighter the bound, the faster the algorithm runs). Next we're going to generate a possible node label in a random but uniform way. There are 2^(B+1)-1 possibilities. It's not just 2^B because, for example, the strings "0011" and "11" are completely different. As a result, we need to count all possible binary strings of length between 0 and B. Obviously, we have 2^i strings of length i. So for strings of length B or less, we have sum(i=0 to B){2^i} = 2^(B+1)-1. So, we can just choose a number between 0 and 2^(B+1)-2 and then find the corresponding node label. Of course, the mapping from numbers to node labels isn't trivial, so I'll provide it here.
We convert the number we have chosen into a string of bits in the ordinary way. Then, reading from the left, if the first digit is a 0, then the node label is the remaining string to the right (possibly the empty string, which is a valid node label although not likely to be in use). If the first digit is a 1, then we throw it away and repeat this process. Thus, if B=4, then the node label "0001" would come from the number "00001". The node label "001" would come from the number "10001". The node label "01" would come from the number "11001". The node label "1" would come from the number "11101". And the node label "" would come from the number "11110". We did not include the number 2^(B+1)-1 ("11111" in this case) which has no valid interpretation under this scheme. I'll leave it as an exercise to the reader to prove to themselves that every string from length 0 to B can be represented under this scheme. Rather than trying to prove it, I'll just assert that it will work.
So now we have a node label. The next step is to see if that label exists by traversing the tree. If it does, we're done. If it doesn't, then choose a new number and start over (that's the reguessing part). It's likely to have to reguess a lot, since only a small fraction of legal node labels will be in use, but this won't skew the fairness, just increase the time.
Here's a pseudo-code version of this process in four functions:
function num_to_binary_list(n, digits) =
    if digits == 0 return ()
    if n mod 2 == 0 return 0 :: num_to_binary_list(n/2, digits-1)
    else return 1 :: num_to_binary_list((n-1)/2, digits-1)

function binary_list_to_node_label_list(l) =
    if l.head() == 0 return l.tail()
    else return binary_list_to_node_label_list(l.tail())

function check_node_label_list_against_tree(str, node) =
    if node == null return false, null
    if str.isEmpty()
        if node.isLeaf() return true, node
        else return false, null
    if str.head() == 0 return check_node_label_list_against_tree(str.tail(), node.left)
    else return check_node_label_list_against_tree(str.tail(), node.right)

function generate_random_node(tree, b) =
    found := false
    while (not found)
        x := random(0, 2**(b+1)-2) // We're assuming that this random selects inclusively
        node_label := binary_list_to_node_label_list(num_to_binary_list(x, b+1))
        found, node := check_node_label_list_against_tree(node_label, tree)
    return node
The timing analysis for this, of course, is pretty horrendous. Basically, the while loop will run an average of (2^(B+1)-1)/N times. So, in the worst case, it's O((2^N)/N) which is terrible. In the best case, B would be on the order of log(N), so it would be roughly O(1), but that requires that the tree be fairly balanced which it may not be. Still, if you really want no extra space, this method does that.
I don't really think that you can do better than this last method without storing some information. It sounds appealing to be able to traverse the tree, making random decisions as you go, but without storing additional information about the structure, you're just not going to be able to do that. Every time you make a branching decision, you could have just one node on the left side and a million nodes on the right side or it could have a million nodes on the left side and just one on the right side. Because those are both possible and you don't know which is the case, there's simply no way to make an even random decision between the two sides. Obviously 50-50 doesn't work and any other choice is going to be similarly problematic.
So, if you don't want extra space, the second method will work, but be slow. If you don't mind adding some extra space, the first method will work and be fast. And, as I said earlier, if you're not going to be changing the tree and you'll be selecting a lot of random nodes, then bite the bullet and just traverse the tree and stick all leaf nodes in a self-growing array or vector and then pick from that.

Applying a Logarithm to Navigate a Tree

I had once known of a way to use logarithms to move from one leaf of a tree to the next "in-order" leaf of a tree. I think it involved taking a position value (rank?) of the "current" leaf and using it as a seed for a fresh traversal from the root down to the new target leaf - all the way using a log function test to determine whether to follow the right or left node down to the leaf.
I no longer recall how to exercise that technique. Can anyone re-introduce me?
I also don't recall if the technique required the tree to be balanced, or if it worked on n-trees or only binary trees. Any info would be appreciated.
Since you mentioned whether to go left or right, I'm going to assume you're talking about a binary tree specifically. In that case, I think you're right that there is a way. If your nodes are numbered left-to-right, top-to-bottom, starting with 1, then you can find the rank (depth in the tree) by taking the floor of the log2 of the node's number. To find that node again from the root, you can use the binary representation of the number, where 0 = left and 1 = right.
For example:
n = 11
11 in binary is 1011
We always ignore the first 1 since it's going to be there for every number (all nodes of rank n will be binary numbers with n+1 digits, with the first digit being 1). We're left with 011, which is saying from the root go left, then right, then right.
If you want to find the next in-order leaf, take the current leaf's number and add one, then traverse from the root using this method.
I believe this only works with balanced binary trees.
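As a sketch, the navigation step looks like this in Python (assuming node objects with left/right fields and level-order numbering from 1):

def node_by_number(root, n):
    path = bin(n)[3:]        # binary representation of n minus the leading 1
    node = root
    for bit in path:         # 0 = go left, 1 = go right
        node = node.left if bit == "0" else node.right
    return node

# next in-order leaf: in a perfect tree of height h the leaves are numbered
# 2**h .. 2**(h+1) - 1, so the next leaf is simply node_by_number(root, n + 1)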
OK, this proposal requires more characters than I can fit into a comment box. Steven does not believe that knowing the depth of the node in the tree is useful. I think it is. I have been wrong in the past, and I'm sure I'll be wrong in the future, so I will try to explain how this idea works in an attempt to not be wrong in the present. If I am, I apologize ahead of time. I'm nearly certain I got it from one of my Algorithms and Datastructures courses, using the CLR book. Please excuse any slips in notation or nomenclature, I haven't studied this stuff in a while.
Quoting wikipedia, "a complete binary tree is a binary tree in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible."
We are considering a complete tree with any branching degree (where a binary tree has a branching degree of two). Also, we are considering our nodes to have a 'positional value', which is simply an ordering of the nodes (top to bottom, left to right).
Now, if we are given a positional value, we can find the node in the following fashion. Take the log_base_n of the positional value of the element we are looking for (floor of this, we want an integer). Traverse down from the root that many times, minus one. Now, start looking through all the children of the nodes at this level. Your node you are searching for will be in this set.
This is an attempt in explaining the additional part of the wikipedia definition:
"This depth is equal to the integer part of log2(n) where n
is the number of nodes on the balanced tree.
Example 1: balanced tree with 1 node, log2(1) = 0 (depth = 0).
Example 2: balanced tree with 3 nodes, log2(3) = 1.59 (depth=1).
Example 3: balanced tree with 5 nodes, log2(5) = 2.32
(depth of tree is 2 nodes)."
This is useful, because you can simply traverse down to this level and then start looking around. It is useful and important to know the depth your node is located on, so you can start looking there, instead of starting to look at the beginning. Unless you know what level of the tree you are on, you get to start looking at all the nodes sequentially.
That is why I think it is helpful to know the depth of the node we are searching for.
It is a little bit odd, since having the "positional value" is not something we normally care about in a tree. I can see why Steve thought of this in terms of an array, since positional value is inherent in arrays.
-Brian J. Stinar-
Something that at least resembles your description is the binary heap, used among other things in priority queues.
I think I've found the answer, or at least a facsimile.
Assume the tree nodes are numbered, starting at 1, top-down and left-to-right. Assume traversal begins at the root, and halts when it finds node X (which means the parent is linked to its children). Also, for quick reference, the base 2 logarithmic values for nodes 1 through 12 are:
log2(1) = 0.0
log2(2) = 1
log2(3) = 1.58
log2(4) = 2
log2(5) = 2.32
log2(6) = 2.58
log2(7) = 2.807
log2(8) = 3
log2(9) = 3.16
log2(10) = 3.32
log2(11) = 3.459
log2(12) = 3.58
The fractional portion represents a unique diagonal position (notice how nodes 3, 6, and 12 all have fractional portion 0.58). Also notice that every node belongs either to the left or right side of the tree, depending on whether the log's fractional component is less or greater than 0.5. Anecdotes aside, the algorithm for finding a node is then as follows:
1. Examine the fractional portion: if it is less than .5, turn left; otherwise turn right.
2. Subtract one from the whole-number portion of the log; stop when the value reaches zero.
3. Double the fractional portion, and start over.
So, for example, if node 11 is what you seek then you start by computing the log which is 3.459. Then...
3.459 → the fraction .459 is less than .5: turn left and decrement the whole number to 2.
2.918 → the doubled fraction .918 is more than .5: turn right and decrement the whole number to 1.
1.836 → doubling .918 gives 1.836, but only the fractional part counts: .836 is more than .5, so turn right and decrement the whole number to 0. Done!
With appropriate accommodations, the same technique appears to work for any balanced n-ary tree. For example, given a balanced ternary tree, the choice of following the left, middle, or right edge is again based on the fractional portion of the log, as follows:
between 0.5-0.832: turn left (a one-third fraction range)
between 0.17-0.49: turn right (another one-third fraction range)
otherwise go down the middle. (the last one-third range)
The algorithm is adjusted by multiplying the fractional portion by 3 instead of 2. Again, a quick reference for those who want to test this last statement:
log3(1) = 0.0
log3(2) = 0.63
log3(3) = 1
log3(4) = 1.26
log3(5) = 1.46
log3(6) = 1.63
log3(7) = 1.77
log3(8) = 1.89
log3(9) = 2
At this point I wonder if there is an even more concise way to express this whole "log-based top-down selection of a node." I'm interested if anyone knows...
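For what it's worth, the fraction-doubling walk is easy to express in code. A Python sketch (floating point is fine for small trees, but it will eventually lose precision; the bit method in the earlier answer is the exact equivalent):

import math

def descend_by_log(n):
    # yields 'L'/'R' turns leading from the root to node number n
    whole, frac = divmod(math.log2(n), 1)
    while whole > 0:
        yield "L" if frac < 0.5 else "R"
        whole -= 1
        frac = (frac * 2) % 1

print("".join(descend_by_log(11)))  # LRR, matching the worked example above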
Case 1: Nodes have pointers to their parent
Starting from the node, traverse up the parent pointers until you reach an ancestor that you entered from its left side and whose right_child is non-null. Go to that right_child and then follow left_child pointers as long as they are non-null.
Case 2: Nodes do not have pointers to the parent
Starting from the root, find the path to the node (including the root and the node). Then find the last vertex (i.e. node) in the path whose right_child is non-null and not itself on the path. Go to that right_child and traverse left_child pointers as long as they are non-null.
In both cases, we are traversing either up from the node or down from the root. The length of such a traversal is at most on the order of the depth of the tree, hence logarithmic in the number of nodes if the tree is balanced.
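A Python sketch of Case 1 (field names left_child, right_child and parent are my assumption, with the root's parent being null):

def next_leaf(node):
    # climb until we reach an ancestor that we entered from its left side
    prev, node = node, node.parent
    while node is not None and (node.right_child is prev or node.right_child is None):
        prev, node = node, node.parent
    if node is None:
        return None                    # there is no next leaf
    node = node.right_child
    # descend to the leftmost reachable leaf of that subtree
    while node.left_child is not None or node.right_child is not None:
        node = node.left_child if node.left_child is not None else node.right_child
    return node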

Are duplicate keys allowed in the definition of binary search trees?

I'm trying to find the definition of a binary search tree and I keep finding different definitions everywhere.
Some say that for any given subtree the left child key is less than or equal to the root.
Some say that for any given subtree the right child key is greater than or equal to the root.
And my old college data structures book says "every element has a key and no two elements have the same key."
Is there a universal definition of a bst? Particularly in regards to what to do with trees with multiple instances of the same key.
EDIT: Maybe I was unclear, the definitions I'm seeing are
1) left <= root < right
2) left < root <= right
3) left < root < right, such that no duplicate keys exist.
Many algorithms will specify that duplicates are excluded. For example, the example algorithms in the MIT Algorithms book usually present examples without duplicates. It is fairly trivial to implement duplicates (either as a list at the node, or in one particular direction.)
Most (that I've seen) specify left children as <= and right children as >. Practically speaking, a BST which allows either of the right or left children to be equal to the root node, will require extra computational steps to finish a search where duplicate nodes are allowed.
It is best to use a list at the node to store duplicates, because inserting an 'equal' value to one side of a node requires rewriting the tree on that side to place the node as the child, or else the node is placed as a grand-child at some point below, which eliminates some of the search efficiency.
You have to remember, most of the classroom examples are simplified to portray and deliver the concept. They aren't worth squat in many real-world situations. But the statement, "every element has a key and no two elements have the same key", is not violated by the use of a list at the element node.
So go with what your data structures book said!
Edit:
A universal definition of a binary search tree involves storing and searching for a key by traversing a data structure in one of two directions. In the pragmatic sense, that means if the value is less than or greater than the node's value, you traverse the data structure in one of two 'directions'. So, in that sense, duplicate values don't make any sense at all.
This is different from BSP, or binary space partitioning, but not all that different. The algorithm to search has one of two directions for 'travel', or it is done (successfully or not). So I apologize that my original answer didn't address the concept of a 'universal definition', as duplicates are really a distinct topic (something you deal with after a successful search, not as part of the binary search).
If your binary search tree is a red black tree, or you intend to any kind of "tree rotation" operations, duplicate nodes will cause problems. Imagine your tree rule is this:
left < root <= right
Now imagine a simple tree whose root is 5, left child is nil, and right child is 5. If you do a left rotation on the root you end up with a 5 in the left child and a 5 in the root with the right child being nil. Now something in the left tree is equal to the root, but your rule above assumed left < root.
I spent hours trying to figure out why my red/black trees would occasionally traverse out of order, the problem was what I described above. Hopefully somebody reads this and saves themselves hours of debugging in the future!
All three definitions are acceptable and correct. They define different variations of a BST.
Your college data structures book failed to clarify that its definition was not the only possible one.
Certainly, allowing duplicates adds complexity. If you use the definition "left <= root < right" and you have a tree like:
  3
 / \
2   4
then adding a "3" duplicate key to this tree will result in:
    3
   / \
  2   4
   \
    3
Note that the duplicates are not in contiguous levels.
This is a big issue when allowing duplicates in a BST representation such as the one above: duplicates may be separated by any number of levels, so checking for a duplicate's existence is not as simple as checking the immediate children of a node.
An option to avoid this issue is to not represent duplicates structurally (as separate nodes) but instead use a counter that counts the number of occurrences of the key. The previous example would then have a tree like:
   3(1)
   /  \
 2(1)  4(1)
and after insertion of the duplicate "3" key it will become:
   3(2)
   /  \
 2(1)  4(1)
This simplifies lookup, removal and insertion operations, at the expense of some extra bytes and counter operations.
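A sketch of the counter representation in Python (a plain unbalanced BST for brevity; a balanced variant would rebalance only when a node is actually created or removed):

class Node:
    def __init__(self, key):
        self.key, self.count = key, 1
        self.left = self.right = None

def insert(root, key):
    if root is None:
        return Node(key)
    if key == root.key:
        root.count += 1           # duplicate: bump the counter, no new node
    elif key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def occurrences(root, key):
    if root is None:
        return 0
    if key == root.key:
        return root.count
    return occurrences(root.left if key < root.key else root.right, key)

t = None
for k in [3, 2, 4, 3]:
    t = insert(t, k)
print(occurrences(t, 3))  # 2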
In a BST, all values descending on the left side of a node are less than (or equal to, see later) the node itself. Similarly, all values descending on the right side of a node are greater than (or equal to) that node value(a).
Some BSTs may choose to allow duplicate values, hence the "or equal to" qualifiers above. The following example may clarify:
        14
       /  \
     13    22
    /     /  \
   1    16    29
             /  \
           28    29
This shows a BST that allows duplicates(b) - you can see that to find a value, you start at the root node and go down the left or right subtree depending on whether your search value is less than or greater than the node value.
This can be done recursively with something like:
def hasVal(node, srchval):
    if node is None:
        return False
    if node.val == srchval:
        return True
    if node.val > srchval:
        return hasVal(node.left, srchval)
    return hasVal(node.right, srchval)
and calling it with:
foundIt = hasVal (rootNode, valToLookFor)
Duplicates add a little complexity since you may need to keep searching once you've found your value, for other nodes of the same value. Obviously that doesn't matter for hasVal since it doesn't matter how many there are, just whether at least one exists. It will however matter for things like countVal, since it needs to know how many there are.
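For example, a countVal in the same style as hasVal, assuming (as in the tree pictured above) that duplicates are kept in the right subtree:

def countVal(node, srchval):
    if node is None:
        return 0
    if node.val == srchval:
        # equal keys can only continue down the right subtree here
        return 1 + countVal(node.right, srchval)
    if node.val > srchval:
        return countVal(node.left, srchval)
    return countVal(node.right, srchval)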
(a) You could actually sort them in the opposite direction should you so wish provided you adjust how you search for a specific key. A BST need only maintain some sorted order, whether that's ascending or descending (or even some weird multi-layer-sort method like all odd numbers ascending, then all even numbers descending) is not relevant.
(b) Interestingly, if your sorting key uses the entire value stored at a node (so that nodes containing the same key have no other extra information to distinguish them), there can be performance gains from adding a count to each node, rather than allowing duplicate nodes.
The main benefit is that adding or removing a duplicate will simply modify the count rather than inserting or deleting a new node (an action that may require re-balancing the tree).
So, to add an item, you first check if it already exists. If so, just increment the count and exit. If not, you need to insert a new node with a count of one then rebalance.
To remove an item, you find it then decrement the count - only if the resultant count is zero do you then remove the actual node from the tree and rebalance.
Searches are also quicker given there are fewer nodes but that may not be a large impact.
For example, the following two trees (non-counting on the left, and counting on the right) would be equivalent (in the counting tree, i.c means c copies of item i):
        __14__                       ___22.2___
       /      \                     /          \
     14        22                 7.1          29.1
    /  \      /  \               /   \         /   \
   1    14  22    29           1.1  14.3    28.1   30.1
    \            /  \
     7         28    30
Removing the leaf-node 22 from the left tree would involve rebalancing (since it now has a height differential of two) the resulting 22-29-28-30 subtree such as below (this is one option, there are others that also satisfy the "height differential must be zero or one" rule):
  \                     \
   22                    29
     \                  /  \
      29      -->     28    30
     /  \            /
   28    30        22
Doing the same operation on the right tree is a simple modification of the root node from 22.2 to 22.1 (with no rebalancing required).
In the book "Introduction to algorithms", third edition, by Cormen, Leiserson, Rivest and Stein, a binary search tree (BST) is explicitly defined as allowing duplicates. This can be seen in figure 12.1 and the following (page 287):
"The keys in a binary search tree are always stored in such a way as to satisfy the binary-search-tree property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y:key <= x:key. If y is a node in the right subtree of x, then y:key >= x:key."
In addition, a red-black tree is then defined on page 308 as:
"A red-black tree is a binary search tree with one extra bit of storage per node: its color"
Therefore, red-black trees defined in this book support duplicates.
Any definition is valid. As long as you are consistent in your implementation (always put equal nodes to the right, always put them to the left, or never allow them) then you're fine. I think it is most common to not allow them, but it is still a BST if they are allowed and place either left or right.
I just want to add some more information to @Robert Paulson's answer.
Let's assume that node contains key & data. So nodes with the same key might contain different data.
(So the search must find all nodes with the same key)
1. left <= cur < right
2. left < cur <= right
3. left <= cur <= right
4. left < cur < right, and cur contains sibling nodes with the same key.
5. left < cur < right, such that no duplicate keys exist.
1 & 2 work fine if the tree does not have any rotation-related functions to prevent skewness. But these forms don't work with an AVL tree or a Red-Black tree, because rotation will break the principle. And even if search() finds the node with the key, it must traverse down toward the leaf nodes for the nodes with the duplicate key, making the time complexity of search Θ(log N).
3 will work well with any form of BST with rotation-related functions. But the search will take O(n), ruining the purpose of using a BST.
Say we have the tree below, built under principle 3.
        12
       /  \
     10    20
    /  \   /
   9   11 12
      /  \
    10    12
If we do search(12) on this tree, even though we find 12 at the root, we must keep searching both the left and right children to seek out the duplicate keys. This takes O(n) time, as I said.
4 is my personal favorite. Let's say sibling means a node with the same key. We can change the above tree into the one below.
       12 - 12 - 12
      /   \
 10 - 10   20
 /  \
9    11
Now any search will take O(log N), because we don't have to traverse children looking for duplicate keys. And this principle also works well with AVL and RB trees.
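One way to realize option 4 in Python (a sketch; each tree node keeps every item that shares its key, so the tree itself never contains duplicate keys):

class Node:
    def __init__(self, key, data):
        self.key, self.items = key, [data]   # the "siblings" live in one list
        self.left = self.right = None

def insert(root, key, data):
    if root is None:
        return Node(key, data)
    if key == root.key:
        root.items.append(data)   # no structural change, so no rebalancing
    elif key < root.key:
        root.left = insert(root.left, key, data)
    else:
        root.right = insert(root.right, key, data)
    return root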
Working on a red-black tree implementation I was getting problems validating the tree with multiple keys until I realized that with the red-black insert rotation, you have to loosen the constraint to
left <= root <= right
Since none of the documentation I was looking at allowed for duplicate keys and I didn't want to rewrite the rotation methods to account for it, I just decided to modify my nodes to allow for multiple values within the node, and no duplicate keys in the tree.
Those three things you said are all true.
Keys are unique
To the left are keys less than this one
To the right are keys greater than this one
I suppose you could reverse your tree and put the smaller keys on the right, but really the "left" and "right" concept is just that: a visual concept to help us think about a data structure which doesn't really have a left or right, so it doesn't really matter.
1.) left <= root < right
2.) left < root <= right
3.) left < root < right, such that no duplicate keys exist.
I might have to go and dig out my algorithm books, but off the top of my head (3) is the canonical form.
(1) or (2) only come about when you start to allow duplicates nodes and you put duplicate nodes in the tree itself (rather than the node containing a list).
Duplicate Keys
• What happens if there's more than one data item with the same key?
  – This presents a slight problem in red-black trees.
  – It's important that nodes with the same key are distributed on both sides of other nodes with the same key.
  – That is, if keys arrive in the order 50, 50, 50, you want the second 50 to go to the right of the first one, and the third 50 to go to the left of the first one. Otherwise, the tree becomes unbalanced.
• This could be handled by some kind of randomizing process in the insertion algorithm.
  – However, the search process then becomes more complicated if all items with the same key must be found.
• It's simpler to outlaw items with the same key.
  – In this discussion we'll assume duplicates aren't allowed.
One can create a linked list for each node of the tree that contains duplicate keys and store data in the list.
The element ordering relation <= is a total order, so the relation must be reflexive, but commonly a binary search tree (aka BST) is a tree without duplicates. Otherwise, if there are duplicates, you need to run the same deletion function two or more times!
