Create a balanced binary search tree from a stream of integers

Create a balanced binary search tree from a stream of integers - algorithm

I have just finished a job interview and I was struggling with this question, which seems to me as a very hard question for giving on a 15 minutes interview.
The question was:
Write a function, which given a stream of integers (unordered), builds a balanced search tree.
Now, you can't wait for the input to end (it's a stream), so you need to balance the tree on the fly.
My first answer was to use a Red-Black tree, which of course does the job, but i have to assume they didn't expect me to implement a red black tree in 15 minutes.
So, is there any simple solution for this problem i'm not aware of?
Thanks,
Dave

I personally think that the best way to do this would be to go for a randomized binary search tree like a treap. This doesn't absolutely guarantee that the tree will be balanced, but with high probability the tree will have a good balance factor. A treap works by augmenting each element of the tree with a uniformly random number, then ensuring that the tree is a binary search tree with respect to the keys and a heap with respect to the uniform random values. Insertion into a treap is extremely easy:
Pick a random number to assign to the newly-added element.
Insert the element into the BST using standard BST insertion.
While the newly-inserted element's key is greater than the key of its parent, perform a tree rotation to bring the new element above its parent.
That last step is the only really hard one, but if you had some time to work it out on a whiteboard I'm pretty sure that you could implement this on-the-fly in an interview.
Another option that might work would be to use a splay tree. It's another type of fast BST that can be implemented assuming you have a standard BST insert function and the ability to do tree rotations. Importantly, splay trees are extremely fast in practice, and it's known that they are (to within a constant factor) at least as good as any other static binary search tree.
Depending on what's meant by "search tree," you could also consider storing the integers in some structure optimized for lookup of integers. For example, you could use a bitwise trie to store the integers, which supports lookup in time proportional to the number of bits in a machine word. This can be implemented quite nicely using a recursive function to look over the bits, and doesn't require any sort of rotations. If you needed to blast out an implementation in fifteen minutes, and if the interviewer allows you to deviate from the standard binary search trees, then this might be a great solution.
Hope this helps!

AA Trees are a bit simpler than Red-Black trees, but I couldn't implement one off the top of my head.

One of the simplest balanced binary search tree is BB(α)-tree. You pick the constant α, which says how much unbalanced can the tree get. At all times, #descendants(child) <= (1-α) × #descendants(node) must hold. You treat it as normal binary search tree, but when the formula doesn't apply to some node anymore, you just rebuild that part of the tree from scratch, so that it is perfectly balanced.
The amortized time complexity for insertion or deletion is still O(log N), just as with other balanced binary trees.

Related

Balanced binary trees versus indexed skiplists

Not sure if the question should be here or on programmers (or some other SE site), but I was curious about the relevant differences between balanced binary trees and indexable skiplists. The issue came up in the context of this question. From the wikipedia:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
Don't the space requirements of a skiplist depend on the depth of the hierarchy? And aren't binary trees easier to use, at least for searching (granted, insertion and deletion in balanced BSTs can be tricky)? Are there other advantages/disadvantages to skiplists?

(Some parts of your question (ease of use, simplicity, etc.) are a bit subjective and I'll answer them at the end of this post.)
Let's look at space usage. First, let's suppose that you have a binary search tree with n nodes. What's the total space usage required? Well, each node stores some data plus two pointers. You might also need some amount of information to maintain balance information. This means that the total space usage is
n * (2 * sizeof(pointer) + sizeof(data) + sizeof(balance information))
So let's think about an equivalent skiplist. You are absolutely right that the real amount of memory used by a skiplist depends on the heights of the nodes, but we can talk about the expected amount of space used by a skiplist. Typically, you pick the height of a node in a skiplist by starting at 1, then repeatedly flipping a fair coin, incrementing the height as long as you flip heads and stopping as soon as you flip tails. Given this setup, what is the expected number of pointers inside a skiplist?
An interesting result from probability theory is that if you have a series of independent events with probability p, you need approximately 1 / p trials (on expectation) before that event will occur. In our coin-flipping example, we're flipping a coin until it comes up tails, and since the coin is a fair coin (comes up heads with probability 50%), the expected number of trials necessary before we flip tails is 2. Since that last flip ends the growth, the expected number of times a node grows in a skiplist is 1. Therefore, on expectation, we would expect an average node to have only two pointers in it - one initial pointer and one added pointer. This means that the expected total space usage is
n * (2 * sizeof(pointer) + sizeof(data))
Compare this to the size of a node in a balanced binary search tree. If there is a nonzero amount of space required to store balance information, the skiplist will indeed use (on expectation) less memory than the balanced BST. Note that many types of balanced BSTs (e.g. treaps) require a lot of balance information, while others (red/black trees, AVL trees) have balance information but can hide that information in the low-order bits of its pointers, while others (splay trees) don't have any balance information at all. Therefore, this isn't a guaranteed win, but in many cases it will use space.
As to your other questions about simplicity, ease, etc: that really depends. I personally find the code to look up an element in a BST far easier than the code to do lookups in a skiplist. However, the rotation logic in balanced BSTs is often substantially more complicated than the insertion/deletion logic in a skiplist; try seeing if you can rattle off all possible rotation cases in a red/black tree without consulting a reference, or see if you can remember all the zig/zag versus zag/zag cases from a splay tree. In that sense, it can be a bit easier to memorize the logic for inserting or deleting from a skiplist.
Hope this helps!

And aren't binary trees easier to use, at least for searching
(granted, insertion and deletion in balanced BSTs can be tricky)?
Trees are "more recursive" (trees and subtrees) and SkipLists are "more iterative" (levels in an array). Of course, it depends on implementation, but SkipLists can also be very useful for practical applications.
It's easier to search in trees because you don't have to iterate levels in an array.
Are there other advantages/disadvantages to skiplists?
SkipLists are "easier" to implement. This is a little relative, but it's easier to implement a full-functional SkipList than deletion and balance operations in a BinaryTree.
Trees can be persistent (better for functional programming).
It's easier to delete items from SkipLists than internal nodes in a binary tree.
It's easier to add items to binary trees (keeping the balance is another issue)
Binary Trees are deterministic, so it's easier to study and analyze them.
My tip: If you have time, you must use a Balanced Binary Tree. If you have little time, use a Skip List. If you have no time, use a Library.

Something not mentioned so far is that skip lists can be advantageous for concurrent operations. If you read the source of ConcurrentSkipListMap, authored by Doug Lea... dig into the comments. It mentions:
there are no known efficient lock-free insertion and deletion algorithms for search trees. The immutability of the "down" links of index nodes (as opposed to mutable "left" fields in true trees) makes this tractable using only CAS operations.

You're right that this isn't the perfect forum.
The comment you quoted was written by the author of the original skip list paper: not exactly an unbiased assertion. It's been 23 years, and red-black trees still seem to be more prevalent than skip lists. An exception is redis key-value pair database, which includes skip lists as one option among its data structures.
Skip lists are very cool. But the only space advantage I've been able to show in the general randomized case is no need to store balance flags: two bits per value. This is assuming the hierarchy is dense enough to replicate binary tree performance. You can chalk this up as the price of determinism (vice. randomization). A nice feature of SL's is you can use less dense hierarchies to trade constant factors of speed for space.
Side note: it's not often discussed that if you don't need to traverse in sorted order, you can randomize unbalanced binary trees by just enciphering the keys (i.e. mapping to a pseudo-random cipher text with something very simple like RC4). Such trees are absolutely trivial to implement.

How is balancing of binary trees implemented?

I'm studying how to balance trees and I have some questions
Is it possible to balance a normal binary tree? If yes, which algorithm should be used?
Do I necessarily have to use a AVL or Red-black tree to obtain a balanced tree? How do these work?
I read something about rotations, weights but I'm kind of confused right now

Is it possible to balance a normal binary tree? If yes, which
algorithm should be used?
In O(n) you can build a complete tree, and populate it with the elements in in-order traversal.
It cannot be done better, because A BST might in rare cases decay to a chain (linked list), where all nodes have one son as null. In this cases, accessing the element in the middle is O(n) itself.
Do I necessarily have to use a AVL or Red-black tree to obtain a
balanced tree?
There are other balanced trees such as B+ trees, and other data structures (not trees) such as skip-lists. You might want to have a look at a list of known data structures, especially the trees section.
How do these work?
I find the wikipedia articles both on AVL tree and Red-Black tree very informative. If you have something specific you don't understand there - you should ask.
Also: Trying to implement a balanced trees on your own (Implement a known tree, not inventing a new one - of course) - is great for educational purposes, and by doing so - you will definitely understand how it works.

Well... AVL and red-black trees are "normal binary trees" that are balanced, and keep that balance (for some definition of "balanced"). I'm not a computer science teacher to come up with my own explanation of the algorithms, and I guess you aren't looking for a cut&paste from Wikipedia :-)
Now, for balancing binary trees: if the tree is a search tree (i.e. 'sorted', but 'balanced' doesn't really make all that much sense if it's not) you could always just recreate the tree. The simplest algorithm is to use an array with all the elements from the tree, in sorted order (easily obtained from an inorder traversal). Then build an algorithm around this general idea:
take the middle element of the array as the root of the tree. This will create a tree node, and two arrays "left" and "right", which are meant to form the left and right subtrees
Apply this same algorithm recursively to create a tree from the "left" array and one from the "right" array. These two trees become the children of the parent node.
You might have to be careful with the case when the array has an even number of elements: there is no obvious "middle element", and removing one of the two candidates will create arrays of different sizes. I'm too lazy to analyze this further to see if that could offset the whole balancing thing.
Of course, doing something like this every time you change the tree isn't such a great idea; you really want to use self-balancing trees like AVL for that. Doing it after creating the tree might not be all that useful either: you could just use the array itself and do binary searches on it, instead of making a tree. The array IS just another form of a binary tree...
EDIT: there is a reason why a lot of computer scientists have spent a lot of time developing data structures and algorithms that perform well in certain situations. Rolling your own version of a balanced binary tree is unlikely to beat these...

Can you balance an unbalanced tree?
Yes, You can. You use the same balance function you created for your AVL Tree inside a PostOrderTraversal function.
Should You Do it?
No!!! You should recreate it! Balancing the tree will cost you unnecessarily.
How do I recreate it?
Use an InOrderTraversal function to put your nodes into an array. Then use a variable that will always go to the middle of the array and the left middle, right middle and add the nodes to the new Tree.
Is it possible to balance a normal binary tree? If yes, which algorithm should be used?
Do I necessarily have to use a AVL or Red-black tree to obtain a balanced tree? How do these work?
In general, Trees are either unbalanced or balanced. AVL, Red-Black, 2-3, e.t.c. are just trees with some properties and according to their properties they use some extra variables and functions. Those extra variables and function can also be used in the "normal" binary trees. In other words those functions and variables are not bounded to their respective type of tree. The nodes of a "normal" binary tree always had a balance! You just didn't use it because you didn't care if the "normal" binary tree was balanced or not. They also always had a height, depth, e.t.c. You just didn't care. In general, you will realize at one point that all are a trade-off between speed and memory. If you know what you are doing, more memory usage will make your program faster. Less memory usage means more calculations so you will have a slower program.

What class of problem would one use a binary search tree to solve?

I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?

One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider if you want to find out if your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for to each individual value until you either find the value in the array, or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. Just a quick caveat, however, it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.

One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
.
/ \
. C
/ \
A B
This would be a tree for a file where the predominant letter was C (by having less bits used for common letters, the file is shorter than it would be for a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
The class of problems you use then for are those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can also use balanced trees which make the insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient since you're traversing the minimum possible number of nodes.

from wiki
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

AVL tree management

I have some question about AVL, lets assume I created some avl-tree of integers, how do I need to manage insertion into my tree to be able to take out the longest sequence of numbers, (insertion have to be with complexity O(logn)), for example:
_ 10 _
_ 7 _ _ 12 _
6 8
in this case the longest sequence will be 6,7,8 so in my function void sequence(int* low, int* high) I'll do *low = 6, *high = 8...
comlexity of the function(sequence) have to be O(1)
thanks in advance for any idea

Actually, if you build an interval list or something very-like-it, then store the components of it in an AVL tree, you could probably do okay. The thing is that you don't just want a given sequence, you want longest sequence. Longest run of lexically immediately adjacent keys*, to be more exact. Which, curiously, is quite hard I think without bashing up a custom metric to build your AVL tree on. I guess if your comparator for the AVL tree built on the interval list was f(length-of-interval), you could get it in o(logn) or maybe faster if your AVL implementation has fast max\min.
I'm terribly sorry, I was hoping to be more help, but the fact that we have to use an AVL tree is a little troubling. I'm wondering if there's a trick that one could pull involving sub-trees, but I'm simply seeing no good way to make such an approach o(1) without so much preprocessing as to be a joke. Something with bloom filters might work?
*
Some total orderings can create similar runs, but not all have a meaningful concept of immediate adjacency in their... well... phase space I guess?**
**My lackluster formal education is really biting me right about now.

The basic insertion & rotation in AVL Trees guarantees close to O(logn) performance.
Coming to the 2nd part of your question, to find the "complexity" of your sequence, you first need to find (or traverse to) the "low" element in your AVL Tree, that itself will take you AT MOST O(logn).
So O(1) sequence() complexity is out of the window... If O(1) is a must then maybe AVL tree is not your data structure here.

Applications of red-black trees

What are the applications of red-black (RB) trees? Is there any application where only RB Trees can be used and no other data structures?

A red-black tree is a particular implementation of a self-balancing binary search tree, and today it seems to be the most popular choice of implementation.
Binary search trees are used to implement finite maps, where you store a set of keys with associated values. You can also implement sets by only using the keys and not storing any values.
Balancing the tree is needed to guarantee good performance, as otherwise the tree could degenerate into a list, for example if you insert keys which are already sorted.
The advantage of search trees over hash tables is that you can traverse the tree efficiently in sort order.
AVL-trees are another variant of balanced binary search trees. They were popular before red-black trees were known. They are more carefully balanced, with a maximal difference of one between the heights of the left and right subtree (RB trees guarantee at most a factor of two). Their main drawback is that rebalancing takes more effort.
So red-black trees are certainly a good but not the only choice for this application.

Red Black Trees are from a class of self balancing BSTs and as answered by others, any such self balancing tree can be used. I would like to add that Red-black trees are widely used as system symbol tables. For example they are used in implementing the following:
Java: java.util.TreeMap , java.util.TreeSet .
C++ STL: map, multimap, multiset.
Linux kernel: completely fair scheduler, linux/rbtree.h

Unless you have very specific performance requirements, an R-B tree could be replaced by some other self-balancing binary tree, for example an AVL tree. Choosing between the two of them is basically a performance optimization - they offer the same basic operations.
Not that either of them is definitively "faster" than the other, just that they're different enough that specific uses of them will tend to have slightly different performance, all else being equal. So if you draw your requirements carefully enough, or just by chance, you could end up with one of them being "fast enough" for your use, and the other not. R-B offers slightly faster insertion than AVL, at the cost of slightly slower lookup.

There is no such rule like red black can only be used in a particular case
it depends upon the application in cases like when You have to build the tree only once and you have to query it many times then you can go for a AVL tree because in AVL tree searching is quite fast.. But it is strictly balanced so insertion and deletion may take some time
AVl tree may be used for language dictionery where You have to build the data structure just once
and the red black tree is used in the Completely Fair Scheduler used in current Linux kernels now a days..
the constraints applied on the red black tree also enforce the point that that that the path from the root to the furthest leaf is no more than twice as long as the path from the root to the nearest leaf.
BTW you can look for the various seach and insert etc time required for a red black tree down here
Average Worst case
Space O(n) O(n)
Search O(log n) O(log n)
Insert O(log n) O(log n)
Delete O(log n) O(log n)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio