Binary tree arranged per bit of data, not by comparisons? - data-structures

Suppose we use a binary tree just to store items that are just a series of bits. The top-level node's left node represents 0, right 1 for the least significant bit of the data:
Bit 1: 0 1
| |
Bit 2: 0 1 0 1
| | | |
Bit 3:01 01 01 01
^^ ^^ ^^ ^^
Value:04 26 15 37
The advantage over the normal comparison-based node arrangement is that this way doesn't ever need rebalancing. Also the nodes may take up less space.
Is this a useful data structure, or is there a better version of it? Does it have a name?

As #Lee pointed out in the comments, this is closely related to a binary trie data structure. You are correct that this implementation does not require any rebalancing to occur. In a BST, the shape of the tree is (more or less) independent of what's being stored in the tree, so you have to take precautions to ensure the tree doesn't become imbalanced. With a trie, the shape of the tree directly encodes what's stored in the trie, and no rebalancing is necessary.
Binary tries have a few disadvantages relative to, say, a balanced binary search tree. In particular, the storage requirements can be much higher; a binary tree holding n elements needs O(n log U) space, where U is the maximum value stored in the trie (O(log U) bits are needed to write it out). This can become a problem when you're storing a moderate number of large (64-bit, say) numbers. You can optimize the storage problem by using a Patricia trie (sometimes called a radix trie) by removing nodes that only have one child, though this requires a more clever implementation.
With some very clever optimizations on top of a binary tree, you can derive the x-fast trie and y-fast trie data structures. The latter of these uses O(n) total storage and supports lookups in time O(log log U). This can be strictly better than a BST in theory, though in practice the constant factors hidden here aren't great.

Related

Balanced binary trees versus indexed skiplists

Not sure if the question should be here or on programmers (or some other SE site), but I was curious about the relevant differences between balanced binary trees and indexable skiplists. The issue came up in the context of this question. From the wikipedia:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
Don't the space requirements of a skiplist depend on the depth of the hierarchy? And aren't binary trees easier to use, at least for searching (granted, insertion and deletion in balanced BSTs can be tricky)? Are there other advantages/disadvantages to skiplists?
(Some parts of your question (ease of use, simplicity, etc.) are a bit subjective and I'll answer them at the end of this post.)
Let's look at space usage. First, let's suppose that you have a binary search tree with n nodes. What's the total space usage required? Well, each node stores some data plus two pointers. You might also need some amount of information to maintain balance information. This means that the total space usage is
n * (2 * sizeof(pointer) + sizeof(data) + sizeof(balance information))
So let's think about an equivalent skiplist. You are absolutely right that the real amount of memory used by a skiplist depends on the heights of the nodes, but we can talk about the expected amount of space used by a skiplist. Typically, you pick the height of a node in a skiplist by starting at 1, then repeatedly flipping a fair coin, incrementing the height as long as you flip heads and stopping as soon as you flip tails. Given this setup, what is the expected number of pointers inside a skiplist?
An interesting result from probability theory is that if you have a series of independent events with probability p, you need approximately 1 / p trials (on expectation) before that event will occur. In our coin-flipping example, we're flipping a coin until it comes up tails, and since the coin is a fair coin (comes up heads with probability 50%), the expected number of trials necessary before we flip tails is 2. Since that last flip ends the growth, the expected number of times a node grows in a skiplist is 1. Therefore, on expectation, we would expect an average node to have only two pointers in it - one initial pointer and one added pointer. This means that the expected total space usage is
n * (2 * sizeof(pointer) + sizeof(data))
Compare this to the size of a node in a balanced binary search tree. If there is a nonzero amount of space required to store balance information, the skiplist will indeed use (on expectation) less memory than the balanced BST. Note that many types of balanced BSTs (e.g. treaps) require a lot of balance information, while others (red/black trees, AVL trees) have balance information but can hide that information in the low-order bits of its pointers, while others (splay trees) don't have any balance information at all. Therefore, this isn't a guaranteed win, but in many cases it will use space.
As to your other questions about simplicity, ease, etc: that really depends. I personally find the code to look up an element in a BST far easier than the code to do lookups in a skiplist. However, the rotation logic in balanced BSTs is often substantially more complicated than the insertion/deletion logic in a skiplist; try seeing if you can rattle off all possible rotation cases in a red/black tree without consulting a reference, or see if you can remember all the zig/zag versus zag/zag cases from a splay tree. In that sense, it can be a bit easier to memorize the logic for inserting or deleting from a skiplist.
Hope this helps!
And aren't binary trees easier to use, at least for searching
(granted, insertion and deletion in balanced BSTs can be tricky)?
Trees are "more recursive" (trees and subtrees) and SkipLists are "more iterative" (levels in an array). Of course, it depends on implementation, but SkipLists can also be very useful for practical applications.
It's easier to search in trees because you don't have to iterate levels in an array.
Are there other advantages/disadvantages to skiplists?
SkipLists are "easier" to implement. This is a little relative, but it's easier to implement a full-functional SkipList than deletion and balance operations in a BinaryTree.
Trees can be persistent (better for functional programming).
It's easier to delete items from SkipLists than internal nodes in a binary tree.
It's easier to add items to binary trees (keeping the balance is another issue)
Binary Trees are deterministic, so it's easier to study and analyze them.
My tip: If you have time, you must use a Balanced Binary Tree. If you have little time, use a Skip List. If you have no time, use a Library.
Something not mentioned so far is that skip lists can be advantageous for concurrent operations. If you read the source of ConcurrentSkipListMap, authored by Doug Lea... dig into the comments. It mentions:
there are no known efficient lock-free insertion and deletion algorithms for search trees. The immutability of the "down" links of index nodes (as opposed to mutable "left" fields in true trees) makes this tractable using only CAS operations.
You're right that this isn't the perfect forum.
The comment you quoted was written by the author of the original skip list paper: not exactly an unbiased assertion. It's been 23 years, and red-black trees still seem to be more prevalent than skip lists. An exception is redis key-value pair database, which includes skip lists as one option among its data structures.
Skip lists are very cool. But the only space advantage I've been able to show in the general randomized case is no need to store balance flags: two bits per value. This is assuming the hierarchy is dense enough to replicate binary tree performance. You can chalk this up as the price of determinism (vice. randomization). A nice feature of SL's is you can use less dense hierarchies to trade constant factors of speed for space.
Side note: it's not often discussed that if you don't need to traverse in sorted order, you can randomize unbalanced binary trees by just enciphering the keys (i.e. mapping to a pseudo-random cipher text with something very simple like RC4). Such trees are absolutely trivial to implement.

Why dont we use 2-3 or 2-3-4-5 trees?

I have a basic understanding of how 2-3-4 trees maintain the height balance property operation after operation to make sure even the worst case operations are O(n logn).
But I do not understand it well enough to know why only 2-3-4?
Why not 2-3 or 2-3-4-5 etc?
Implementation of 2-3-4 trees typically requires either multiple classes (2NODE, 3NODE, 4NODE) or you have just NODE that has an array of items. In the case of multiple classes you waste lots of time constructing and destructing node instances and reparenting them is cumbersome. If you use a single class with arrays to hold items and children then you are either resizing arrays constantly which is similarly wasteful or you wind up wasting over half your memory on unused array elements. It's just not very efficient compared to Red-Black trees.
Red-Black trees have only one type of node structure. Since Red-Black trees have a duality with 2-3-4 trees, RB trees can use the exact same algorithms as 2-3-4 trees (no need for the stupidly confusing/complex implementations described in Cormen, Leiserson and Rivest that led to AA trees which are not less complex than the 2-3-4 algorithm.)
So, Red-Black trees for their ease of implementation plus their memory/CPU efficiency. (AVL trees are nice too. They produce more well balanced trees and are stupid simply to code but they tend to be less efficient due to working too often to maintain only a slightly more compact tree.)
Oh, and 2-3-4-5-6... etc aren't done because nothing is gained. 2-3-4 has a net-gain over 2-3 trees because they can be done without recursion easily (recursion tends to be less efficient, especially when it cannot be coded tail-recursively). However, B-Trees and Bplus-Trees are pretty much 2-3-4-5-6-7-8-9-etc trees where the max size of the nodes, n, is chosen so that n records can be stored in a single disk sector. (i.e. each disk sector is a node in the tree and the size of the sector is equivalent to the number of items stored in the node.) This is because the time to search through 512 records linearly in memory is still MUCH faster than traversing down a level in the tree which requires another disk seek/fetch. and O(512) is still O(1) and thus maintains O(lg n) for the tree.
To be honest, I wasn't aware of 2-3-4 trees. At my Data Structures class, we were taught 2-3 trees, and to be honest, most of us implemented AVL trees for the wet part of the exercise.
But apparently, there's a generalization of this type of tree:
(a,b) tree.
At my algoritm course they told us that they are commonly used for acessing memory from hard disc - known as B/B+ trees. You make tree that store sizeof your avabile ram and by doing so you minimalize number of reed from disc operation (if you made B with node that store for example 10^8 elements you only need log_10^8(n) reed from disc operation to find something on hard disc which is nothing. So something that you called 2-3-4-5-... trees is in fact widespread solution.

Create a balanced binary search tree from a stream of integers

I have just finished a job interview and I was struggling with this question, which seems to me as a very hard question for giving on a 15 minutes interview.
The question was:
Write a function, which given a stream of integers (unordered), builds a balanced search tree.
Now, you can't wait for the input to end (it's a stream), so you need to balance the tree on the fly.
My first answer was to use a Red-Black tree, which of course does the job, but i have to assume they didn't expect me to implement a red black tree in 15 minutes.
So, is there any simple solution for this problem i'm not aware of?
Thanks,
Dave
I personally think that the best way to do this would be to go for a randomized binary search tree like a treap. This doesn't absolutely guarantee that the tree will be balanced, but with high probability the tree will have a good balance factor. A treap works by augmenting each element of the tree with a uniformly random number, then ensuring that the tree is a binary search tree with respect to the keys and a heap with respect to the uniform random values. Insertion into a treap is extremely easy:
Pick a random number to assign to the newly-added element.
Insert the element into the BST using standard BST insertion.
While the newly-inserted element's key is greater than the key of its parent, perform a tree rotation to bring the new element above its parent.
That last step is the only really hard one, but if you had some time to work it out on a whiteboard I'm pretty sure that you could implement this on-the-fly in an interview.
Another option that might work would be to use a splay tree. It's another type of fast BST that can be implemented assuming you have a standard BST insert function and the ability to do tree rotations. Importantly, splay trees are extremely fast in practice, and it's known that they are (to within a constant factor) at least as good as any other static binary search tree.
Depending on what's meant by "search tree," you could also consider storing the integers in some structure optimized for lookup of integers. For example, you could use a bitwise trie to store the integers, which supports lookup in time proportional to the number of bits in a machine word. This can be implemented quite nicely using a recursive function to look over the bits, and doesn't require any sort of rotations. If you needed to blast out an implementation in fifteen minutes, and if the interviewer allows you to deviate from the standard binary search trees, then this might be a great solution.
Hope this helps!
AA Trees are a bit simpler than Red-Black trees, but I couldn't implement one off the top of my head.
One of the simplest balanced binary search tree is BB(α)-tree. You pick the constant α, which says how much unbalanced can the tree get. At all times, #descendants(child) <= (1-α) × #descendants(node) must hold. You treat it as normal binary search tree, but when the formula doesn't apply to some node anymore, you just rebuild that part of the tree from scratch, so that it is perfectly balanced.
The amortized time complexity for insertion or deletion is still O(log N), just as with other balanced binary trees.

What class of problem would one use a binary search tree to solve?

I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?
One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider if you want to find out if your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for to each individual value until you either find the value in the array, or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. Just a quick caveat, however, it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.
One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
.
/ \
. C
/ \
A B
This would be a tree for a file where the predominant letter was C (by having less bits used for common letters, the file is shorter than it would be for a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
The class of problems you use then for are those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can also use balanced trees which make the insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient since you're traversing the minimum possible number of nodes.
from wiki
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

Fast Algorithm to Quickly Find the Range a Number Belongs to in a Set of Ranges?

The Scenario
I have several number ranges. Those ranges are not overlapping - as they are not overlapping, the logical consequence is that no number can be part of more than one range at any time. Each range is continuously (there are no holes within a single range, so a range 8 to 16 will really contain all numbers between 8 and 16), but there can be holes between two ranges (e.g. range starts at 64 and goes to 128, next range starts at 256 and goes to 384), so some numbers may not belong to any range at all (numbers 129 to 255 would not belong to any range in this example).
The Problem
I'm getting a number and need to know to which range the number belongs to... if it belongs to any range at all. Otherwise I need to know that it does not belong to any range. Of course speed is important; I can not simply check all the ranges which would be O(n), as there might be thousands of ranges.
Simple Solutions
A simple solution was keeping all numbers in a sorted array and run a binary search on it. That would give me at least O(log n). Of course the binary search must be somewhat modified as it must always check against the smallest and biggest number of a range. If the number to look for is in between, we have found the correct range, otherwise we must search ranges below or above the current one. If there is only one range left in the end and the number is not within that range, the number is within no range at all and we can return a "not found" result.
Ranges could also be chained together in some kind of tree structure. This is basically like a sorted list with binary search. The advantage is that it will be faster to modify a tree than a sorted array (adding/removing range), but unlike we waste some extra time on keeping the tree balanced, the tree might get very unbalanced over the time and that will lead to much slower searches than a binary search on a sorted array.
One can argue which solution is better or worse as in practice the number of searches and modification operations will be almost balanced (there will be an equal number of searches and add/remove operations performed per second).
Question
Is there maybe a better data structure than a sorted list or a tree for this kind of problem? Maybe one that could be even better than O(log n) in best case and O(log n) in worst case?
Some additional information that might help here is the following: All ranges always start and end at multiple of a power of two. They always all start and end at the same power of two (e.g. they all start/end at a multiple of 4 or at a multiple of 8 or at a multiple of 16 and so on). The power of two cannot change during run time. Before the first range is added, the power of two must be set and all ranges ever added must start/end at a multiple of this value till the application terminates. I think this can be used for optimization, as if they all start at a multiple of e.g. 8, I can ignore the first 3 bits for all comparison operations, the other bits alone will tell me the range if any.
I read about section and ranges trees. Are these optimal solutions to the problem? Are there possibly better solutions? The problem sounds similar to what a malloc implementation must do (e.g. every free'd memory block belongs to a range of available memory and the malloc implementation must find out to which one), so how do those commonly solve the issue?
After running various benchmarks, I came to the conclusion that only a tree like structure can work here. A sorted list shows of course good lookup performance - O(log n) - but it shows horribly update performance (inserts and removals are slower by more than the factor 10 compared to trees!).
A balanced binary tree also has O(log n) lookup performance, however it is much faster to update, also around O(log n), while a sorted list is more like O(n) for updates (O(log n) to find the position for insert or the element to delete, but then up to n elements must be moved within the list and this is O(n)).
I implemented an AVL tree, a red-black tree, a Treap, an AA-Tree and various variations of B-Trees (B means Bayer Tree here, not Binary). Result: Bayer trees almost never win. Their lookup is good, but their update performance is bad (as within each node of a B-Tree you have a sorted list again!). Bayer trees are only superior in cases where reading/writing a node is a very slow operation (e.g. when the nodes are directly read or written from/to hard disk) - as a B-Tree must read/write much less nodes than any other tree, so in such a case it will win. If we are having the tree in memory though, it stands no chance against other trees, sorry for all the B-Tree fans out there.
A Treap was easiest to implement (less than half the lines of code you need for other balanced trees, only twice the code you need for an unbalanced tree) and shows good average performance for lookups and updates... but we can do better than that.
An AA-Tree shows amazing good lookup performance - I have no idea why. They sometimes beat all other trees (not by far, but still enough to not be coincident)... and the removal performance is okay, however unless I'm too stupid to implement them correctly, the insert performance is really bad (it performs much more tree rotations on every insert than any other tree - even B-Trees have faster insert performance).
This leaves us with two classics, AVL and RB-Tree. They are both pretty similar but after hours of benchmarking, one thing is clear: AVL Trees definitely have better lookup performance than RB-Trees. The difference is not gigantic, but in 2/3 out of all benchmarks they will win the lookup test. Not too surprising, after all AVL Trees are more strictly balanced than RB-Trees, so they are closer to the optimal binary tree in most cases. We are not talking about a huge difference here, it is always a close race.
On the other hand RB Trees beat AVL Trees for inserts in almost all test runs and that is not such a close race. As before, that is expected. Being less strictly balanced RB Trees perform much less tree rotations on inserts compared to AVL Trees.
How about removal of nodes? Here it seems to depend a lot on the number of nodes. For small node numbers (everything less than half a million) RB Trees again own AVL Trees; the difference is even bigger than for inserts. Rather unexpected is that once the node number grows beyond a million nodes AVL Trees seems to catch up and the difference to RB Trees shrinks until they are more or less equally fast. This could be an effect of the system, though. It could have to do with memory usage of the process or CPU caching or the like. Something that has a more negative effect on RB Trees than it has on AVL Trees and thus AVL Trees can catch up. The same effect is not observed for lookups (AVL usually faster, regardless how many nodes) and inserts (RB usually faster, regardless how many nodes).
Conclusion:
I think the fastest I can get is when using RB-Trees, since the number of lookups will only be somewhat higher than the number of inserts and deletions and no matter how fast AVL is on lookups, the overall performance will suffer from their worse insert/deletion performance.
That is, unless anyone here may come up with a much better data structure that will own RB Trees big time ;-)
Create a sorted list and sort by the lower margin / start. That's easiest to implement and fast enough unless you have millions of ranges (and maybe even then).
When looking for a range, find the range where start <= position. You can use a binary search here since the list is sorted. The number is in the range if position <= end.
Since the end of any range is guaranteed to be smaller than start of the next range, you don't need to care about the end until you have found a range where the position might be contained.
All other data structures become interesting when you get intersections or you have a whole lot of ranges and when you build the structure one and query often.
A balanced, sorted tree with ranges on each node seems to be the answer.
I can't prove it's optimal, but if I were you I wouldn't look any further.
If the total range of numbers is low, and you have enough memory, you could create a huge table with all the numbers.
For example, if you have one million of numbers, you can create a table that references the range object.
As an alternative to O(log n) balanced binary search trees (BST), you could consider building a bitwise (compressed) trie. I.e. a prefix tree on the bits of the numbers you're storing.
This gives you O(w)-search, insert and delete performance; where w = number of bits (e.g. 32 or 64 minus whatever power of 2 your ranges were based on).
Not saying that it'll perform better or worse, but it seems like a true alternative in the sense it is different from BST but still has good theoretic performance and allows for predecessor queries just like BST.

Resources