Balanced binary trees versus indexed skiplists - algorithm

Not sure if the question should be here or on programmers (or some other SE site), but I was curious about the relevant differences between balanced binary trees and indexable skiplists. The issue came up in the context of this question. From the wikipedia:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
Don't the space requirements of a skiplist depend on the depth of the hierarchy? And aren't binary trees easier to use, at least for searching (granted, insertion and deletion in balanced BSTs can be tricky)? Are there other advantages/disadvantages to skiplists?

(Some parts of your question (ease of use, simplicity, etc.) are a bit subjective and I'll answer them at the end of this post.)
Let's look at space usage. First, let's suppose that you have a binary search tree with n nodes. What's the total space usage required? Well, each node stores some data plus two pointers. You might also need some amount of information to maintain balance information. This means that the total space usage is
n * (2 * sizeof(pointer) + sizeof(data) + sizeof(balance information))
So let's think about an equivalent skiplist. You are absolutely right that the real amount of memory used by a skiplist depends on the heights of the nodes, but we can talk about the expected amount of space used by a skiplist. Typically, you pick the height of a node in a skiplist by starting at 1, then repeatedly flipping a fair coin, incrementing the height as long as you flip heads and stopping as soon as you flip tails. Given this setup, what is the expected number of pointers inside a skiplist?
An interesting result from probability theory is that if you have a series of independent events with probability p, you need approximately 1 / p trials (on expectation) before that event will occur. In our coin-flipping example, we're flipping a coin until it comes up tails, and since the coin is a fair coin (comes up heads with probability 50%), the expected number of trials necessary before we flip tails is 2. Since that last flip ends the growth, the expected number of times a node grows in a skiplist is 1. Therefore, on expectation, we would expect an average node to have only two pointers in it - one initial pointer and one added pointer. This means that the expected total space usage is
n * (2 * sizeof(pointer) + sizeof(data))
Compare this to the size of a node in a balanced binary search tree. If there is a nonzero amount of space required to store balance information, the skiplist will indeed use (on expectation) less memory than the balanced BST. Note that many types of balanced BSTs (e.g. treaps) require a lot of balance information, while others (red/black trees, AVL trees) have balance information but can hide that information in the low-order bits of its pointers, while others (splay trees) don't have any balance information at all. Therefore, this isn't a guaranteed win, but in many cases it will use space.
As to your other questions about simplicity, ease, etc: that really depends. I personally find the code to look up an element in a BST far easier than the code to do lookups in a skiplist. However, the rotation logic in balanced BSTs is often substantially more complicated than the insertion/deletion logic in a skiplist; try seeing if you can rattle off all possible rotation cases in a red/black tree without consulting a reference, or see if you can remember all the zig/zag versus zag/zag cases from a splay tree. In that sense, it can be a bit easier to memorize the logic for inserting or deleting from a skiplist.
Hope this helps!

And aren't binary trees easier to use, at least for searching
(granted, insertion and deletion in balanced BSTs can be tricky)?
Trees are "more recursive" (trees and subtrees) and SkipLists are "more iterative" (levels in an array). Of course, it depends on implementation, but SkipLists can also be very useful for practical applications.
It's easier to search in trees because you don't have to iterate levels in an array.
Are there other advantages/disadvantages to skiplists?
SkipLists are "easier" to implement. This is a little relative, but it's easier to implement a full-functional SkipList than deletion and balance operations in a BinaryTree.
Trees can be persistent (better for functional programming).
It's easier to delete items from SkipLists than internal nodes in a binary tree.
It's easier to add items to binary trees (keeping the balance is another issue)
Binary Trees are deterministic, so it's easier to study and analyze them.
My tip: If you have time, you must use a Balanced Binary Tree. If you have little time, use a Skip List. If you have no time, use a Library.

Something not mentioned so far is that skip lists can be advantageous for concurrent operations. If you read the source of ConcurrentSkipListMap, authored by Doug Lea... dig into the comments. It mentions:
there are no known efficient lock-free insertion and deletion algorithms for search trees. The immutability of the "down" links of index nodes (as opposed to mutable "left" fields in true trees) makes this tractable using only CAS operations.

You're right that this isn't the perfect forum.
The comment you quoted was written by the author of the original skip list paper: not exactly an unbiased assertion. It's been 23 years, and red-black trees still seem to be more prevalent than skip lists. An exception is redis key-value pair database, which includes skip lists as one option among its data structures.
Skip lists are very cool. But the only space advantage I've been able to show in the general randomized case is no need to store balance flags: two bits per value. This is assuming the hierarchy is dense enough to replicate binary tree performance. You can chalk this up as the price of determinism (vice. randomization). A nice feature of SL's is you can use less dense hierarchies to trade constant factors of speed for space.
Side note: it's not often discussed that if you don't need to traverse in sorted order, you can randomize unbalanced binary trees by just enciphering the keys (i.e. mapping to a pseudo-random cipher text with something very simple like RC4). Such trees are absolutely trivial to implement.

Related

Single big BST vs multiple smaller BSTs? Which is faster?

I am storing 10^9 keys in a BST.
Compared to having lets say having multiple BSTs of size 10^6 containing chunk of the bigger tree? Search through all of them executing in parallel.
I am talking about only search performance here, Given that processing power is not a bottle neck.
It depends entirely on your key schema.
For example, let's say your keys are surnames, equally distributed across the twenty-six English letters. If you're looking for Pax Diablo, you can immediately remove 25/26ths of your search space, looking only in the D tree (for Diablo).
With a balanced binary tree, you would have to traverse about 4.7 tree levels on average (log226 is about 4.700439718).
So, yes, it can be more efficient, provided the up-front operation has minimal complexity. In the given example, the selection of one of twenty-six tress is O(1), based on the first character of the name and an array lookup to find the tree.
In the case where the keys are actually numbers from zero to a billion as your comments indicate, you could still have the same efficiency, depending on data distribution. If they're equally distributed (or even close), you could maintain a thousand different trees (from your statement that you want trees of size one million) based on the first three digits of the number and reduce the initial search by a factor of 1000 (about ten tree levels).
Of course, the distribution is important. If all your numbers turn out to be less that a million, they'll all be in the first tree and this scheme will save you nothing (in fact it'll add a useless first step).
Consider using a hash table. Look up for such big keyset should be noticably faster. A hashmap will have constant amortized search complexity as opposed to the logarithmic of a BST.
Also as you are talking about a huge tree here maybe you should take a look at b+ trees.
I doubt the approach you try to take will be more efficient than using the suggestions above. The depth of a binary tree grows very slowly(assuming it is balanced). On the other hand with your approach synchronization when you produce the output will be cumbersome.

What class of problem would one use a binary search tree to solve?

I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?
One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider if you want to find out if your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for to each individual value until you either find the value in the array, or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. Just a quick caveat, however, it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.
One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
.
/ \
. C
/ \
A B
This would be a tree for a file where the predominant letter was C (by having less bits used for common letters, the file is shorter than it would be for a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
The class of problems you use then for are those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can also use balanced trees which make the insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient since you're traversing the minimum possible number of nodes.
from wiki
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.

Fast Algorithm to Quickly Find the Range a Number Belongs to in a Set of Ranges?

The Scenario
I have several number ranges. Those ranges are not overlapping - as they are not overlapping, the logical consequence is that no number can be part of more than one range at any time. Each range is continuously (there are no holes within a single range, so a range 8 to 16 will really contain all numbers between 8 and 16), but there can be holes between two ranges (e.g. range starts at 64 and goes to 128, next range starts at 256 and goes to 384), so some numbers may not belong to any range at all (numbers 129 to 255 would not belong to any range in this example).
The Problem
I'm getting a number and need to know to which range the number belongs to... if it belongs to any range at all. Otherwise I need to know that it does not belong to any range. Of course speed is important; I can not simply check all the ranges which would be O(n), as there might be thousands of ranges.
Simple Solutions
A simple solution was keeping all numbers in a sorted array and run a binary search on it. That would give me at least O(log n). Of course the binary search must be somewhat modified as it must always check against the smallest and biggest number of a range. If the number to look for is in between, we have found the correct range, otherwise we must search ranges below or above the current one. If there is only one range left in the end and the number is not within that range, the number is within no range at all and we can return a "not found" result.
Ranges could also be chained together in some kind of tree structure. This is basically like a sorted list with binary search. The advantage is that it will be faster to modify a tree than a sorted array (adding/removing range), but unlike we waste some extra time on keeping the tree balanced, the tree might get very unbalanced over the time and that will lead to much slower searches than a binary search on a sorted array.
One can argue which solution is better or worse as in practice the number of searches and modification operations will be almost balanced (there will be an equal number of searches and add/remove operations performed per second).
Question
Is there maybe a better data structure than a sorted list or a tree for this kind of problem? Maybe one that could be even better than O(log n) in best case and O(log n) in worst case?
Some additional information that might help here is the following: All ranges always start and end at multiple of a power of two. They always all start and end at the same power of two (e.g. they all start/end at a multiple of 4 or at a multiple of 8 or at a multiple of 16 and so on). The power of two cannot change during run time. Before the first range is added, the power of two must be set and all ranges ever added must start/end at a multiple of this value till the application terminates. I think this can be used for optimization, as if they all start at a multiple of e.g. 8, I can ignore the first 3 bits for all comparison operations, the other bits alone will tell me the range if any.
I read about section and ranges trees. Are these optimal solutions to the problem? Are there possibly better solutions? The problem sounds similar to what a malloc implementation must do (e.g. every free'd memory block belongs to a range of available memory and the malloc implementation must find out to which one), so how do those commonly solve the issue?
After running various benchmarks, I came to the conclusion that only a tree like structure can work here. A sorted list shows of course good lookup performance - O(log n) - but it shows horribly update performance (inserts and removals are slower by more than the factor 10 compared to trees!).
A balanced binary tree also has O(log n) lookup performance, however it is much faster to update, also around O(log n), while a sorted list is more like O(n) for updates (O(log n) to find the position for insert or the element to delete, but then up to n elements must be moved within the list and this is O(n)).
I implemented an AVL tree, a red-black tree, a Treap, an AA-Tree and various variations of B-Trees (B means Bayer Tree here, not Binary). Result: Bayer trees almost never win. Their lookup is good, but their update performance is bad (as within each node of a B-Tree you have a sorted list again!). Bayer trees are only superior in cases where reading/writing a node is a very slow operation (e.g. when the nodes are directly read or written from/to hard disk) - as a B-Tree must read/write much less nodes than any other tree, so in such a case it will win. If we are having the tree in memory though, it stands no chance against other trees, sorry for all the B-Tree fans out there.
A Treap was easiest to implement (less than half the lines of code you need for other balanced trees, only twice the code you need for an unbalanced tree) and shows good average performance for lookups and updates... but we can do better than that.
An AA-Tree shows amazing good lookup performance - I have no idea why. They sometimes beat all other trees (not by far, but still enough to not be coincident)... and the removal performance is okay, however unless I'm too stupid to implement them correctly, the insert performance is really bad (it performs much more tree rotations on every insert than any other tree - even B-Trees have faster insert performance).
This leaves us with two classics, AVL and RB-Tree. They are both pretty similar but after hours of benchmarking, one thing is clear: AVL Trees definitely have better lookup performance than RB-Trees. The difference is not gigantic, but in 2/3 out of all benchmarks they will win the lookup test. Not too surprising, after all AVL Trees are more strictly balanced than RB-Trees, so they are closer to the optimal binary tree in most cases. We are not talking about a huge difference here, it is always a close race.
On the other hand RB Trees beat AVL Trees for inserts in almost all test runs and that is not such a close race. As before, that is expected. Being less strictly balanced RB Trees perform much less tree rotations on inserts compared to AVL Trees.
How about removal of nodes? Here it seems to depend a lot on the number of nodes. For small node numbers (everything less than half a million) RB Trees again own AVL Trees; the difference is even bigger than for inserts. Rather unexpected is that once the node number grows beyond a million nodes AVL Trees seems to catch up and the difference to RB Trees shrinks until they are more or less equally fast. This could be an effect of the system, though. It could have to do with memory usage of the process or CPU caching or the like. Something that has a more negative effect on RB Trees than it has on AVL Trees and thus AVL Trees can catch up. The same effect is not observed for lookups (AVL usually faster, regardless how many nodes) and inserts (RB usually faster, regardless how many nodes).
Conclusion:
I think the fastest I can get is when using RB-Trees, since the number of lookups will only be somewhat higher than the number of inserts and deletions and no matter how fast AVL is on lookups, the overall performance will suffer from their worse insert/deletion performance.
That is, unless anyone here may come up with a much better data structure that will own RB Trees big time ;-)
Create a sorted list and sort by the lower margin / start. That's easiest to implement and fast enough unless you have millions of ranges (and maybe even then).
When looking for a range, find the range where start <= position. You can use a binary search here since the list is sorted. The number is in the range if position <= end.
Since the end of any range is guaranteed to be smaller than start of the next range, you don't need to care about the end until you have found a range where the position might be contained.
All other data structures become interesting when you get intersections or you have a whole lot of ranges and when you build the structure one and query often.
A balanced, sorted tree with ranges on each node seems to be the answer.
I can't prove it's optimal, but if I were you I wouldn't look any further.
If the total range of numbers is low, and you have enough memory, you could create a huge table with all the numbers.
For example, if you have one million of numbers, you can create a table that references the range object.
As an alternative to O(log n) balanced binary search trees (BST), you could consider building a bitwise (compressed) trie. I.e. a prefix tree on the bits of the numbers you're storing.
This gives you O(w)-search, insert and delete performance; where w = number of bits (e.g. 32 or 64 minus whatever power of 2 your ranges were based on).
Not saying that it'll perform better or worse, but it seems like a true alternative in the sense it is different from BST but still has good theoretic performance and allows for predecessor queries just like BST.

B-tree faster than AVL or RedBlack-Tree? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I know that performance never is black and white, often one implementation is faster in case X and slower in case Y, etc. but in general - are B-trees faster then AVL or RedBlack-Trees? They are considerably more complex to implement then AVL trees (and maybe even RedBlack-trees?), but are they faster (does their complexity pay off) ?
Edit: I should also like to add that if they are faster then the equivalent AVL/RedBlack tree (in terms of nodes/content) - why are they faster?
Sean's post (the currently accepted one) contains several incorrect claims. Sorry Sean, I don't mean to be rude; I hope I can convince you that my statement is based in fact.
They're totally different in their use cases, so it's not possible to make a comparison.
They're both used for maintaining a set of totally ordered items with fast lookup, insertion and deletion. They have the same interface and the same intention.
RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. [...]
always O(log n)
B-trees are typically disk-based structures, and so are inherently slower than in-memory data.
Nonsense. When you store search trees on disk, you typically use B-trees. That much is true. When you store data on disk, it's slower to access than data in memory. But a red-black tree stored on disk is also slower than a red-black tree stored in memory.
You're comparing apples and oranges here. What is really interesting is a comparison of in-memory B-trees and in-memory red-black trees.
[As an aside: B-trees, as opposed to red-black trees, are theoretically efficient in the I/O-model. I have experimentally tested (and validated) the I/O-model for sorting; I'd expect it to work for B-trees as well.]
B-trees are rarely binary trees, the number of children a node can have is typically a large number.
To be clear, the size range of B-tree nodes is a parameter of the tree (in C++, you may want to use an integer value as a template parameter).
The management of the B-tree structure can be quite complicated when the data changes.
I remember them to be much simpler to understand (and implement) than red-black trees.
B-tree try to minimize the number of disk accesses so that data retrieval is reasonably deterministic.
That much is true.
It's not uncommon to see something like 4 B-tree access necessary to lookup a bit of data in a very database.
Got data?
In most cases I'd say that in-memory RB trees are faster.
Got data?
Because the lookup is binary it's very easy to find something. B-tree can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation.
The size of each node is a fixed parameter, so even if you do a linear scan, it's O(1). If we big-oh over the size of each node, note that you typically keep the array sorted so it's O(log n).
On a RB-tree it'd be O(logN) since you're doing one comparison and then branching.
You're comparing apples and oranges. The O(log n) is because the height of the tree is at most O(log n), just as it is for a B-tree.
Also, unless you play nasty allocation tricks with the red-black trees, it seems reasonable to conjecture that B-trees have better caching behavior (it accesses an array, not pointers strewn about all over the place, and has less allocation overhead increasing memory locality even more), which might help it in the speed race.
I can point to experimental evidence that B-trees (with size parameters 32 and 64, specifically) are very competitive with red-black trees for small sizes, and outperforms it hands down for even moderately large values of n. See http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html
B-trees are faster. Why? I conjecture that it's due to memory locality, better caching behavior and less pointer chasing (which are, if not the same things, overlapping to some degree).
Actually Wikipedia has a great article that shows every RB-Tree can easily be expressed as a B-Tree. Take the following tree as sample:
now just convert it to a B-Tree (to make this more obvious, nodes are still colored R/B, what you usually don't have in a B-Tree):
Same Tree as B-Tree
(cannot add the image here for some weird reason)
Same is true for any other RB-Tree. It's taken from this article:
http://en.wikipedia.org/wiki/Red-black_tree
To quote from this article:
The red-black tree is then
structurally equivalent to a B-tree of
order 4, with a minimum fill factor of
33% of values per cluster with a
maximum capacity of 3 values.
I found no data that one of both is significantly better than the other one. I guess one of both had already died out if that was the case. They are different regarding how much data they must store in memory and how complicated it is to add/remove nodes from the tree.
Update:
My personal tests suggest that B-Trees are better when searching for data, as they have better data locality and thus the CPU cache can do compares somewhat faster. The higher the order of a B-Tree (the order is the number of children a note can have), the faster the lookup will get. On the other hand, they have worse performance for adding and removing new entries the higher their order is. This is caused by the fact that adding a value within a node has linear complexity. As each node is a sorted array, you must move lots of elements around within that array when adding an element into the middle: all elements to the left of the new element must be moved one position to the left or all elements to the right of the new element must be moved one position to the right. If a value moves one node upwards during an insert (which happens frequently in a B-Tree), it leaves a hole which must be also be filled either by moving all elements from the left one position to the right or by moving all elements to the right one position to the left. These operations (in C usually performed by memmove) are in fact O(n). So the higher the order of the B-Tree, the faster the lookup but the slower the modification. On the other hand if you choose the order too low (e.g. 3), a B-Tree shows little advantages or disadvantages over other tree structures in practice (in such a case you can as well use something else). Thus I'd always create B-Trees with high orders (at least 4, 8 and up is fine).
File systems, which often base on B-Trees, use much higher orders (order 200 and even a lot more) - this is because they usually choose the order high enough so that a node (when containing maximum number of allowed elements) equals either the size of a sector on harddrive or of a cluster of the filesystem. This gives optimal performance (since a HD can only write a full sector at a time, even when just one byte is changed, the full sector is rewritten anyway) and optimal space utilization (as each data entry on drive equals at least the size of one cluster or is a multiple of the cluster sizes, no matter how big the data really is). Caused by the fact that the hardware sees data as sectors and the file system groups sectors to clusters, B-Trees can yield much better performance and space utilization for file systems than any other tree structure can; that's why they are so popular for file systems.
When your app is constantly updating the tree, adding or removing values from it, a RB-Tree or an AVL-Tree may show better performance on average compared to a B-Tree with high order. Somewhat worse for the lookups and they might also need more memory, but therefor modifications are usually fast. Actually RB-Trees are even faster for modifications than AVL-Trees, therefor AVL-Trees are a little bit faster for lookups as they are usually less deep.
So as usual it depends a lot what your app is doing. My recommendations are:
Lots of lookups, little modifications: B-Tree (with high order)
Lots of lookups, lots of modifiations: AVL-Tree
Little lookups, lots of modifications: RB-Tree
An alternative to all these trees are AA-Trees. As this PDF paper suggests, AA-Trees (which are in fact a sub-group of RB-Trees) are almost equal in performance to normal RB-Trees, but they are much easier to implement than RB-Trees, AVL-Trees, or B-Trees. Here is a full implementation, look how tiny it is (the main-function is not part of the implementation and half of the implementation lines are actually comments).
As the PDF paper shows, a Treap is also an interesting alternative to classic tree implementation. A Treap is also a binary tree, but one that doesn't try to enforce balancing. To avoid worst case scenarios that you may get in unbalanced binary trees (causing lookups to become O(n) instead of O(log n)), a Treap adds some randomness to the tree. Randomness cannot guarantee that the tree is well balanced, but it also makes it highly unlikely that the tree is extremely unbalanced.
Nothing prevents a B-Tree implementation that works only in memory. In fact, if key comparisons are cheap, in-memory B-Tree can be faster because its packing of multiple keys in one node will cause less cache misses during searches. See this link for performance comparisons. A quote: "The speed test results are interesting and show the B+ tree to be significantly faster for trees containing more than 16,000 items." (B+Tree is just a variation on B-Tree).
The question is old but I think it is still relevant. Jonas Kölker and Mecki gave very good answers but I don't think the answers cover the whole story. I would even argue that the whole discussion is missing the point :-).
What was said about B-Trees is true when entries are relatively small (integers, small strings/words, floats, etc). When entries are large (over 100B) the differences become smaller/insignificant.
Let me sum up the main points about B-Trees:
They are faster than any Binary Search Tree (BSTs) due to memory locality (resulting in less cache and TLB misses).
B-Trees are usually more space efficient if entries are relatively
small or if entries are of variable size. Free space management is
easier (you allocate larger chunks of memory) and the extra metadata
overhead per entry is lower. B-Trees will waste some space as nodes
are not always full, however, they still end up being more compact
that Binary Search Trees.
The big O performance ( O(logN) ) is the same for both. Moreover, if you do binary search inside each B-Tree node, you will even end up with the same number of comparisons as in a BST (it is a nice math exercise to verify this).
If the B-Tree node size is sensible (1-4x cache line size), linear searching inside each node is still faster because of
the hardware prefetching. You can also use SIMD instructions for
comparing basic data types (e.g. integers).
B-Trees are better suited for compression: there is more data per node to compress. In certain cases this can be a huge benefit.
Just think of an auto-incrementing key in a relational database table that is used to build an index. The lead nodes of a B-Tree contain consecutive integers that compress very, very well.
B-Trees are clearly much, much faster when stored on secondary storage (where you need to do block IO).
On paper, B-Trees have a lot of advantages and close to no disadvantages. So should one just use B-Trees for best performance?
The answer is usually NO -- if the tree fits in memory. In cases where performance is crucial you want a thread-safe tree-like data-structure (simply put, several threads can do more work than a single one). It is more problematic to make a B-Tree support concurrent accesses than to make a BST. The most straight-forward way to make a tree support concurrent accesses is to lock nodes as you are traversing/modifying them. In a B-Tree you lock more entries per node, resulting in more serialization points and more contended locks.
All tree versions (AVL, Red/Black, B-Tree, an others) have countless variants that differ in how they support concurrency. The vanilla algorithms that are taught in a university course or read from some introductory books are almost never used in practice. So, it is hard to say which tree performs best as there is no official agreement on the exact algorithms are behind each tree. I would suggest to think of the trees mentioned more like data-structure classes that obey certain tree-like invariants rather than precise data-structures.
Take for example the B-Tree. The vanilla B-Tree is almost never used in practice -- you cannot make it to scale well! The most common B-Tree variant used is the B+-Tree (widely used in file-systems, databases). The main differences between the B+-Tree and the B-Tree: 1) you dont store entries in the inner nodes of the tree (thus you don't need write locks high in the tree when modifying an entry stored in an inner node); 2) you have links between nodes at the same level (thus you do not have to lock the parent of a node when doing range searches).
I hope this helps.
Guys from Google recently released their implementation of STL containers, which is based on B-trees. They claim their version is faster and consume less memory compared to standard STL containers, implemented via red-black trees.
More details here
For some applications, B-trees are significantly faster than BSTs.
The trees you may find here:
http://freshmeat.net/projects/bps
are quite fast. They also use less memory than regular BST implementations, since they do not require the BST infrastructure of 2 or 3 pointers per node, plus some extra fields to keep the balancing information.
THey are sed in different circumstances - B-trees are used when the tree nodes need to be kept together in storage - typically because storage is a disk page and so re-balancing could be vey expensive. RB trees are used when you don't have this constraint. So B-trees will probably be faster if you want to implement (say) a relational database index, while RB trees will probably be fasterv for (say) an in memory search.
They all have the same asymptotic behavior, so the performance depends more on the implementation than which type of tree you are using.
Some combination of tree structures might actually be the fastest approach, where each node of a B-tree fits exactly into a cache-line and some sort of binary tree is used to search within each node. Managing the memory for the nodes yourself might also enable you to achieve even greater cache locality, but at a very high price.
Personally, I just use whatever is in the standard library for the language I am using, since it's a lot of work for a very small performance gain (if any).
On a theoretical note... RB-trees are actually very similar to B-trees, since they simulate the behavior of 2-3-4 trees. AA-trees are a similar structure, which simulates 2-3 trees instead.
moreover ...the height of a red black tree is O(log[2] N) whereas that of B-tree is O(log[q] N) where ceiling[N]<= q <= N . So if we consider comparisons in each key array of B-tree (that is fixed as mentioned above) then time complexity of B-tree <= time complexity of Red-black tree. (equal case for single record equal in size of a block size)

Red-Black Trees

I've seen binary trees and binary searching mentioned in several books I've read lately, but as I'm still at the beginning of my studies in Computer Science, I've yet to take a class that's really dealt with algorithms and data structures in a serious way.
I've checked around the typical sources (Wikipedia, Google) and most descriptions of the usefulness and implementation of (in particular) Red-Black trees have come off as dense and difficult to understand. I'm sure for someone with the necessary background, it makes perfect sense, but at the moment it reads like a foreign language almost.
So what makes binary trees useful in some of the common tasks you find yourself doing while programming? Beyond that, which trees do you prefer to use (please include a sample implementation) and why?
Red Black trees are good for creating well-balanced trees. The major problem with binary search trees is that you can make them unbalanced very easily. Imagine your first number is a 15. Then all the numbers after that are increasingly smaller than 15. You'll have a tree that is very heavy on the left side and has nothing on the right side.
Red Black trees solve that by forcing your tree to be balanced whenever you insert or delete. It accomplishes this through a series of rotations between ancestor nodes and child nodes. The algorithm is actually pretty straightforward, although it is a bit long. I'd suggest picking up the CLRS (Cormen, Lieserson, Rivest and Stein) textbook, "Introduction to Algorithms" and reading up on RB Trees.
The implementation is also not really so short so it's probably not really best to include it here. Nevertheless, trees are used extensively for high performance apps that need access to lots of data. They provide a very efficient way of finding nodes, with a relatively small overhead of insertion/deletion. Again, I'd suggest looking at CLRS to read up on how they're used.
While BSTs may not be used explicitly - one example of the use of trees in general are in almost every single modern RDBMS. Similarly, your file system is almost certainly represented as some sort of tree structure, and files are likewise indexed that way. Trees power Google. Trees power just about every website on the internet.
I'd like to address only the question "So what makes binary trees useful in some of the common tasks you find yourself doing while programming?"
This is a big topic that many people disagree on. Some say that the algorithms taught in a CS degree such as binary search trees and directed graphs are not used in day-to-day programming and are therefore irrelevant. Others disagree, saying that these algorithms and data structures are the foundation for all of our programming and it is essential to understand them, even if you never have to write one for yourself. This filters into conversations about good interviewing and hiring practices. For example, Steve Yegge has an article on interviewing at Google that addresses this question. Remember this debate; experienced people disagree.
In typical business programming you may not need to create binary trees or even trees very often at all. However, you will use many classes which internally operate using trees. Many of the core organization classes in every language use trees and hashes to store and access data.
If you are involved in more high-performance endeavors or situations that are somewhat outside the norm of business programming, you will find trees to be an immediate friend. As another poster said, trees are core data structures for databases and indexes of all kinds. They are useful in data mining and visualization, advanced graphics (2d and 3d), and a host of other computational problems.
I have used binary trees in the form of BSP (binary space partitioning) trees in 3d graphics. I am currently looking at trees again to sort large amounts of geocoded data and other data for information visualization in Flash/Flex applications. Whenever you are pushing the boundary of the hardware or you want to run on lower hardware specifications, understanding and selecting the best algorithm can make the difference between failure and success.
None of the answers mention what it is exactly BSTs are good for.
If what you want to do is just lookup by values then a hashtable is much faster, O(1) insert and lookup (amortized best case).
A BST will be O(log N) lookup where N is the number of nodes in the tree, inserts are also O(log N).
RB and AVL trees are important like another answer mentioned because of this property, if a plain BST is created with in-order values then the tree will be as high as the number of values inserted, this is bad for lookup performance.
The difference between RB and AVL trees are in the the rotations required to rebalance after an insert or delete, AVL trees are O(log N) for rebalances while RB trees are O(1). An example of benefit of this constant complexity is in a case where you might be keeping a persistent data source, if you need to track changes to roll-back you would have to track O(log N) possible changes with an AVL tree.
Why would you be willing to pay for the cost of a tree over a hash table? ORDER! Hash tables have no order, BSTs on the other hand are always naturally ordered by virtue of their structure. So if you find yourself throwing a bunch of data in an array or other container and then sorting it later, a BST may be a better solution.
The tree's order property gives you a number of ordered iteration capabilities, in-order, depth-first, breadth-first, pre-order, post-order. These iteration algorithms are useful in different circumstances if you want to look them up.
Red black trees are used internally in almost every ordered container of language libraries, C++ Set and Map, .NET SortedDictionary, Java TreeSet, etc...
So trees are very useful, and you may use them quite often without even knowing it. You most likely will never need to write one yourself, though I would highly recommend it as an interesting programming exercise.
Red Black Trees and B-trees are used in all sorts of persistent storage; because the trees are balanced the performance of breadth and depth traversals are mitigated.
Nearly all modern database systems use trees for data storage.
BSTs make the world go round, as said by Micheal. If you're looking for a good tree to implement, take a look at AVL trees (Wikipedia). They have a balancing condition, so they are guaranteed to be O(logn). This kind of searching efficiency makes it logical to put into any kind of indexing process. The only thing that would be more efficient would be a hashing function, but those get ugly quick, fast, and in a hurry. Also, you run into the Birthday Paradox (also known as the pigeon-hole problem).
What textbook are you using? We used Data Structures and Analysis in Java by Mark Allen Weiss. I actually have it open in my lap as i'm typing this. It has a great section about Red-Black trees, and even includes the code necessary to implement all the trees it talks about.
Red-black trees stay balanced, so you don't have to traverse deep to get items out. The time saved makes RB trees O(log()n)) in the WORST case, whereas unlucky binary trees can get into a lop sided configuration and cause retrievals in O(n) a bad case. This does happen in practice or on random data. So if you need time critical code (database retrievals, network server etc.) you use RB trees to support ordered or unordered lists/sets .
But RBTrees are for noobs! If you are doing AI and you need to perform a search, you find you fork the state information alot. You can use a persistent red-black to fork new states in O(log(n)). A persistent red black tree keeps a copy of the tree before and after a morphological operation (insert/delete), but without copying the entire tree (normally and O(log(n)) operation). I have open sourced a persistent red-black tree for java
http://edinburghhacklab.com/2011/07/a-java-implementation-of-persistent-red-black-trees-open-sourced/
The best description of red-black trees I have seen is the one in Cormen, Leisersen and Rivest's 'Introduction to Algorithms'. I could even understand it enough to partially implement one (insertion only). There are also quite a few applets such as This One on various web pages that animate the process and allow you to watch and step through a graphical representation of the algorithm building a tree structure.
Since you ask which tree people use, you need to know that a Red Black tree is fundamentally a 2-3-4 B-tree (i.e a B-tree of order 4). A B-tree is not equivalent to a binary tree(as asked in your question).
Here's an excellent resource describing the initial abstraction known as the symmetric binary B-tree that later evolved into the RBTree. You would need a good grasp on B-trees before it makes sense. To summarize: a 'red' link on a Red Black tree is a way to represent nodes that are part of a B-tree node (values within a key range), whereas 'black' links are nodes that are connected vertically in a B-tree.
So, here's what you get when you translate the rules of a Red Black tree in terms of a B-tree (I'm using the format Red Black tree rule => B Tree equivalent):
1) A node is either red or black. => A node in a b-tree can either be part of a node, or as a node in a new level.
2) The root is black. (This rule is sometimes omitted, since it doesn't affect analysis) => The root node can be thought of either as a part of an internal root node as a child of an imaginary parent node.
3) All leaves (NIL) are black. (All leaves are same color as the root.) => Since one way of representing a RB tree is by omitting the leaves, we can rule this out.
4)Both children of every red node are black. => The children of an internal node in a B-tree always lie on another level.
5)Every simple path from a given node to any of its descendant leaves contains the same number of black nodes. => A B-tree is kept balanced as it requires that all leaf nodes are at the same depth (Hence the height of a B-tree node is represented by the number of black links from the root to the leaf of a Red Black tree)
Also, there's a simpler 'non-standard' implementation by Robert Sedgewick here: (He's the author of the book Algorithms along with Wayne)
Lots and lots of heat here, but not much light, so lets see if we can provide some.
First, a RB tree is an associative data structure, unlike, say an array, which cannot take a key and return an associated value, well, unless that's an integer "key" in a 0% sparse index of contiguous integers. An array cannot grow in size either (yes, I know about realloc() too, but under the covers that requires a new array and then a memcpy()), so if you have either of these requirements, an array won't do. An array's memory efficiency is perfect. Zero waste, but not very smart, or flexible - realloc() not withstanding.
Second, in contrast to a bsearch() on an array of elements, which IS an associative data structure, a RB tree can grow (AND shrink) itself in size dynamically. The bsearch() works fine for indexing a data structure of a known size, which will remain that size. So if you don't know the size of your data in advance, or new elements need to be added, or deleted, a bsearch() is out. Bsearch() and qsort() are both well supported in classic C, and have good memory efficiency, but are not dynamic enough for many applications. They are my personal favorite though because they're quick, easy, and if you're not dealing with real-time apps, quite often are flexible enough. In addition, in C/C++ you can sort an array of pointers to data records, pointing to the struc{} member, for example, you wish to compare, and then rearranging the pointer in the pointer array such that reading the pointers in order at the end of the pointer sort yields your data in sorted order. Using this with memory-mapped data files is extremely memory efficient, fast, and fairly easy. All you need to do is add a few "*"s to your compare function/s.
Third, in contrast to a hashtable, which also must be a fixed size, and cannot be grown once filled, a RB tree will automagically grow itself and balance itself to maintain its O(log(n)) performance guarantee. Especially if the RB tree's key is an int, it can be faster than a hash, because even though a hashtable's complexity is O(1), that 1 can be a very expensive hash calculation. A tree's multiple 1-clock integer compares often outperform 100-clock+ hash calculations, to say nothing of rehashing, and malloc()ing space for hash collisions and rehashes. Finally, if you want ISAM access, as well as key access to your data, a hash is ruled out, as there is no ordering of the data inherent in the hashtable, in contrast to the natural ordering of data in any tree implementation. The classic use for a hash table is to provide keyed access to a table of reserved words for a compiler. It's memory efficiency is excellent.
Fourth, and very low on any list, is the linked, or doubly-linked list, which, in contrast to an array, naturally supports element insertions and deletions, and as that implies, resizing. It's the slowest of all the data structures, as each element only knows how to get to the next element, so you have to search, on average, (element_knt/2) links to find your datum. It is mostly used where insertions and deletions somewhere in the middle of the list are common, and especially, where the list is circular and feeds an expensive process which makes the time to read the links relatively small. My general RX is to use an arbitrarily large array instead of a linked list if your only requirement is that it be able to increase in size. If you run out of size with an array, you can realloc() a larger array. The STL does this for you "under the covers" when you use a vector. Crude, but potentially 1,000s of times faster if you don't need insertions, deletions or keyed lookups. It's memory efficiency is poor, especially for doubly-linked lists. In fact, a doubly-linked list, requiring two pointers, is exactly as memory inefficient as a red-black tree while having NONE of its appealing fast, ordered retrieval characteristics.
Fifth, trees support many additional operations on their sorted data than any other data structure. For example, many database queries make use of the fact that a range of leaf values can be easily specified by specifying their common parent, and then focusing subsequent processing on the part of the tree that parent "owns". The potential for multi-threading offered by this approach should be obvious, as only a small region of the tree needs to be locked - namely, only the nodes the parent owns, and the parent itself.
In short, trees are the Cadillac of data structures. You pay a high price in terms of memory used, but you get a completely self-maintaining data structure. This is why, as pointed out in other replies here, transaction databases use trees almost exclusively.
If you would like to see how a Red-Black tree is supposed to look graphically, I have coded an implementation of a Red-Black tree that you can download here
IME, almost no one understands the RB tree algorithm. People can repeat the rules back to you, but they don't understand why those rules and where they come from. I am no exception :-)
For this reason, I prefer the AVL algorithm, because it's easy to comprehend. Once you understand it, you can then code it up from scratch, because it make sense to you.
Trees can be fast. If you have a million nodes in a balanced binary tree, it takes twenty comparisons on average to find any one item. If you have a million nodes in a linked list, it takes five hundred thousands comparisons on average to find the same item.
If the tree is unbalanced, though, it can be just as slow as a list, and also take more memory to store. Imagine a tree where most nodes have a right child, but no left child; it is a list, but you still have to hold memory space to put in the left node if one shows up.
Anyways, the AVL tree was the first balanced binary tree algorithm, and the Wikipedia article on it is pretty clear. The Wikipedia article on red-black trees is clear as mud, honestly.
Beyond binary trees, B-Trees are trees where each node can have many values. B-Tree is not a binary tree, just happens to be the name of it. They're really useful for utilizing memory efficiently; each node of the tree can be sized to fit in one block of memory, so that you're not (slowly) going and finding tons of different things in memory that was paged to disk. Here's a phenomenal example of the B-Tree.

Resources