Related
I'm watching university lectures on algorithms and it seems so many of them rely almost entirely binary search trees of some particular sort for querying/database/search tasks.
I don't understand this obsession with Binary Search Trees. It seems like in the vast majority of scenarios, a BSP could be replaced with a sorted array in the case of a static data, or a sorted bucketed list if insertions occur dynamically, and then a Binary Search could be employed over them.
With this approach, you get the same algorithmic complexity (for querying at least) as a BST, way better cache coherency, way less memory fragmentation (and less gc allocs depending on what language you're in), and are likely much simpler to write.
The fundamental issue is that BSP are completely memory naïve -- their focus is entirely on O(n) complexity and they ignore the very real performance considerations of memory fragmentation and cache coherency... Am I missing something?
Binary search trees (BST) are not totally equivalent to the proposed data structure. Their asymptotic complexity is better when it comes to both insert and remove sorted values dynamically (assuming they are balanced correctly). For example, when you when to build an index of the top-k values dynamically:
while end_of_stream(stream):
value <- stream.pop_value()
tree.insert(value)
tree.remove_max()
Sorted arrays are not efficient in this case because of the linear-time insertion. The complexity of bucketed lists is not better than plain list asymptotically and also suffer from a linear-time search. One can note that a heap can be used in this case, and in fact it is probably better to use a heap here, although they are not always interchangeable.
That being said, your are right : BST are slow, cause a lot of cache miss and fragmentation, etc. Thus, they are often replaced by more compact variants like B-trees. B-tree uses a sorted array index to reduce the amount of node jumps and make the data-structure much more compact. They can be mixed with some 4-byte pointer optimizations to make them even more compact. B-trees are to BST what bucketed linked-lists are to plain linked-lists. B-trees are very good for building dynamic database index of huge datasets stored on a slow storage device (because of the size): they enable applications to fetch values associated to a key using very few storage-device lookups (which as very slow on HDD for example). Another example of real-world use-case is interval-trees.
Note that memory fragmentation can be reduced using compaction methods. For BSTs/B-trees, one can reorder the root nodes like in a heap. However, compaction is not always easy to apply, especially on native languages with pointers like in C/C++ although some very clever methods exists to do so.
Keep in mind that B-trees shine only on big datasets (especially the ones that do not fit in cache). On relatively small ones, using just plain arrays or even sorted array is often a very good solution.
This question was the last straw; and I've been wondering for a long time about it,
Why do people think about "Algorithms" and "Data structures" as about something that can be separated from each other?
I see a lot of evidence that they're separated in programmers' minds.
they request "Data Structures & Algorithms" books
they refer to "Data Structures" and "Algorithms" as separate university courses
they "know Algorithms", but are "weak in Data Structures" (can't find the link, sorry).
etc.
In my opinion "Data Structures" are algorithms, since the concept of "Data Structure" is about Algorithms to operate data that go in and out of the structures. But the opinion seems not a mainstream. What do I miss?
Edit: unfortunately, I did not formulate the question well. A separation of data structures and algorithms in programs people write is natural, since, well, the former is data, and the latter is functions (and in semi-functional frameworks like STL it's the core of the whole thing).
But the points above, and the question itself, refers to the way people think, to the way they arrange the knowledge in their heads. This doesn't have to even relate to the code writing.
Here are some links where people separate "algorithms" and "data structures" when they're the same thing:
Revisions: algorithm and data structure
They are different. Consider graphs, or trees to be more specific. Now, a tree appears to only be a tree. But you can browse it in preorder, inorder or postorder (3 algorithms for one structure).
You can have multiple or only 2 children for one node. The tree can be balanced (like AVL) or contain additional information (like B-tree indexes in data bases). That's different structures. But still you traverse them with the same algorithm.
See it now?
Another point: Algorithms sometimes are and sometimes are not independent from data structures. Certain algorithms have different complexity over different structures (finding paths in graph represented as list or a 2D table).
Algorithms and Data Structures are tightly wound together. Algorithm depends on data structures, if you change either of them, complexity will change considerably. They are not same, but are definitely two sides of the same coin. Selecting a good Data Structure is itself a path towards better algorithm.
For instance, Priority Queues can be implemented using binary heaps and binomial heaps, binary heaps allow peeking at highest priority element in constant time, whereas binomial heaps require O(log N) time for peeking.
So, a particular algorithm works best for that particular data-structure (in a particular context), hence Algorithms and Data Structures go hand-in-hand!
People refer to them as different entities because they are. Suppose I want to find an element from a set of data. If I put that data into an array, the array is a data-structure. Once it's in the array, I can use multiple different algorithms to find the element I'm interested in. I could sort the array (with any of multiple sorts) then use a binary search, I could just check each element linearly, etc. The choice of the array as the data structure I would use as opposed to say, a linked list, is not choosing an algorithm.
That said, it is important to understand one to understand the other. If you do not understand algorithms well then it is not obvious what the advantages and disadvantages of different data structures are, and vice versa. As such, it makes sense to teach them simultaneously. They are however different entities.
[Edit] Think about this: If you look at pseudo-code for most algorithms, a data structure isn't specified. You may have a "list" of elements to iterate through etc, but the exact implementation of that list is unimportant to the correctness of the algorithm.
I would say it's because functional programming separates what is operated on from the operations themselves. Targets and actions are certainly different, even if they're closely intertwined.
It was object-oriented programming that put data and operations into a single component. Perhaps if OO had come along earlier there would have been one discipline.
The way I see it is that algorithms are something that work with or on data structures, so there is a difference between the two. A simple data structure is an array, but there are a lot of algorithms that operate on simple arrays, so there has to be a way of separating the two. An array can also represent a tree, and trees are handled with specialized algorithms.
The difference isn't big, because you can't really have one without the other most of the times, but some times you can. Consider the trivial algorithm that determines whether a number is prime - it uses no data structures. Consider the GCD algorithm, also no data structures. You can talk about an algorithm without talking about data structures, but you can't talk about a data structure without talking about algorithms usually. You can talk about a tree, but you'll need algorithms for insertions, removals etc.
I think it's good that there is a distinction because they are, conceptually, different things. An algorithm is a set of steps used for accomplishing a task, while a data structure is something used to store data, the manipulation of said data is done with algorithms.
They are separate university courses. Typically, the data structures course emphasizes programming and is prerequisite to the algorithms course, which emphasizes mathematical analysis of algorithms. I don't think it's hard to see why many people with an undergraduate education in CS might think of them as separate.
I agree with you. Both are two sides of one and the same thing.
When talking about data structures, it's always about storing data in a way to optimize certain operations on this data, which leads us to algorithms and complexity.
The two are, of course, closely intertwined. This is why the posts you refer to requests books on both. Not always, though. The core of a sort algorithm, for example, is unchanged no matter what sort of data structure you're working on.
The title of the book Algorithm + Data Structures = Programs (1975) by none other than Niklaus Wirth suggests that both are essential in writing a program.
Ok, so this is something that's always bothered me. The tree data structures I know of are:
Unbalanced binary trees
AVL trees
Red-black trees
2-3 trees
B-trees
B*-trees
Tries
Heaps
How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?
Let’s pick them off one by one, shall we?
Unbalanced binary trees
For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.
Apart from that, unbalanced binary trees of course have other uses, but not as search trees.
AVL trees
They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.
Red-black trees
These are used inside most of C++’ std::map implemenations and probably in a few other standard libraries as well. However, there’s good evidence that they are actually worse than B(+) trees in every scenario due to caching behaviour of modern CPUs. Historically, when caching wasn’t as important (or as good), they surpassed B trees when used in main memory.
2-3 trees
B-trees
B*-trees
These require the most careful consideration of all the trees, since the different constants used are basically “magical” constans which relate in weird and sometimes unpredictable way to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.
I know of no good, general rule to distinguish between them.
Tries
Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).
Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.
Heaps
As you’ve already said, they are not search trees at all.
The same as any other data structure, you have to know the characteristics (complexity of search, insert, and delete operations) of each type of tree, and the requirements of the job you're selecting a tool for. The tree that has the best performance for the type of operations you'll do most often is usually the best tool for the job.
You can usually find the general characteristics for any kind of data structure on Wikipedia. Introduction to Algorithms also has at least a section (in some cases a whole chapter) on most of the data structures you've listed, so it's another good reference.
Similar question: When to choose RB tree, B-Tree or AVL tree?
Offhand, I'd say, write the simplest code that could possibly work (availing yourself of library-provided data structures if possible). Then measure its performance problems, if any.
If your performance needs are really extreme, read Konrad Rudolph's awesome answer. :)
Each of these has different complexity for insertion, deletion and retrieval, All have mostly O log(n) access times.
Each tree has specific characteristics which make them usefull in a certain way. You should compare there characteristics with the needs you have.
I'm building a symbol table for a project I'm working on. I was wondering what peoples opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching and the most commonly recommended are binary trees or linked lists or hash tables. What are the advantages and or disadvantages of all of the above? (working in c++)
The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average
Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the HashTable for your data (R.e., Soraz's comment on this post). Most framework implementations (Java, .NET, etc) will be of a quality that you won't need to worry about the implementations.
Did you do a course on data structures and algorithms at university?
What everybody seems to forget is that for small Ns, IE few symbols in your table, the linked list can be much faster than the hash-table, although in theory its asymptotic complexity is indeed higher.
There is a famous qoute from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithm for large N's are not necessarily good for small Ns.
It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2N) for a balanced binary tree. But lookups in a sorted list may be faster than any of these others structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated1 node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.
A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.
Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.
This question goes through the different containers in C#, but they are similar in any language you use.
Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I know that performance never is black and white, often one implementation is faster in case X and slower in case Y, etc. but in general - are B-trees faster then AVL or RedBlack-Trees? They are considerably more complex to implement then AVL trees (and maybe even RedBlack-trees?), but are they faster (does their complexity pay off) ?
Edit: I should also like to add that if they are faster then the equivalent AVL/RedBlack tree (in terms of nodes/content) - why are they faster?
Sean's post (the currently accepted one) contains several incorrect claims. Sorry Sean, I don't mean to be rude; I hope I can convince you that my statement is based in fact.
They're totally different in their use cases, so it's not possible to make a comparison.
They're both used for maintaining a set of totally ordered items with fast lookup, insertion and deletion. They have the same interface and the same intention.
RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. [...]
always O(log n)
B-trees are typically disk-based structures, and so are inherently slower than in-memory data.
Nonsense. When you store search trees on disk, you typically use B-trees. That much is true. When you store data on disk, it's slower to access than data in memory. But a red-black tree stored on disk is also slower than a red-black tree stored in memory.
You're comparing apples and oranges here. What is really interesting is a comparison of in-memory B-trees and in-memory red-black trees.
[As an aside: B-trees, as opposed to red-black trees, are theoretically efficient in the I/O-model. I have experimentally tested (and validated) the I/O-model for sorting; I'd expect it to work for B-trees as well.]
B-trees are rarely binary trees, the number of children a node can have is typically a large number.
To be clear, the size range of B-tree nodes is a parameter of the tree (in C++, you may want to use an integer value as a template parameter).
The management of the B-tree structure can be quite complicated when the data changes.
I remember them to be much simpler to understand (and implement) than red-black trees.
B-tree try to minimize the number of disk accesses so that data retrieval is reasonably deterministic.
That much is true.
It's not uncommon to see something like 4 B-tree access necessary to lookup a bit of data in a very database.
Got data?
In most cases I'd say that in-memory RB trees are faster.
Got data?
Because the lookup is binary it's very easy to find something. B-tree can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation.
The size of each node is a fixed parameter, so even if you do a linear scan, it's O(1). If we big-oh over the size of each node, note that you typically keep the array sorted so it's O(log n).
On a RB-tree it'd be O(logN) since you're doing one comparison and then branching.
You're comparing apples and oranges. The O(log n) is because the height of the tree is at most O(log n), just as it is for a B-tree.
Also, unless you play nasty allocation tricks with the red-black trees, it seems reasonable to conjecture that B-trees have better caching behavior (it accesses an array, not pointers strewn about all over the place, and has less allocation overhead increasing memory locality even more), which might help it in the speed race.
I can point to experimental evidence that B-trees (with size parameters 32 and 64, specifically) are very competitive with red-black trees for small sizes, and outperforms it hands down for even moderately large values of n. See http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html
B-trees are faster. Why? I conjecture that it's due to memory locality, better caching behavior and less pointer chasing (which are, if not the same things, overlapping to some degree).
Actually Wikipedia has a great article that shows every RB-Tree can easily be expressed as a B-Tree. Take the following tree as sample:
now just convert it to a B-Tree (to make this more obvious, nodes are still colored R/B, what you usually don't have in a B-Tree):
Same Tree as B-Tree
(cannot add the image here for some weird reason)
Same is true for any other RB-Tree. It's taken from this article:
http://en.wikipedia.org/wiki/Red-black_tree
To quote from this article:
The red-black tree is then
structurally equivalent to a B-tree of
order 4, with a minimum fill factor of
33% of values per cluster with a
maximum capacity of 3 values.
I found no data that one of both is significantly better than the other one. I guess one of both had already died out if that was the case. They are different regarding how much data they must store in memory and how complicated it is to add/remove nodes from the tree.
Update:
My personal tests suggest that B-Trees are better when searching for data, as they have better data locality and thus the CPU cache can do compares somewhat faster. The higher the order of a B-Tree (the order is the number of children a note can have), the faster the lookup will get. On the other hand, they have worse performance for adding and removing new entries the higher their order is. This is caused by the fact that adding a value within a node has linear complexity. As each node is a sorted array, you must move lots of elements around within that array when adding an element into the middle: all elements to the left of the new element must be moved one position to the left or all elements to the right of the new element must be moved one position to the right. If a value moves one node upwards during an insert (which happens frequently in a B-Tree), it leaves a hole which must be also be filled either by moving all elements from the left one position to the right or by moving all elements to the right one position to the left. These operations (in C usually performed by memmove) are in fact O(n). So the higher the order of the B-Tree, the faster the lookup but the slower the modification. On the other hand if you choose the order too low (e.g. 3), a B-Tree shows little advantages or disadvantages over other tree structures in practice (in such a case you can as well use something else). Thus I'd always create B-Trees with high orders (at least 4, 8 and up is fine).
File systems, which often base on B-Trees, use much higher orders (order 200 and even a lot more) - this is because they usually choose the order high enough so that a node (when containing maximum number of allowed elements) equals either the size of a sector on harddrive or of a cluster of the filesystem. This gives optimal performance (since a HD can only write a full sector at a time, even when just one byte is changed, the full sector is rewritten anyway) and optimal space utilization (as each data entry on drive equals at least the size of one cluster or is a multiple of the cluster sizes, no matter how big the data really is). Caused by the fact that the hardware sees data as sectors and the file system groups sectors to clusters, B-Trees can yield much better performance and space utilization for file systems than any other tree structure can; that's why they are so popular for file systems.
When your app is constantly updating the tree, adding or removing values from it, a RB-Tree or an AVL-Tree may show better performance on average compared to a B-Tree with high order. Somewhat worse for the lookups and they might also need more memory, but therefor modifications are usually fast. Actually RB-Trees are even faster for modifications than AVL-Trees, therefor AVL-Trees are a little bit faster for lookups as they are usually less deep.
So as usual it depends a lot what your app is doing. My recommendations are:
Lots of lookups, little modifications: B-Tree (with high order)
Lots of lookups, lots of modifiations: AVL-Tree
Little lookups, lots of modifications: RB-Tree
An alternative to all these trees are AA-Trees. As this PDF paper suggests, AA-Trees (which are in fact a sub-group of RB-Trees) are almost equal in performance to normal RB-Trees, but they are much easier to implement than RB-Trees, AVL-Trees, or B-Trees. Here is a full implementation, look how tiny it is (the main-function is not part of the implementation and half of the implementation lines are actually comments).
As the PDF paper shows, a Treap is also an interesting alternative to classic tree implementation. A Treap is also a binary tree, but one that doesn't try to enforce balancing. To avoid worst case scenarios that you may get in unbalanced binary trees (causing lookups to become O(n) instead of O(log n)), a Treap adds some randomness to the tree. Randomness cannot guarantee that the tree is well balanced, but it also makes it highly unlikely that the tree is extremely unbalanced.
Nothing prevents a B-Tree implementation that works only in memory. In fact, if key comparisons are cheap, in-memory B-Tree can be faster because its packing of multiple keys in one node will cause less cache misses during searches. See this link for performance comparisons. A quote: "The speed test results are interesting and show the B+ tree to be significantly faster for trees containing more than 16,000 items." (B+Tree is just a variation on B-Tree).
The question is old but I think it is still relevant. Jonas Kölker and Mecki gave very good answers but I don't think the answers cover the whole story. I would even argue that the whole discussion is missing the point :-).
What was said about B-Trees is true when entries are relatively small (integers, small strings/words, floats, etc). When entries are large (over 100B) the differences become smaller/insignificant.
Let me sum up the main points about B-Trees:
They are faster than any Binary Search Tree (BSTs) due to memory locality (resulting in less cache and TLB misses).
B-Trees are usually more space efficient if entries are relatively
small or if entries are of variable size. Free space management is
easier (you allocate larger chunks of memory) and the extra metadata
overhead per entry is lower. B-Trees will waste some space as nodes
are not always full, however, they still end up being more compact
that Binary Search Trees.
The big O performance ( O(logN) ) is the same for both. Moreover, if you do binary search inside each B-Tree node, you will even end up with the same number of comparisons as in a BST (it is a nice math exercise to verify this).
If the B-Tree node size is sensible (1-4x cache line size), linear searching inside each node is still faster because of
the hardware prefetching. You can also use SIMD instructions for
comparing basic data types (e.g. integers).
B-Trees are better suited for compression: there is more data per node to compress. In certain cases this can be a huge benefit.
Just think of an auto-incrementing key in a relational database table that is used to build an index. The lead nodes of a B-Tree contain consecutive integers that compress very, very well.
B-Trees are clearly much, much faster when stored on secondary storage (where you need to do block IO).
On paper, B-Trees have a lot of advantages and close to no disadvantages. So should one just use B-Trees for best performance?
The answer is usually NO -- if the tree fits in memory. In cases where performance is crucial you want a thread-safe tree-like data-structure (simply put, several threads can do more work than a single one). It is more problematic to make a B-Tree support concurrent accesses than to make a BST. The most straight-forward way to make a tree support concurrent accesses is to lock nodes as you are traversing/modifying them. In a B-Tree you lock more entries per node, resulting in more serialization points and more contended locks.
All tree versions (AVL, Red/Black, B-Tree, an others) have countless variants that differ in how they support concurrency. The vanilla algorithms that are taught in a university course or read from some introductory books are almost never used in practice. So, it is hard to say which tree performs best as there is no official agreement on the exact algorithms are behind each tree. I would suggest to think of the trees mentioned more like data-structure classes that obey certain tree-like invariants rather than precise data-structures.
Take for example the B-Tree. The vanilla B-Tree is almost never used in practice -- you cannot make it to scale well! The most common B-Tree variant used is the B+-Tree (widely used in file-systems, databases). The main differences between the B+-Tree and the B-Tree: 1) you dont store entries in the inner nodes of the tree (thus you don't need write locks high in the tree when modifying an entry stored in an inner node); 2) you have links between nodes at the same level (thus you do not have to lock the parent of a node when doing range searches).
I hope this helps.
Guys from Google recently released their implementation of STL containers, which is based on B-trees. They claim their version is faster and consume less memory compared to standard STL containers, implemented via red-black trees.
More details here
For some applications, B-trees are significantly faster than BSTs.
The trees you may find here:
http://freshmeat.net/projects/bps
are quite fast. They also use less memory than regular BST implementations, since they do not require the BST infrastructure of 2 or 3 pointers per node, plus some extra fields to keep the balancing information.
THey are sed in different circumstances - B-trees are used when the tree nodes need to be kept together in storage - typically because storage is a disk page and so re-balancing could be vey expensive. RB trees are used when you don't have this constraint. So B-trees will probably be faster if you want to implement (say) a relational database index, while RB trees will probably be fasterv for (say) an in memory search.
They all have the same asymptotic behavior, so the performance depends more on the implementation than which type of tree you are using.
Some combination of tree structures might actually be the fastest approach, where each node of a B-tree fits exactly into a cache-line and some sort of binary tree is used to search within each node. Managing the memory for the nodes yourself might also enable you to achieve even greater cache locality, but at a very high price.
Personally, I just use whatever is in the standard library for the language I am using, since it's a lot of work for a very small performance gain (if any).
On a theoretical note... RB-trees are actually very similar to B-trees, since they simulate the behavior of 2-3-4 trees. AA-trees are a similar structure, which simulates 2-3 trees instead.
moreover ...the height of a red black tree is O(log[2] N) whereas that of B-tree is O(log[q] N) where ceiling[N]<= q <= N . So if we consider comparisons in each key array of B-tree (that is fixed as mentioned above) then time complexity of B-tree <= time complexity of Red-black tree. (equal case for single record equal in size of a block size)