I'm watching university lectures on algorithms and it seems so many of them rely almost entirely binary search trees of some particular sort for querying/database/search tasks.
I don't understand this obsession with Binary Search Trees. It seems like in the vast majority of scenarios, a BSP could be replaced with a sorted array in the case of a static data, or a sorted bucketed list if insertions occur dynamically, and then a Binary Search could be employed over them.
With this approach, you get the same algorithmic complexity (for querying at least) as a BST, way better cache coherency, way less memory fragmentation (and less gc allocs depending on what language you're in), and are likely much simpler to write.
The fundamental issue is that BSP are completely memory naïve -- their focus is entirely on O(n) complexity and they ignore the very real performance considerations of memory fragmentation and cache coherency... Am I missing something?
Binary search trees (BST) are not totally equivalent to the proposed data structure. Their asymptotic complexity is better when it comes to both insert and remove sorted values dynamically (assuming they are balanced correctly). For example, when you when to build an index of the top-k values dynamically:
while end_of_stream(stream):
value <- stream.pop_value()
tree.insert(value)
tree.remove_max()
Sorted arrays are not efficient in this case because of the linear-time insertion. The complexity of bucketed lists is not better than plain list asymptotically and also suffer from a linear-time search. One can note that a heap can be used in this case, and in fact it is probably better to use a heap here, although they are not always interchangeable.
That being said, your are right : BST are slow, cause a lot of cache miss and fragmentation, etc. Thus, they are often replaced by more compact variants like B-trees. B-tree uses a sorted array index to reduce the amount of node jumps and make the data-structure much more compact. They can be mixed with some 4-byte pointer optimizations to make them even more compact. B-trees are to BST what bucketed linked-lists are to plain linked-lists. B-trees are very good for building dynamic database index of huge datasets stored on a slow storage device (because of the size): they enable applications to fetch values associated to a key using very few storage-device lookups (which as very slow on HDD for example). Another example of real-world use-case is interval-trees.
Note that memory fragmentation can be reduced using compaction methods. For BSTs/B-trees, one can reorder the root nodes like in a heap. However, compaction is not always easy to apply, especially on native languages with pointers like in C/C++ although some very clever methods exists to do so.
Keep in mind that B-trees shine only on big datasets (especially the ones that do not fit in cache). On relatively small ones, using just plain arrays or even sorted array is often a very good solution.
I know that the best-fit algorithm has to search through the whole list in order to find the best memory block which takes O(n), so I'm thinking about using a red-black tree to improve the runtime to O(logN). Will there be any cases that the red-black tree won't work for best-fit? If so, could someone provide me an example ? Thank you.
A red-black tree works fine, but it's not a really good choice.
Since it takes O(size) time to make good use of O(size) bytes, it's better to use a finger search tree of some kind that makes finding small blocks faster at the expense of taking more time to find large blocks: https://en.wikipedia.org/wiki/Finger_search_tree
Also, you probably just want to have separate free lists for some of the smallest possible sizes. You can expect to hit several cache lines when you search in and update a search tree. Each cache line is 64 bytes on x86, so just one cache line is enough to store the head pointers for 8 free lists. If you put the head pointers for free lists for the 8 smallest sizes in one cache line, then allocating blocks of those sizes will be much faster.
This also lets you allocate blocks that are smaller than a red-black tree node.
If you are to build a red-black tree to do what you are describing, there are better options.
Given n elements, the red-black tree construction time takes O(n log n). It's the same time from just sorting the array and using binary search, which is way simpler and very performant.
Obviously the tree would be better if elements were constantly added to the array.
I've been working on an implementation of a radix tree (for strings/char arrays), but I'm having somewhat of a dilemma figuring out how to store what tree nodes are children of a particular tree nodes.
I've seen linked list implementations used in Trie's (somewhat similar to radix trees) and possibly some radix trees (it's been a while since I've last researched this topic), but that seems like it'd perform very poorly especially if you have a set of data which contains lots of common prefixes.
Now I'm wondering would using another data structure (e.g. a Binary Search Tree) be a better design choice? I think I can see a very substantial speed improvement over a simple linked list (O(log(n)) vs. O(n)) when there is data with a large number of common prefixes, but could there be some substantial compromises to performance elsewhere?
In particular I'm worried about cases where there aren't a large number of common prefixes, or any other possible obstacles which may cause one to choose a linked list over a binary search tree.
Alternatively, is there a better (i.e. faster/uses less memory) method for storing the children nodes?
You want to look for a kart-trie. A kart-trie uses a BST like data structure with a simple hash. You can find a description here: http://code.dogmap.org/kart.
You could use a trie in place of a BST or list. For BST you'll have to compute a hash which could be as expensive as traversing the trie (I'm thinking of a trie with an array of pointers to children, where you use a character as an index). You'll end up with a trie of tries. A better solution could be to build a trie, and then compress the links that aren't branching.
I want to save my objects according to a key in the attributes of my object in a sorted fashion. Later on I'll access these objects sequentially from max key to min key. I'll do some search tasks as well.
I consider to use either AVL tree or RB Tree. As far as i know they are nearly equivalent in theory(Both have O(logn)). But in practice which one might be better in performance in my situation. And is there a better alternative than those, considering that I'll mostly do insert and sequentially access to the ds.
Edit: I'm going to use java
For what it's worth, in C#, SortedDictionary<K, V> is implemented as a red-black tree, and in many implementations of the STL in C++, std::map<K, T> is implemented as a red-black tree.
Also, from Wikipedia on AVL vs. red-black trees:
The AVL tree is another structure
supporting O(log n) search, insertion,
and removal. It is more rigidly
balanced than red-black trees, leading
to slower insertion and removal but
faster retrieval. This makes it
attractive for data structures that
may be built once and loaded without
reconstruction, such as language
dictionaries (or program dictionaries,
such as the order codes of an
assembler or interpreter).
Which ever is easiest for you to implement, you won't get better insertion than log(n) with a sorted list and we'd probably need a lot more detail than what you've provided to decide if there are other factors that make another structure more appropriate.
As you're doing it in Java, consider using a TreeSet (although it's a Set, so you can't have duplicate entries)...
I'm building a symbol table for a project I'm working on. I was wondering what peoples opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching and the most commonly recommended are binary trees or linked lists or hash tables. What are the advantages and or disadvantages of all of the above? (working in c++)
The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average
Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the HashTable for your data (R.e., Soraz's comment on this post). Most framework implementations (Java, .NET, etc) will be of a quality that you won't need to worry about the implementations.
Did you do a course on data structures and algorithms at university?
What everybody seems to forget is that for small Ns, IE few symbols in your table, the linked list can be much faster than the hash-table, although in theory its asymptotic complexity is indeed higher.
There is a famous qoute from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithm for large N's are not necessarily good for small Ns.
It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2N) for a balanced binary tree. But lookups in a sorted list may be faster than any of these others structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated1 node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.
A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.
Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.
This question goes through the different containers in C#, but they are similar in any language you use.
Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.