More specifically, are there any operations that can be performed more efficiently if using an AVL tree rather than a hash table?
I generally prefer AVL trees to hash tables. I know that the expected-time O(1) complexity of hash tables beats the guaranteed-time O(log n) complexity of AVL trees, but in practice constant factors make the two data structures generally competitive, and with an AVL tree there are no niggling worries about some unexpected data that triggers bad behavior. Also, I often find, at some point during the maintenance life of a program, in a situation not foreseen when the initial choice of a hash table seemed right, that I need the data in sorted order, so I end up rewriting the program to use an AVL tree instead of a hash table; do that enough times, and you learn that you may as well just start with AVL trees.
If your keys are strings, ternary search tries offer a reasonable alternative to AVL trees or hash tables.
An obvious difference, of course, is that with AVL trees (and other balanced trees), you can have persistence: you can insert/remove an element from the tree in O(log N) time and space and end up not just with the new tree, but also get to keep the old tree.
With a hash table, you generally cannot do that in less than O(N) time and space.
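To sketch that persistence idea, here is a minimal path-copying insert on a plain (unbalanced) BST in Python; a real persistent AVL tree would also rebalance, but the O(log n) path-copying pattern is the same:

```python
class Node:
    """Immutable BST node: insertion copies only the root-to-leaf path."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a NEW tree that shares all untouched subtrees with the old one."""
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: reuse the old tree as-is

old = None
for k in [5, 2, 8]:
    old = insert(old, k)
new = insert(old, 7)  # 'old' is still a valid, unchanged 3-element tree
```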
Another important difference is the operations needed on the keys: AVL trees need a <= comparison between keys, whereas hash tables need an = comparison as well as a hash function.
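To make those key requirements concrete, here is a hedged Python sketch (the Point class is made up for the example) of what each structure demands of a key type:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # An ordered tree (AVL, red-black, ...) only needs an ordering:
    def __lt__(self, other):
        return (self.x, self.y) < (other.x, other.y)

    # A hash table needs equality plus a hash consistent with it:
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))
```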
Related
I need an efficient data structure for storing values: I need to insert, and occasionally iterate over the values in order. However, keeping the order after every insert is not necessary, and I expect sorts to be much less frequent than inserts.
I was considering red-black trees, but I'm not sure how fast inserting a node into an RB tree is (compared to inserting it into a list, for example); however, getting the values out in sorted order is much more time-efficient with an RB tree.
What data structure is the most efficient for that?
Thanks for your time
I'm not sure how fast it is inserting a node in an RB tree (compared to inserting it in a list for example)
Insertion has this average time complexity:
RB tree (also worst case): O(log n)
Sorted list: O(n)
Unsorted list: O(1)
Traversing all values in sorted order:
RB tree: O(n)
Sorted list: O(n)
Unsorted list: O(n log n)
So for asymptotically increasing data sizes, insertion will eventually run faster on an RB tree than on a sorted list, although for small sizes the list can be faster (as it has less constant overhead). The actual tipping point will depend on implementation aspects, including the programming language and the structure of the values being compared. But insertion into an unsorted list will of course outperform both.
Sorting a list on demand, as is needed for an unsorted list, has a cost, but it is "only" O(n log n) compared to the O(n) traversal the other two offer. So if sorting doesn't have to happen that frequently, it may be a viable option. Again, the tipping point -- where the overall running time of a given mix of inserts and sorts beats the alternatives -- depends on implementation aspects.
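One way to find that tipping point for your own workload is simply to time both strategies. A minimal Python sketch; the input size and the one-sort-per-ten-percent-of-inserts mix are arbitrary assumptions:

```python
import bisect, random, timeit

def keep_sorted(values):
    lst = []
    for v in values:
        bisect.insort(lst, v)        # O(n) per insert (elements shift)
    return lst

def sort_on_demand(values, sorts=10):
    lst, step = [], len(values) // sorts
    for i, v in enumerate(values):
        lst.append(v)                # O(1) per insert
        if i % step == 0:
            lst.sort()               # O(n log n), but only occasionally
    return lst

data = [random.random() for _ in range(100_000)]
print(timeit.timeit(lambda: keep_sorted(data), number=1))
print(timeit.timeit(lambda: sort_on_demand(data), number=1))
```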
What data structure is the most efficient for that?
In practice I have found B+ trees to be a good choice for fast insertion. Just like RB trees they have O(log n) insertion time, but you can tune the data structure with varying block sizes, trying to find out which one works best for your actual case; this is not possible with RB trees. Also, B+ trees keep the values in a linked list of sorted blocks, so iteration in sorted order is trivial. Nothing much is going to beat the speed of that.
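To illustrate just the leaf level of that idea (no internal index nodes, so locating the right block is a linear scan here; the class name and default block size are made up):

```python
import bisect

class BlockedSortedList:
    """Sorted values held in a list of sorted blocks, like a B+ tree's leaves."""
    def __init__(self, block_size=128):
        self.block_size = block_size
        self.blocks = [[]]

    def insert(self, value):
        # Find the first block whose largest element is >= value; a real
        # B+ tree would use its index nodes to do this in O(log n).
        for i, block in enumerate(self.blocks):
            if not block or value <= block[-1] or i == len(self.blocks) - 1:
                bisect.insort(block, value)
                if len(block) > self.block_size:     # split an overfull block
                    half = len(block) // 2
                    self.blocks[i:i + 1] = [block[:half], block[half:]]
                return

    def __iter__(self):
        # Sorted iteration is trivial: walk the blocks in order.
        for block in self.blocks:
            yield from block
```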
Another interesting alternative is a skip list. It resembles a B+ tree a bit, but its operations are easier to implement. It uses a constant factor more memory (same complexity), since there are no blocks and there are more pointers.
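For comparison, here is a minimal skip-list insert in Python; MAX_LEVEL and the 1/2 promotion probability are the usual textbook choices, not requirements:

```python
import random

MAX_LEVEL = 16

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level          # one 'next' pointer per level

class SkipList:
    def __init__(self):
        self.head = SkipNode(None, MAX_LEVEL)  # sentinel, never compared
        self.level = 1

    def insert(self, key):
        update = [self.head] * MAX_LEVEL
        node = self.head
        # Descend from the top level, remembering where we turned down.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        # Random height: promote to each next level with probability 1/2.
        lvl = 1
        while lvl < MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl):                   # splice into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __iter__(self):                        # sorted iteration via level 0
        node = self.head.forward[0]
        while node:
            yield node.key
            node = node.forward[0]
```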
Which one will work the best depends on implementation/platform factors. In the end you'll want to implement some alternatives and compare them with benchmark tests.
I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?
One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider if you want to find out if your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for to each individual value until you either find the value in the array, or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. Just a quick caveat, however: it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.
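A minimal sketch of that insert-and-search pattern on a plain (unbalanced) BST in Python, purely for illustration:

```python
class Node:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(root, value):
    """Walk down as if searching for the value, then attach a new leaf."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root

def contains(root, value):
    while root is not None:
        if value == root.value:
            return True
        root = root.left if value < root.value else root.right
    return False

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
print(contains(root, 6), contains(root, 7))  # True False
```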
One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
.
/ \
. C
/ \
A B
This would be a tree for a file where the predominant letter was C (by using fewer bits for common letters, the file is shorter than it would be with a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
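A small sketch of that decoding loop for the tree above, with tuple-based nodes (just one possible representation):

```python
TREE = (('A', 'B'), 'C')  # the example tree: A=00, B=01, C=1

def decode(bits, tree=TREE):
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == '0' else node[1]  # 0 = left, 1 = right
        if isinstance(node, str):                  # reached a leaf
            out.append(node)
            node = tree                            # restart at the root
    return ''.join(out)

print(decode('00011'))  # -> 'ABC'
```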
The class of problems you use them for is those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can also use balanced trees, which make insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient, since you traverse the minimum possible number of nodes.
From Wikipedia:
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.
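As a hedged illustration of that augmentation idea: a BST whose nodes record subtree sizes can count the keys below a given value in O(height) steps (unbalanced here for brevity; a self-balancing version keeps the height at O(log n), and the count for a key range [a, b) is then count_less_than(b) - count_less_than(a)):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None
        self.size = 1                        # number of nodes in this subtree

def size(node):
    return node.size if node else 0

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def count_less_than(node, key):
    """Number of stored keys < key, using the size augmentation."""
    if node is None:
        return 0
    if key <= node.key:
        return count_less_than(node.left, key)
    return 1 + size(node.left) + count_less_than(node.right, key)

root = None
for k in [5, 1, 9, 3, 7]:
    root = insert(root, k)
print(count_less_than(root, 6))  # -> 3 (the keys 1, 3, 5)
```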
How are collisions handled in associative arrays implemented using a self-balancing tree? If two objects have the same hash, are they stored in a linked list attached to a tree node, or are two nodes created? If it's the former, then how is it O(log n), and if the latter, how can a binary search tree handle identical keys (hashes)?
Search trees can definitely not handle two nodes with the same key, so you do need to store the entries with colliding keys in a separate data structure (typically, as you say, a linked list attached to a tree node). You will indeed not have a worst-case complexity of O(log n), just as an associative array implemented as a hash table will not have worst-case O(1) operations.
As epitaph notes, one thing to try is increasing the length of your hash keys, so as to not get collisions. You can't guarantee that you won't, though, and do need to make some sort of provision for two objects with the same hash. If you choose your hashing algorithm properly, though, this should be a rare case, and your average time complexity for lookups will be O(log n), even though it can degrade to O(n) in the degenerate case of everything having the same hash key.
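A sketch of the scheme described above: a BST keyed by the hash value, with each node holding a small bucket of colliding (key, value) pairs. It is unbalanced and simplified; a real implementation would use a self-balancing tree:

```python
class Node:
    def __init__(self, h):
        self.h = h                   # the hash value used for ordering
        self.pairs = []              # (key, value) entries sharing this hash
        self.left = self.right = None

def put(node, h, key, value):
    if node is None:
        node = Node(h)
    if h < node.h:
        node.left = put(node.left, h, key, value)
    elif h > node.h:
        node.right = put(node.right, h, key, value)
    else:                            # same hash: linear scan of the bucket
        for i, (k, _) in enumerate(node.pairs):
            if k == key:
                node.pairs[i] = (key, value)
                break
        else:
            node.pairs.append((key, value))
    return node

def get(node, h, key):
    while node is not None:
        if h < node.h:
            node = node.left
        elif h > node.h:
            node = node.right
        else:
            return next((v for k, v in node.pairs if k == key), None)
    return None
```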
I will never be deleting from this data structure, but will be doing a huge number of lookups and insertions (~a trillion lookups and insertions). What is the best data structure for handling this?
Red-black and AVL trees seem decent, but are there any better suited for this situation?
A hash table would seem to be ideal if you are only doing insertions and lookup by exact key.
Try Splay trees if you are doing insertions, and find/find-next on ordered keys.
I assume that most of your operations are going to be lookups, or you're going to need an awful lot of memory.
I would choose a red-black tree or a hash table.
Operations on a red-black tree are O(log2(n)).
If implemented right, a hash table can achieve O(1 + k/n), where k is the number of stored keys and n the number of buckets; if implemented wrong, it can be as bad as O(k). If what you are trying to do is just to make it as fast as possible, I would go with the hash and do the extra work. Otherwise I would go with red-black: it is fairly simple, and you know your running time.
If all of the queries are successful (i.e., to elements that are actually stored in the table), then hashing is probably best, and you could experiment with various types of collision resolution, such as cuckoo hashing, which provides worst-case performance guarantees on lookups (see http://en.wikipedia.org/wiki/Hash_table).
If some queries are in between the stored keys, I would use van Emde Boas trees, y-fast trees, or fusion trees, which offer better performance than binary search trees (see http://courses.csail.mit.edu/6.851/spring10/scribe/lec09.pdf and http://courses.csail.mit.edu/6.851/spring10/scribe/lec10.pdf).
I've been able to find details on several self-balancing BSTs through several sources, but I haven't found any good descriptions detailing which one is best to use in different situations (or if it really doesn't matter).
I want a BST that is optimal for storing in excess of ten million nodes. The order of insertion of the nodes is basically random, and I will never need to delete nodes, so insertion time is the only thing that would need to be optimized.
I intend to use it to store previously visited game states in a puzzle game, so that I can quickly check if a previous configuration has already been encountered.
Red-black is better than AVL for insertion-heavy applications. If you foresee relatively uniform look-up, then Red-black is the way to go. If you foresee a relatively unbalanced look-up where more recently viewed elements are more likely to be viewed again, you want to use splay trees.
Why use a BST at all? From your description a dictionary will work just as well, if not better.
The only reason for using a BST would be if you wanted to list out the contents of the container in key order. It certainly doesn't sound like you want to do that, in which case go for the hash table. O(1) insertion and search, no worries about deletion, what could be better?
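For the asker's use case (checking whether a puzzle configuration has been seen before), that might look like the following in Python; the board encoding is an assumption:

```python
visited = set()                                  # hash table of seen states

def check_and_record(board):
    """Return True if this configuration was seen before, else record it."""
    state = tuple(tuple(row) for row in board)   # immutable, hashable encoding
    if state in visited:                         # expected O(1) lookup
        return True
    visited.add(state)                           # expected O(1) insertion
    return False

print(check_and_record([[1, 2], [3, 0]]))  # False (first time)
print(check_and_record([[1, 2], [3, 0]]))  # True  (seen before)
```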
The two self-balancing BSTs I'm most familiar with are red-black and AVL, so I can't say for certain if any other solutions are better, but as I recall, red-black has faster insertion and slower retrieval compared to AVL.
So if insertion is a higher priority than retrieval, red-black may be a better solution.
[hash tables have] O(1) insertion and search
I think this is wrong.
First of all, if you limit the keyspace to be finite, you could store the elements in an array and do an O(1) linear scan. Or you could shufflesort the array and then do a linear scan in O(1) expected time. When stuff is finite, stuff is easily O(1).
So let's say your hash table will store any arbitrary bit string; it doesn't much matter, as long as there's an infinite set of keys, each of which is finite. Then you have to read all the bits of any query and insertion input, or else I insert y0 into an empty hash table and query y1, where y0 and y1 differ at a single bit position which you don't look at.
But let's say the key lengths are not bounded by a parameter. If your insertion and search take O(1), then in particular hashing takes O(1) time, which means that you only produce a bounded amount of output from the hash function (which therefore has only finitely many possible outputs, granted).
This means that with finitely many buckets, there must be an infinite set of strings which all have the same hash value. Suppose I insert a lot, i.e. ω(1), of those, and start querying. This means that your hash table has to fall back on some other O(1) insertion/search mechanism to answer my queries. Which one, and why not just use that directly?
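A toy demonstration of that collision argument: with a hash function that has only finitely many outputs, an adversary can drive every key into one bucket, and the "O(1)" table degrades to a linear scan (the 16-bucket table and toy hash are arbitrary choices):

```python
def tiny_hash(s):
    return sum(map(ord, s)) % 16           # only 16 possible outputs

buckets = [[] for _ in range(16)]

def insert(key):
    buckets[tiny_hash(key)].append(key)

def lookup(key):
    return key in buckets[tiny_hash(key)]  # linear scan within the bucket

# Adversarial input: ord('a') % 16 == 1, so any string of 'a's whose length
# is 1 mod 16 hashes to bucket 1 -- every one of these keys collides.
for i in range(1000):
    insert('a' * (16 * i + 1))

print(max(len(b) for b in buckets))  # -> 1000: lookups now scan 1000 keys
```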