Is there any map data structure that allows fast merging? - data-structures

Are there any map data structures that support insertion, deletion, access, and merging, all in O(log n) or better?
Most self-balancing binary search trees, such as AVL trees and red-black trees, give O(log n) insertion, deletion, and access, but I believe merging two of them is O(n log n). Are there any data structures with faster merging?
Edit: I have looked around, and I can't find anything like this. If there is no such data structure, I would love some insight into why this is not possible.

I'd take a look at splay trees. You'll probably end up paying the merging cost along the way, but you should be able to splice another tree in and defer that cost until later.

Do you need a tree for arbitrary key types that only have a comparison defined, or would it be OK if it only works with types that have a fixed-size binary representation (int, long, float, double, ...)? If the latter is the case, then a binary radix tree is a data structure that has very efficient merging (O(1) if you are lucky, O(N) worst case).
See Fast Mergeable Integer Maps by Chris Okasaki and Andrew Gill for details of the data structure.
The Scala Collections Library contains an implementation for ints and longs. All other Java primitive types can be translated to either ints or longs, e.g. by using java.lang.Double.doubleToLongBits for Double.
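To illustrate the translation step, here is a minimal Java sketch; the class and method names are mine, purely illustrative, not a real library API. It reinterprets the IEEE-754 bit pattern of a double or float so the value can serve as the key of a long- or int-keyed radix tree.

    // Hedged sketch: encoding primitive keys that are not ints or longs
    // so they fit a long- or int-keyed radix tree. Names are illustrative.
    public final class KeyEncoding {
        // Doubles: reinterpret the IEEE-754 bit pattern as a long.
        // doubleToLongBits (unlike doubleToRawLongBits) canonicalizes NaNs,
        // so all NaN keys collapse onto a single map entry.
        static long encode(double d) {
            return Double.doubleToLongBits(d);
        }

        // Floats go through the analogous 32-bit conversion.
        static int encode(float f) {
            return Float.floatToIntBits(f);
        }

        public static void main(String[] args) {
            System.out.println(encode(1.5));   // 4609434218613702656
            System.out.println(encode(-1.5));  // sign bit set: a negative long
        }
    }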

Related

Why don't we use AVL tree for hash table's item storage?

Recently, I was looking at hash tables that use chaining with linked lists, and it occurred to me that the "chain" could be an AVL tree instead.
Each bucket in the hash table would then hold the root pointer of a small AVL tree. Wikipedia says a hash table's worst case is O(n) (http://en.wikipedia.org/wiki/Hash_table). However, if we use an AVL tree as each bucket's "chain", we can bring that down to O(log n).
Am I missing something?
As far as I know we can replace a linked list with an AVL tree.
Wouldn't such an ADT be better than a single AVL tree or a hash table with linked-list chaining?
I searched the internet and could not find such an ADT.
This is discussed directly in the Wikipedia article you referenced:
Separate chaining with other structures
Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g., in a real-time application), or if one must guard against many entries hashed to the same slot (e.g., if one expects extremely non-uniform distributions, or in the case of web sites or other publicly accessible services, which are vulnerable to malicious key distributions in requests).
In Java, the standard HashMap uses red-black trees within buckets once a bucket's size exceeds the constant 8; buckets are converted back to singly-linked lists when they shrink below 6 entries. Apparently real-world tests showed that for smaller buckets, managing them as trees loses more, due to the general complexity of the tree structure and its extra memory footprint (tree entries must hold at least two references to other entries, while singly-linked entries hold only one), than is gained from the theoretically better asymptotic complexity.
I would also add that for best performance, a hash table should be configured so that most buckets have only one entry (i.e., they are not even lists, just sole entries), slightly fewer contain two entries, and only the occasional bucket has 3 or more. Holding 1-3 entries in a tree makes absolutely no sense compared to a simple linked list.
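To make the treeify idea concrete, here is a hedged Java sketch. This is not HashMap's actual internal code: the real HashMap implements its own red-black nodes (and also converts back to lists at 6 entries), whereas this sketch reuses java.util.TreeMap per bucket and skips resizing; the class name and fixed bucket count are mine.

    import java.util.*;

    // Hedged sketch of the "treeify" idea: each bucket starts as a linked
    // list and is swapped for a tree once it grows past a threshold.
    class TreeifyingMap<K extends Comparable<K>, V> {
        static final int TREEIFY_THRESHOLD = 8;   // same constant HashMap uses

        // Each bucket is either a small LinkedList of entries or a TreeMap.
        private final Object[] buckets = new Object[64];

        @SuppressWarnings("unchecked")
        public void put(K key, V value) {
            int i = (key.hashCode() & 0x7fffffff) % buckets.length;
            if (buckets[i] == null) {
                buckets[i] = new LinkedList<Map.Entry<K, V>>();
            }
            if (buckets[i] instanceof LinkedList) {
                LinkedList<Map.Entry<K, V>> list =
                        (LinkedList<Map.Entry<K, V>>) buckets[i];
                for (Map.Entry<K, V> e : list) {
                    if (e.getKey().equals(key)) { e.setValue(value); return; }
                }
                list.add(new AbstractMap.SimpleEntry<>(key, value));
                if (list.size() >= TREEIFY_THRESHOLD) {  // list too long: treeify
                    TreeMap<K, V> tree = new TreeMap<>();
                    for (Map.Entry<K, V> e : list) tree.put(e.getKey(), e.getValue());
                    buckets[i] = tree;                   // O(log n) from now on
                }
            } else {
                ((TreeMap<K, V>) buckets[i]).put(key, value);
            }
        }

        @SuppressWarnings("unchecked")
        public V get(K key) {
            int i = (key.hashCode() & 0x7fffffff) % buckets.length;
            Object b = buckets[i];
            if (b == null) return null;
            if (b instanceof LinkedList) {
                for (Map.Entry<K, V> e : (LinkedList<Map.Entry<K, V>>) b) {
                    if (e.getKey().equals(key)) return e.getValue();
                }
                return null;
            }
            return ((TreeMap<K, V>) b).get(key);
        }
    }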

When is an AVL tree better than a hash table?

More specifically, are there any operations that can be performed more efficiently if using an AVL tree rather than a hash table?
I generally prefer AVL trees to hash tables. I know that the expected-time O(1) complexity of hash tables beats the guaranteed O(log n) complexity of AVL trees, but in practice constant factors make the two data structures broadly competitive, and with AVL trees there are no niggling worries about some unexpected data evoking bad behavior. Also, I often find that sometime during the maintenance life of a program, in a situation not foreseen when the initial choice of a hash table seemed right, I need the data in sorted order, so I end up rewriting the program to use an AVL tree instead of a hash table; do that enough times, and you learn that you may as well just start with AVL trees.
If your keys are strings, ternary search tries offer a reasonable alternative to AVL trees or hash tables.
An obvious difference, of course, is that with AVL trees (and other balanced trees) you can have persistence: you can insert/remove an element from the tree in O(log N) space and time and end up with not just the new tree, but you also get to keep the old tree.
With a hash-table, you generally cannot do that in less than O(N) time-and-space.
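A minimal sketch of how that persistence works, assuming path copying over a plain (unbalanced, for brevity) BST; a persistent AVL tree would additionally rebalance the copied path:

    // Hedged sketch of persistence via path copying. Insertion allocates one
    // new node per level visited and shares every untouched subtree, so both
    // the old and the new version of the tree remain usable.
    final class PersistentBst {
        final int key;
        final PersistentBst left, right;

        PersistentBst(int key, PersistentBst left, PersistentBst right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }

        // Returns the root of a NEW tree; 'root' (the old version) is unchanged.
        static PersistentBst insert(PersistentBst root, int key) {
            if (root == null) return new PersistentBst(key, null, null);
            if (key < root.key)
                return new PersistentBst(root.key, insert(root.left, key), root.right);
            if (key > root.key)
                return new PersistentBst(root.key, root.left, insert(root.right, key));
            return root; // key already present: reuse the whole tree
        }

        public static void main(String[] args) {
            PersistentBst v1 = insert(insert(null, 5), 3);
            PersistentBst v2 = insert(v1, 7);        // v1 still holds {3, 5}
            System.out.println(v1.right == null);    // true: old version intact
            System.out.println(v2.right.key);        // 7
            System.out.println(v2.left == v1.left);  // true: subtree shared
        }
    }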
Another important difference is the operations needed on the keys: AVL trees need a <= comparison between keys, whereas hash tables need an = comparison as well as a hash function.

What class of problem would one use a binary search tree to solve?

I've seen this data structure talked about a lot, but I am unclear as to what sort of problem would demand such a data structure (over alternative representations). I've never needed one, but perhaps that's because I don't quite grok it. Can you enlighten me?
One example of where you would use a binary search tree would be a sorted list of values where you want to be able to quickly add elements.
Consider using an array for this purpose. You have very fast access to read random values, but if you want to add a new value, you have to find the place in the array where it belongs, shift everything over, and then insert the new value.
With a binary search tree, you simply traverse the tree looking for where the value would be if it were in the tree already, and then add it there.
Also, consider finding out whether your sorted array contains a particular value. You have to start at one end of the array and compare the value you're looking for against each individual value until you either find it or pass the point where it would have been. With a binary search tree, you greatly reduce the number of comparisons you are likely to have to make. One quick caveat, however: it is definitely possible to contrive situations where the binary search tree requires more comparisons, but these are the exception, not the rule.
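For the curious, here is a minimal Java sketch of that traversal for both insertion and lookup. It is deliberately unbalanced for brevity, so the expected O(log n) path length assumes a reasonably random insertion order.

    // Minimal sketch of the walk described above: to insert or search,
    // follow left/right comparisons down from the root.
    class Bst {
        int key;
        Bst left, right;

        Bst(int key) { this.key = key; }

        void insert(int k) {
            if (k < key) {
                if (left == null) left = new Bst(k); else left.insert(k);
            } else if (k > key) {
                if (right == null) right = new Bst(k); else right.insert(k);
            } // equal: already present, nothing to do
        }

        boolean contains(int k) {
            if (k == key) return true;
            Bst next = (k < key) ? left : right;
            return next != null && next.contains(k);
        }

        public static void main(String[] args) {
            Bst root = new Bst(50);
            for (int k : new int[]{30, 70, 20, 40, 60, 80}) root.insert(k);
            System.out.println(root.contains(40));  // true
            System.out.println(root.contains(55));  // false
        }
    }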
One thing I've used it for in the past is Huffman decoding (or any variable-bit-length scheme).
If you maintain your binary tree with the characters at the leaves, each incoming bit decides whether you move to the left or right node.
When you reach a leaf node, you have your decoded character and you can start on the next one.
For example, consider the following tree:
        .
       / \
      .   C
     / \
    A   B
This would be a tree for a file where the predominant letter was C (by using fewer bits for common letters, the file is shorter than it would be with a fixed-bit-length scheme). The codes for the individual letters are:
A: 00 (left, left).
B: 01 (left, right).
C: 1 (right).
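To make the decoding loop concrete, here is a hedged Java sketch of the scheme described above (class and method names are mine): each bit walks one edge, and reaching a leaf emits a character and restarts at the root.

    // Hedged sketch of the decoding loop: internal nodes have two children,
    // leaves carry a character. Bits come from a '0'/'1' string for clarity.
    class HuffmanNode {
        final Character symbol;          // null for internal nodes
        final HuffmanNode left, right;   // 0-bit -> left, 1-bit -> right

        HuffmanNode(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }
        HuffmanNode(HuffmanNode left, HuffmanNode right) {
            this.symbol = null;
            this.left = left;
            this.right = right;
        }

        static String decode(HuffmanNode root, String bits) {
            StringBuilder out = new StringBuilder();
            HuffmanNode node = root;
            for (char bit : bits.toCharArray()) {
                node = (bit == '0') ? node.left : node.right;
                if (node.symbol != null) {   // reached a leaf: emit and restart
                    out.append(node.symbol);
                    node = root;
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // The tree from the answer: A=00, B=01, C=1.
            HuffmanNode root = new HuffmanNode(
                    new HuffmanNode(new HuffmanNode('A'), new HuffmanNode('B')),
                    new HuffmanNode('C'));
            System.out.println(decode(root, "00011"));  // ABC
        }
    }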
The class of problems you use them for are those where you want to be able to both insert and access elements reasonably efficiently. As well as unbalanced trees (such as the Huffman example above), you can use balanced trees, which make insertions a little more costly (since you may have to rebalance on the fly) but make lookups a lot more efficient, since you traverse the minimum possible number of nodes.
From Wikipedia:
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed, asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment intersection problem and the point location problem efficiently. (For average-case performance, however, self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower than mergesort or quicksort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional information or perform new operations. For example, one can record the number of nodes in each subtree having a certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n) time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.
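The ordered-enumeration advantage mentioned in the quote is easy to see with Java's built-in maps; this small example just contrasts TreeMap (a red-black tree) with HashMap:

    import java.util.*;

    // TreeMap iterates in sorted key order; HashMap iterates in an order
    // determined by hashing, with no useful guarantee.
    public class OrderedEnumeration {
        public static void main(String[] args) {
            Map<String, Integer> tree = new TreeMap<>();
            Map<String, Integer> hash = new HashMap<>();
            for (String k : new String[]{"pear", "apple", "banana"}) {
                tree.put(k, k.length());
                hash.put(k, k.length());
            }
            System.out.println(tree.keySet()); // [apple, banana, pear] -- always sorted
            System.out.println(hash.keySet()); // some hash-dependent order
        }
    }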

What is the smartest data structure to use when doing several lookups and insertions, but no deletes?

I will never be deleting from this data structure, but will be doing a huge number of lookups and insertions (~a trillion lookups and insertions). What is the best data structure for handling this?
Red-black and AVL trees seem decent, but are there any better suited for this situation?
A hash table would seem to be ideal if you are only doing insertions and lookup by exact key.
Try Splay trees if you are doing insertions, and find/find-next on ordered keys.
I assume that most of your operations are going to be lookups, or you're going to need one heap of a lot of memory.
I would choose a red-black tree or a hash table.
Operations on a red-black tree are O(log n).
If implemented well, a hash table can achieve O(1 + k/n), with k keys spread over n buckets; if implemented badly, it can be as bad as O(k). If what you are trying to do is just make it as fast as possible, I would go with the hash table and do the extra work. Otherwise I would go with the red-black tree: it is fairly simple, and you know your running time.
If all of the queries are successful (i.e., to elements that are actually stored in the table), then hashing is probably best, and you could experiment with various types of collision resolution, such as cuckoo hashing, which provides worst-case performance guarantees on lookups (see http://en.wikipedia.org/wiki/Hash_table).
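Since the cuckoo-hashing guarantee is easy to show in code, here is a hedged Java sketch of the core idea; the table sizes, the second hash function, and the eviction bound are all illustrative, and a real implementation would rehash with fresh hash functions rather than throw.

    // Hedged sketch of cuckoo hashing: every key lives in one of exactly two
    // slots (one per hash function), so a lookup probes at most two positions
    // -- the worst-case O(1) guarantee. Insertion may evict the current
    // occupant and bounce it to its alternate slot.
    class CuckooSet {
        private static final int MAX_KICKS = 32;      // eviction-loop bound
        private final Integer[] t1 = new Integer[31];
        private final Integer[] t2 = new Integer[31];

        private int h1(int k) { return Math.floorMod(k, t1.length); }
        private int h2(int k) { return Math.floorMod(Integer.reverse(k), t2.length); }

        boolean contains(int k) {
            return Integer.valueOf(k).equals(t1[h1(k)])
                || Integer.valueOf(k).equals(t2[h2(k)]);
        }

        void insert(int k) {
            if (contains(k)) return;
            Integer cur = k;
            for (int i = 0; i < MAX_KICKS; i++) {
                Integer evicted = t1[h1(cur)];        // claim cur's slot in table 1
                t1[h1(cur)] = cur;
                if (evicted == null) return;
                Integer evicted2 = t2[h2(evicted)];   // bounce the evictee to table 2
                t2[h2(evicted)] = evicted;
                if (evicted2 == null) return;
                cur = evicted2;                       // and keep bouncing
            }
            throw new IllegalStateException("eviction cycle: rehash needed");
        }

        public static void main(String[] args) {
            CuckooSet s = new CuckooSet();
            for (int k = 1; k <= 20; k++) s.insert(k);
            System.out.println(s.contains(7));   // true
            System.out.println(s.contains(42));  // false
        }
    }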
If some queries are in between the stored keys, I would use van Emde Boas trees, y-fast trees, or fusion trees, which offer better performance than binary search trees (see http://courses.csail.mit.edu/6.851/spring10/scribe/lec09.pdf and http://courses.csail.mit.edu/6.851/spring10/scribe/lec10.pdf).

An Efficient data structure for Sorted List

I want to save my objects according to a key in the attributes of my object in a sorted fashion. Later on I'll access these objects sequentially from max key to min key. I'll do some search tasks as well.
I am considering either an AVL tree or a red-black tree. As far as I know, they are nearly equivalent in theory (both have O(log n) operations), but in practice, which one might perform better in my situation? And is there a better alternative to those, considering that I'll mostly insert into and sequentially access the data structure?
Edit: I'm going to use Java.
For what it's worth, in C#, SortedDictionary<K, V> is implemented as a red-black tree, and in many implementations of the STL in C++, std::map<K, T> is implemented as a red-black tree.
Also, from Wikipedia on AVL vs. red-black trees:
The AVL tree is another structure supporting O(log n) search, insertion, and removal. It is more rigidly balanced than red-black trees, leading to slower insertion and removal but faster retrieval. This makes it attractive for data structures that may be built once and loaded without reconstruction, such as language dictionaries (or program dictionaries, such as the order codes of an assembler or interpreter).
Whichever is easiest for you to implement; you won't get better than O(log n) insertion into a sorted list, and we'd probably need a lot more detail than you've provided to decide whether other factors make another structure more appropriate.
As you're doing it in Java, consider using a TreeSet (although it's a Set, so you can't have duplicate entries)...
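Building on that suggestion: if your objects are keyed, a TreeMap gives you exactly the max-to-min sequential access asked for, via descendingMap(); the key and value types below are just for illustration.

    import java.util.*;

    // TreeMap (a red-black tree) supports both the sequential max-to-min scan
    // and O(log n) search tasks the question describes.
    public class SortedAccess {
        public static void main(String[] args) {
            NavigableMap<Integer, String> byKey = new TreeMap<>();
            byKey.put(10, "ten");
            byKey.put(3, "three");
            byKey.put(7, "seven");

            // Sequential access from max key to min key, O(n) for the whole scan.
            for (Map.Entry<Integer, String> e : byKey.descendingMap().entrySet()) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
            // Search tasks are O(log n): exact, floor, and ceiling lookups.
            System.out.println(byKey.floorKey(8));  // 7
        }
    }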
