Best hash table for insertion and lookup only


I want to make a hash table that allows insertion and lookup only. Once something is in the table, it is in for good, unless you make a new hash table and refill the contents. Are there any algorithms or data structures better suited to this (compared with, say, a B-tree, RB-tree or LLRB-tree)? "Better" here means faster insertion and lookup times, easier sharding, or smaller overhead. Thanks

If you know (roughly) how many times each item will be looked up (and these frequencies are not uniform) then you could use Knuth's algorithm (free pdf) for finding an optimal tree that will make the more frequently-accessed items closer to the top (and so are faster to access). Each node in this tree would be a hash code (used for navigating the tree) and a pointer to the actual item. I don't know of any implementations of this algorithm, though...

Related

Best algorithm for sequential access of nodes

I want to know the best algorithm for creating a "sorted" list based on a key (ranging from 0 to 2^32) and traversing it in sorted order when needed, on an embedded device. I am aware of the following options:
Sorted linked list
As the number of nodes in the linked list increases, searching for the right node for insertion/update operations takes more time, O(n).
Hash
Might be the best choice at present, as long as the hashing logic does not produce collisions.
Table of size 2^32
Wastage of space.
Is there a better alternative suited to an embedded device?

There are many design choices to be weighed.
Generalities
Since you're working on an embedded device, it's reasonable to assume that you have limited memory. In such a situation, you'll probably want to choose memory-compact data structures over performant data structures.
Linked lists tend to scatter their contents across memory in a way which can make accesses slow, though this will depend somewhat on your architecture.
The Options You've Proposed
Sorted linked-list. This structure is already slow to access (O(n)), slow to construct (O(N²)), and slow to traverse (because a linked-list scatters memory, which reduces your ability to pre-fetch).
Hash table: This is a fast structure (O(1) access, O(N) construction). There are two problems, though. If you use open addressing, the table must be no more than about 70% full or performance will degrade. That means you'll be wasting some memory. Alternatively, you can use linked list buckets, but this has performance implications for traversal. I have an answer here which shows order-of-magnitude differences in traversal performance between a linked-list bucket design and open addressing for a hash table. More problematically, hash tables work by "randomly" distributing data across memory space. Getting an in-order traversal out of that will require an additional data structure of some sort.
Table of size 2^32. This solution wastes a significant amount of space. It also performs poorly since, I expect, most of the entries of this table will be empty, yet they must all be traversed.
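As a rough illustration of the open-addressing option discussed above, here is a minimal sketch of an insert/lookup-only table with linear probing. The class name, the starting capacity of 8, and the exact 0.7 threshold are assumptions made for the example (the threshold just mirrors the "about 70% full" rule of thumb), and Python is used only for brevity; an embedded implementation would look different.

```python
class LinearProbingTable:
    """Insert/lookup-only hash table: open addressing with linear probing."""

    MAX_LOAD = 0.7  # assumed threshold, from the "about 70% full" rule of thumb

    def __init__(self, capacity=8):
        self._slots = [None] * capacity  # each slot is None or a (key, value) pair
        self._count = 0

    def _find_slot(self, key):
        # Start at the home slot and walk forward until we hit the key or a gap.
        # The table is never allowed to fill up, so this always terminates.
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)
        return i

    def insert(self, key, value):
        if self._count + 1 > self.MAX_LOAD * len(self._slots):
            self._grow()
        i = self._find_slot(key)
        if self._slots[i] is None:
            self._count += 1
        self._slots[i] = (key, value)

    def lookup(self, key, default=None):
        entry = self._slots[self._find_slot(key)]
        return default if entry is None else entry[1]

    def _grow(self):
        # Double the array and reinsert everything: the O(N) step.
        old = self._slots
        self._slots = [None] * (2 * len(old))
        self._count = 0
        for entry in old:
            if entry is not None:
                self.insert(*entry)
```

Because there is no deletion, no tombstones are needed, which is part of what makes open addressing attractive for insert-only workloads. Note that this gives no sorted traversal by itself.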
An Alternative
Sort before use
If you do not need your list to always be sorted, I'd suggest adding new entries to an array and then sorting just prior to traversal. This gives you tight control over your memory layout, which is contiguous, so you'll get good memory performance. Insertion is quick: just throw your new data at the beginning or end of the array. Traversal, when it happens, will be fast because you just walk along the array. The only potentially slow bit is the sort.
You have several options for sort. You'll want to keep in mind that your array is either mostly sorted (only a few insertions between traversals) or mostly unsorted (many insertions between traversals). In the mostly-sorted case, insertion sort is a good choice. In the mostly-unsorted case, [quicksort](https://en.wikipedia.org/wiki/Quicksort) is solid. Both have the benefit of being in-place, which reduces memory consumption. Timsort balances these strategies.
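The "sort before use" idea can be sketched as follows. This is only an illustration: the class and method names are made up for the example, and it leans on Python's built-in `list.sort` (a Timsort-style adaptive sort) rather than a hand-written insertion sort or quicksort.

```python
class LazySortedList:
    """Append cheaply, sort only when an in-order traversal is requested."""

    def __init__(self):
        self._items = []      # contiguous storage
        self._dirty = False   # True when appends have happened since the last sort

    def add(self, key):
        # Insertion is just an append to the end of the array.
        self._items.append(key)
        self._dirty = True

    def traverse(self):
        # Pay the sorting cost once, right before walking the array.
        if self._dirty:
            self._items.sort()
            self._dirty = False
        return iter(self._items)
```

With only a few insertions between traversals the array is nearly sorted each time, which is the case an adaptive sort handles well.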

Why implement a Hashtable with a Binary Search Tree?

When implementing a Hashtable using an array, we inherit the constant time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree since it offers search with O(logn)? Why not just use a Binary Search Tree directly?

If the elements don't have a total order (i.e. "greater than" and "less than" are not defined for all pairs, or are not consistent between elements), you can't compare all pairs, thus you can't use a BST directly, but nothing's stopping you from indexing the BST by the hash value - since this is an integral value, it obviously has a total order (although you'd still need to resolve collisions, that is, have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked-lists for O(n) worst case performance. But we can also achieve O(log n) performance with BST's (as is done in Java's hash table implementation if the number of elements with some hash are above a threshold) - that is, have your hash table array where each element points to a BST where all elements have the same hash value.
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and would also require occasional resizing, but this is a simple memory copy as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
But, since I haven't personally ever heard of this ever being done, presumably the benefits are not worth the increased cost of operations from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).

Pros:
Potentially use less space b/c we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the expected O(1) for a chained hash table.

Since the requirements of a Hash Table are O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collision is an issue with the array implementation (well, not likely an issue), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want guaranteed O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure to guarantee logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(logn) insertion/deletion.

I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is a bit shift and addition or subtraction can be used to walk the tree instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover if the value exists in some other store. Or for a hash collision (or partial collision) search. I can't think of many other uses for it but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in cpu cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element
Walk Up (go to parent):
(i>>1)-((i+1)%2)
(Obviously you probably need to test if i is zero. The parentheses matter in C, where + and - bind tighter than the shift operators.)
Walk Left (down and left):
(i<<1)+2
(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)
Walk Right (down and right):
(i<<1)+1
As you can see, each walk is a short formula based on the index. A bit shift and addition for going left and right, and a bit shift, addition and modulus for ascending. Two instructions to move down, 4 to move up (in assembler, or as above in C and other HLL operator notation)
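For anyone who wants to check the walk formulas, here is a small sketch of them as functions. The function names are made up for the example, and the parentheses are written out explicitly because Python, like C, gives `+` and `-` higher precedence than the shift operators. It follows the convention above where the left child is at 2i+2 and the right child at 2i+1.

```python
def parent(i):
    """Index of the parent of node i (the root, index 0, has none)."""
    if i == 0:
        raise ValueError("the root has no parent")
    return (i >> 1) - ((i + 1) % 2)

def left(i):
    """Index of the left child of node i."""
    return (i << 1) + 2

def right(i):
    """Index of the right child of node i."""
    return (i << 1) + 1

# Walking down and then back up should return to the starting index.
for i in range(8):
    assert parent(left(i)) == i
    assert parent(right(i)) == i
```

The bounds check against the array size (so a walk does not run off the end) is left out here.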
edit:
I can see from further commentary that the benefit of slashing the insert time definitely would be of benefit. But I don't think that a conventional vector based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, when it is searched, naturally will travel in a linear fashion through the memory, which should help reduce cache misses and thus reduce the latency of the searches significantly, as well as the fact that there is a latency hit with memory in accessing randomly compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is my current state of a WiP to implement what I am calling a Bifurcation Array Search Tree. For the purpose of a fast insert/delete and not horribly slow search through a sorted collection of hashes, I think that this would be of quite large benefit for cases where there is a lot of data coming and going through the structure, or more to the point, beneficial for more realtime applications.

Perfect List Structure?

Is it theoretically possible to have a data-structure that has
O(1) access, insertion, deletion times
and dynamic length?
I'm guessing one hasn't yet been invented or we would entirely forego the use of arrays and linked lists (separately) and instead opt to use one of these.
Is there a proof that this cannot happen, and therefore some relationship between access time, insertion time and deletion time (like conservation of energy) which says that if one of the times becomes constant, another has to be linear or something along those lines?

No such data structure exists on current architectures.
Informal reasoning:
To get better than O(n) time for insertion/deletion, you need a tree data structure of some sort
To get O(1) random access, you can't afford to traverse a tree
The best you can do is get O(log n) for all these operations. That's a fairly good compromise, and there are plenty of data structures that achieve this (e.g. a Skip List).
You can also get "close to O(1)" by using trees with high branching factors. For example, Clojure's persistent data structures use 32-way trees, which gives you O(log32 n) operations. For practical purposes, that's fairly close to O(1) (i.e. for realistic sizes of n that you are likely to encounter in real-world collections).
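To put rough numbers on the "close to O(1)" claim, the sketch below counts how many levels an idealized, completely full tree needs to address n items. The function name is made up for the example, and this is plain arithmetic rather than a description of how Clojure actually lays out its structures.

```python
def tree_depth(n, branching=32):
    """Levels needed so that branching**levels >= n (idealized full tree)."""
    depth, capacity = 1, branching
    while capacity < n:
        capacity *= branching
        depth += 1
    return depth

print(tree_depth(1_000_000))      # 4 levels with 32-way branching
print(tree_depth(10**9))          # 6 levels with 32-way branching
print(tree_depth(10**9, 2))       # 30 levels for a binary tree
```

So even for a billion items the 32-way tree is only a handful of levels deep, compared with about 30 for a perfectly balanced binary tree.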

If you are willing to settle for amortized constant time, it is called a hash table.

The closest such datastructure is a B+-tree, which can easily answer questions like "what is the kth item", but performs the requisite operations in O(log(n)) time. Notably iteration (and access of close elements), especially with a cursor implementation, can be very close to array speeds.
Throw in an extra factor, C, as our "block size" (which should be a multiple of a cache line), and we can get something like insertion time ~ log_C(n) + log_2(C) + C. For C = 256 and 32-bit integers, log_C(n) = 3 implies our structure is 64GB. Beyond this point you're probably looking for a hybrid datastructure and are more worried about network cache effects than local ones.

Let's enumerate your requirements instead of mentioning a single possible data structure first.
Basically, you want constant operation time for...
Access
If you know exactly where the entity that you're looking for is, this is easily accomplished. A hashed value or an indexed location is something that can be used to uniquely identify entities, and provide constant access time. The chief drawback with this approach is that you will not be able to have truly identical entities placed into the same data structure.
Insertion
If you can insert at the very end of a list without having to traverse it, then you can accomplish constant insertion time. The chief drawback with this approach is that you have to have a reference pointing to the end of your list at all times, which must be modified at update time (which, in theory, should be a constant time operation as well). If you decide to hash every value for fast access later, then there's a cost for both calculating the hash and adding it to some backing structure for quick indexing.
Deletion Time
The main principle here is that there can't be too many moving parts; I'm deleting from a fixed, well-defined location. Something like a Stack, Queue, or Deque can provide that for the most part, in that they're deleting only one element, either in LIFO or FIFO order. The chief drawback with this approach is that you can't scan the collection to find any elements in it, since that would take O(n) time. If you were going about the route of using a hash, you could probably do it in O(1) time at the cost of some multiple of O(n) storage space (for the hashes).
Dynamic Length
If you're chaining references, then that shouldn't be such a big deal; LinkedList already has an internal Node class. The chief drawback to this approach is that your memory is not infinite. If you were going the approach of hashes, then the more stuff you have to hash, the higher of a probability of a collision (which does take you out of the O(1) time, and put you more into an amortized O(1) time).
By this, there's really no single, perfect data structure that gives you absolutely constant runtime performance with dynamic length. I'm also unsure of any value that would be provided by writing a proof for such a thing, since the general use of data structures is to make use of its positives and live with its negatives (in the case of hashed collections: love the access time, no duplicates is an ouchie).
Although, if you were willing to live with some amortized performance, a set is likely your best option.

Advantages of Binary Search Trees over Hash Tables

What are the advantages of binary search trees over hash tables?
Hash tables can look up any element in Theta(1) time and it is just as easy to add an element....but I'm not sure of the advantages going the other way around.

One advantage that no one else has pointed out is that binary search tree allows you to do range searches efficiently.
To illustrate the idea, take an extreme case. Say you want to get all the elements whose keys are between 0 and 5000, and there is actually only one such element, plus 10000 other elements whose keys are not in the range. A BST can do this range search quite efficiently, since it does not search a subtree that cannot contain the answer.
By contrast, how can you do range searches in a hash table? You either need to iterate over every bucket, which is O(n), or you have to check whether each of 1, 2, 3, 4... up to 5000 exists.
(And what if the keys between 0 and 5000 form an infinite set, for example if keys can be decimals?)
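Here is a small sketch of the difference. A sorted Python list with `bisect` stands in for the BST (an assumption made to keep the example short; a balanced tree would give the same O(log n + k) shape), and a plain `set` stands in for the hash table. The function names are made up for the example.

```python
import bisect

def range_query_sorted(sorted_keys, lo, hi):
    """All keys in [lo, hi]: two binary searches, then one contiguous slice."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]

def range_query_hashed(table, lo, hi):
    """All keys in [lo, hi] from a hash-based set: every key must be examined."""
    return sorted(k for k in table if lo <= k <= hi)

# One key inside the range, 10000 keys outside it.
keys = sorted([2500.5] + list(range(10_000, 20_000)))
print(range_query_sorted(keys, 0, 5000))        # [2500.5]
print(range_query_hashed(set(keys), 0, 5000))   # [2500.5], after scanning all keys
```

The key 2500.5 is deliberately not an integer: probing 1, 2, 3... up to 5000 in the hash table would never find it.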

Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.
For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.

One "advantage" of a binary tree is that it may be traversed to list off all elements in order. This is not impossible with a hash table, but it is not a normal operation one designs into a hashed structure.

In addition to all the other good comments:
Hash tables in general have better cache behavior requiring less memory reads compared to a binary tree. For a hash table you normally only incur a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something in the order of k * lg(n) memory reads for some constant k.
On the other hand, if an adversary knows your hash function, the adversary can force collisions in your hash table, greatly hampering its performance. The workaround is to choose the hash function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table's load grows too high, you typically enlarge and reallocate the hash table, which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.
Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.
Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).
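The workaround mentioned above, picking the hash function at random from a family, might look roughly like this for integer keys. This is a sketch of the classic ((a*x + b) mod p) mod m construction; the function name, the choice of prime, and the assumption that keys are non-negative integers smaller than the prime are all mine, not something from this answer.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to be integers in [0, P)

def make_hash(num_buckets, rng=random):
    """Pick one hash function at random from the family ((a*x + b) mod P) mod m."""
    a = rng.randrange(1, P)   # a must be non-zero
    b = rng.randrange(0, P)

    def h(key):
        return ((a * key + b) % P) % num_buckets

    return h

h = make_hash(16)
buckets = [h(k) for k in range(10)]  # bucket indices in range(16)
```

Since `a` and `b` are chosen at table creation time, someone who only knows the family (and not the chosen parameters) cannot precompute a set of keys that all collide.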

The main advantage of a binary tree over a hash table is that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table:
find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order
The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.

A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are amortized times: if you have an unsuitable hash function, you could end up degrading to linear time, rather than constant.

A binary tree is slower to search and insert into, but has the very nice feature of in-order traversal, which essentially means that you can iterate through the nodes of the tree in sorted order.
Iterating through the entries of a hash table just doesn't make a lot of sense because they are all scattered in memory.

A hashtable would take up more space when it is first created - it will have available slots for the elements that are yet to be inserted (whether or not they are ever inserted), a binary search tree will only be as big as it needs to be. Also, when a hash-table needs more room, expanding to another structure could be time-consuming, but that might depend on the implementation.

A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees share most of their nodes. You cannot do this with a standard hash table.

BSTs also provide the "findPredecessor" and "findSuccessor" operations (to find the next smallest and next largest elements) in O(log n) time, which can be very handy. A hash table can't provide these with that efficiency.
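To show what these two operations return, here is a sketch that uses a sorted list with `bisect` as a stand-in for a balanced BST (an assumption for brevity; the O(log n) search is the same idea). The function names are made up for the example.

```python
import bisect

def predecessor(sorted_keys, key):
    """Largest stored key strictly smaller than `key`, or None."""
    i = bisect.bisect_left(sorted_keys, key)
    return sorted_keys[i - 1] if i > 0 else None

def successor(sorted_keys, key):
    """Smallest stored key strictly larger than `key`, or None."""
    i = bisect.bisect_right(sorted_keys, key)
    return sorted_keys[i] if i < len(sorted_keys) else None

keys = [10, 20, 30]
print(predecessor(keys, 20), successor(keys, 20))  # 10 30
print(predecessor(keys, 25), successor(keys, 25))  # 20 30
```

Note that the query key does not have to be in the collection, which is exactly what a hash table cannot help with.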

From Cracking the Coding Interview, 6th Edition
We can implement the hash table with a balanced binary search tree (BST) . This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.

GCC C++ case study
Let's also get some insight from one of the most important implementations in the world. As we will see, it actually matches our theory perfectly!
As shown at What is the underlying data structure of a STL set in C++?, in GCC 6.4:
std::map uses BST
std::unordered_map uses hashmap
So this already points to the fact that you can't traverse a hashmap efficiently in sorted order, which is perhaps the main advantage of a BST.
And then, I also benchmarked insertion times in hash map vs BST vs heap at Heap vs Binary Search Tree (BST) which clearly highlights the key performance characteristics:
BST insertion is O(log n), hashmap is O(1). And in this particular implementation, hashmap is almost always faster than BST, even for relatively small sizes
hashmap, although much faster in general, has some extremely slow insertions visible as single points in the zoomed out plot.
These happen when the implementation decides that it is time to increase its size, and it needs to be copied over to a larger one.
In more precise terms, this is because only its amortized complexity is O(1), not the worst case, which is actually O(n) during the array copy.
This might make hashmaps inadequate for certain real-time applications, where you need stronger time guarantees.
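The shape of those spikes can be reproduced with a toy cost model. This is not the GCC implementation, just a sketch assuming a table that doubles its capacity when full and pays one unit of work per element moved; the function name and the starting capacity of 8 are made up for the example.

```python
def insertion_costs(num_inserts, initial_capacity=8):
    """Work done by each insertion into a table that doubles when it is full."""
    capacity, size, costs = initial_capacity, 0, []
    for _ in range(num_inserts):
        cost = 1                 # placing the new element
        if size == capacity:
            cost += size         # moving every existing element to the new array
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs

costs = insertion_costs(1000)
print(max(costs))                # 513: the single worst insertion
print(sum(costs) / len(costs))   # 2.016: the amortized cost per insertion
```

The average stays a small constant while the worst single insertion grows with the table size, which is the amortized-versus-worst-case distinction described above.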
Related:
Binary Trees vs. Linked Lists vs. Hash Tables
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees

If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).
This has the side-effect of not only slowing inserts, but it consumes a larger amount of memory than a b-tree.
Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.

It also depends on the use: a hash table lets you locate an exact match. If you want to query for a range, then a BST is the choice. Suppose you have a lot of data e1, e2, e3 ..... en.
With hash table you can locate any element in constant time.
If you want to find range values greater than e41 and less than e8, BST can quickly find that.
The key thing is the hash function used to avoid a collision. Of course, we cannot totally avoid a collision, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in worst cases.
Once full, hash table has to increase its bucket size and copy over all the elements again. This is an additional cost not present over BST.

Binary search trees are good choice to implement dictionary if the keys have some total order (keys are comparable) defined on them and you want to preserve the order information.
As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:
Maximum
Minimum
Successor
Predecessor
All these operations like every BST operation have time complexity of O(H). Additionally all the stored keys remain sorted in the BST thus enabling you to get the sorted sequence of keys just by traversing the tree in in-order.
In summary, if all you want are the operations insert, search and delete, then a hash table is unbeatable (most of the time) in performance. But if you want any or all of the operations listed above, you should use a BST, preferably a self-balancing BST.

A hashmap is a set associative array: your array of input values gets pooled into buckets. In an open addressing scheme, you have a pointer to a bucket, and each time you add a new value into a bucket, you look for a free space in it. There are a few ways to do this. You can start at the beginning of the bucket and increment the pointer each time, testing whether each slot is occupied. This is called linear probing. Alternatively, you can step away from the starting position by offsets that grow quadratically (1, 4, 9, ...) each time you search for a free space. This is called quadratic probing.
OK. Now the problems in both these methods is that if the bucket overflows into the next buckets address, then you need to-
Double each buckets size- malloc(N buckets)/change the hash function-
Time required: depends on malloc implementation
Transfer/Copy each of the earlier buckets data into the new buckets data. This is an O(N) operation where N represents the whole data
OK. but if you use a linkedlist there shouldn't be such a problem right? Yes, In linked lists you don't have this problem. Considering each bucket to begin with a linked list, and if you have 100 elements in a bucket it requires you to traverse those 100 elements to reach the end of the linkedlist hence the List.add(Element E) will take time to-
Hash the element to a bucket- Normal as in all implementations
Take time to find the last element in said bucket- O(N) operation.
The advantage of the linkedlist implementation is that you don't need the memory allocation operation and O(N) transfer/copy of all buckets as in the case of the open addressing implementation.
So, the way to minimize the O(N) operation is to convert the implementation to that of a Binary Search Tree, where find operations are O(log(N)) and you add the element in its position based on its value. The added feature of a BST is that it comes sorted!

Hash Tables are not good for indexing. When you are searching for a range, BSTs are better. That's the reason why most database indexes use B+ trees instead of Hash Tables

Binary search trees can be faster when used with string keys. Especially when strings are long.
Binary search trees use less/greater comparisons, which are fast for strings (when they are not equal), because a comparison can stop at the first differing character. So a BST can quickly answer when a string is not found.
When it's found, it will need to do only one full comparison.
In a hash table, you need to calculate the hash of the string, which means going through all of its bytes at least once. Then you go through them again to compare when a matching entry is found.

Binary Trees vs. Linked Lists vs. Hash Tables

I'm building a symbol table for a project I'm working on. I was wondering what people's opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching, and the most commonly recommended are binary trees, linked lists or hash tables. What are the advantages and disadvantages of each? (Working in C++.)

The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average

Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the HashTable for your data (re: Soraz's comment on this post). Most framework implementations (Java, .NET, etc.) will be of a quality that you won't need to worry about the implementations.
Did you do a course on data structures and algorithms at university?

What everybody seems to forget is that for small N, i.e. few symbols in your table, the linked list can be much faster than the hash table, although in theory its asymptotic complexity is indeed higher.
There is a famous quote from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithms for large N are not necessarily good for small N.

It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2 N) for a balanced binary tree. But lookups in a sorted list may be faster than in any of these other structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated [1] node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
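As a rough idea of what such a test could look like, here is a small Python timing harness. The three candidates (a built-in dict, a binary-searched sorted list, and a plain linear scan) and the function names are my own picks for illustration, not anything from the question. It assumes unique keys.

```python
import bisect
import timeit


def build_candidates(pairs):
    """Return {name: lookup function} for three illustrative structures."""
    as_dict = dict(pairs)
    ordered = sorted(pairs, key=lambda kv: kv[0])
    keys = [k for k, _ in ordered]
    values = [v for _, v in ordered]

    def dict_lookup(key):
        return as_dict.get(key)

    def sorted_lookup(key):
        i = bisect.bisect_left(keys, key)
        return values[i] if i < len(keys) and keys[i] == key else None

    def linear_lookup(key):
        for k, v in pairs:
            if k == key:
                return v
        return None

    return {
        "dict": dict_lookup,
        "sorted list": sorted_lookup,
        "linear scan": linear_lookup,
    }


def benchmark(pairs, probes, repeat=5, number=100):
    """Best-of-`repeat` time (seconds) for `number` passes over all probes."""
    results = {}
    for name, lookup in build_candidates(pairs).items():
        timer = timeit.Timer(lambda lookup=lookup: [lookup(p) for p in probes])
        results[name] = min(timer.repeat(repeat=repeat, number=number))
    return results


pairs = [("sym%d" % i, i) for i in range(500)]
probes = [k for k, _ in pairs[::7]] + ["missing"]
for name, seconds in benchmark(pairs, probes).items():
    print(name, seconds)
```

Swap in your real keys and a realistic mix of hits and misses; the numbers from synthetic data like the above don't tell you much.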
Note on separately allocated nodes: it's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would necessarily be larger than the key/value pairs.

I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to look items up in (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to implement and easy to build. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for them to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: to balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (e.g. C++'s STL), then you might as well go ahead and use the balanced tree.

A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
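To see the "in order means linked list" effect for yourself, here is a small Python sketch of a plain unbalanced binary search tree. Node, insert, height and build are names I made up for this illustration; insertion and the height measurement are iterative so that a degenerate tree doesn't hit the recursion limit.

```python
import random


class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced BST (no rebalancing); return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do


def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    deepest = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return deepest


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


keys = list(range(1000))
print(height(build(keys)))  # 1000: each node hangs off the previous one

random.shuffle(keys)
print(height(build(keys)))  # far smaller for a random insertion order
```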
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
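Checking a hash function against your expected inputs can be as simple as counting how many keys land in each bucket. The sketch below is in Python and uses a djb2-style string hash purely as a stand-in for whatever function you actually use. chain_lengths is a hypothetical helper, and the "sym0", "sym1", ... keys are made-up test data.

```python
def djb2(text):
    """djb2-style string hash, truncated to 32 bits (illustrative stand-in)."""
    h = 5381
    for ch in text:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def chain_lengths(keys, bucket_count, hash_fn=djb2):
    """Count how many keys fall into each bucket of a fixed-size table."""
    lengths = [0] * bucket_count
    for key in keys:
        lengths[hash_fn(key) % bucket_count] += 1
    return lengths


keys = ["sym%d" % i for i in range(1000)]

# A scaled-down version of "a million symbols into ten buckets":
# 1000 keys in 10 buckets average 1000 / 10 = 100 per chain,
# no matter how good the hash function is.
fixed = chain_lengths(keys, 10)
print(sum(fixed) / len(fixed))  # 100.0

# A really bad hash puts everything in one bucket, i.e. one long linked list.
print(max(chain_lengths(keys, 10, hash_fn=lambda s: 0)))  # 1000

# With a table sized for the input, look at the longest chain instead.
print(max(chain_lengths(keys, 2048)))
```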
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.

Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
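A minimal Python sketch of the first option follows. It uses an explicit stack of pending nodes rather than a queue, since that is the usual way to get the keys out in sorted order. Node and in_order are illustrative names. The stack never holds more nodes than the height of the tree, so the extra memory is small for a balanced tree and grows to the full tree size for a degenerate one.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def in_order(root):
    """Yield keys in sorted order using an explicit stack of pending nodes."""
    stack = []
    node = root
    while stack or node is not None:
        # Walk down the left spine, remembering every node passed.
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right


tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
print(list(in_order(tree)))  # [1, 2, 3, 4, 5, 6, 7]
```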
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
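To make the capacity point concrete, here is a toy open-addressing table in Python with linear probing and a fixed capacity. OpenTable and items_with_cost are invented for this sketch and are not how any particular library does it. Iteration has to inspect every slot, occupied or not.

```python
EMPTY = object()  # sentinel marking an unused slot


class OpenTable:
    """Toy fixed-capacity hash table: linear probing, no deletion, no resizing."""

    def __init__(self, capacity):
        self.slots = [EMPTY] * capacity
        self.count = 0

    def insert(self, key, value):
        n = len(self.slots)
        i = hash(key) % n
        for _ in range(n):  # probe each slot at most once
            slot = self.slots[i]
            if slot is EMPTY:
                self.slots[i] = (key, value)
                self.count += 1
                return
            if slot[0] == key:
                self.slots[i] = (key, value)  # overwrite existing key
                return
            i = (i + 1) % n
        raise OverflowError("table is full")

    def items_with_cost(self):
        """Return (items, slots_visited); needs only a loop index as state."""
        items, visited = [], 0
        for slot in self.slots:
            visited += 1
            if slot is not EMPTY:
                items.append(slot)
        return items, visited


table = OpenTable(capacity=100)
for k in range(10):
    table.insert(k, k * k)
items, visited = table.items_with_cost()
print(len(items), visited)  # 10 100
```

Ten stored elements, one hundred slots touched: that is the 10x from the paragraph above.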

This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table; I think that is more or less the default for this purpose.

This question goes through the different containers in C#, but they are similar in any language you use.

Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
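The "500 iterations" figure is easy to check with a few lines of Python. This assumes every item is distinct and equally likely to be looked up; the function names are just for illustration.

```python
def comparisons_to_find(items, target):
    """Number of elements a linear scan inspects before it stops."""
    for count, item in enumerate(items, start=1):
        if item == target:
            return count
    return len(items)  # not found: the whole list was scanned


def average_comparisons(items):
    """Average scan length when each stored item is looked up once."""
    total = sum(comparisons_to_find(items, x) for x in items)
    return total / len(items)


print(average_comparisons(list(range(1000))))  # 500.5
```

That is (1 + 2 + ... + 1000) / 1000 = 500.5, and a lookup for something that isn't in the list costs the full 1000.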
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.
