Fastest sorting in functional style

As an undergrad, I studied O(n log n) algorithms for sorting, and proofs that we cannot do better in the general case when all we can do is compare two numbers. That was for a random-access-memory model of computation.
I want to know whether such theoretical lower bounds exist for functional-style (referentially transparent) programs. Let us assume that each beta reduction counts as one step, and each comparison counts as one step.

So let us assume that you agree with the proof that we cannot do better than O(n*log n) in the general case where the only operation available is comparison.
The question, then, is whether we can show that the same bound is achievable without side effects.
One way to do that uses the idea that if you can build an immutable binary search tree in O(n*log n) and then traverse it in order (which can be done in O(n)), then you have a sorting algorithm.
If we can go through all the items and add each one to a balanced immutable (persistent) tree in O(log n), that gives us an O(n*log n) algorithm.
Can we add to a persistent binary tree in O(log n)? Sure, there are immutable variants of balanced binary search trees with O(log n) insert in every reasonable persistent data structure library.
To get an idea of why this is possible, imagine a standard balanced binary search tree, e.g. a red-black tree. You can make an immutable version of it by following the same algorithm as for the mutable one, except that whenever a pointer or color changes, you allocate a new node, and consequently new copies of all of its ancestors up to the root (transforming them too if necessary). Side branches that do not change get reused. There are at most O(log n) affected nodes, so at most O(log n) operations (including allocations) per insertion. If you know red-black trees, you can see that there are no other multipliers beyond constants (rotations can cause a few extra allocations for affected siblings, but that still remains a constant factor).
This - quite informal - demonstration should give you an idea that a proof of O(n*log n) sorting without side effects exists. There are, however, a few things I left out: e.g. allocation is considered to be O(1) here, which may not always be the case, but going into that would get too complex.
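To make the path-copying idea concrete, here is a minimal sketch in Python of a persistent binary search tree (unbalanced, for brevity - a real red-black variant would add the rebalancing logic): each insert allocates only the nodes along the search path and shares every untouched subtree, and an in-order traversal then yields the sorted sequence.

```python
from collections import namedtuple

# A persistent BST node: immutable, so subtrees can be shared freely.
Node = namedtuple("Node", ["key", "left", "right"])

def insert(node, key):
    """Return a new tree containing key; only the search path is copied."""
    if node is None:
        return Node(key, None, None)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)  # right subtree reused
    else:
        return Node(node.key, node.left, insert(node.right, key))  # left subtree reused

def in_order(node):
    """Yield keys in sorted order, O(n) over the whole tree."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)

def tree_sort(items):
    root = None
    for x in items:
        root = insert(root, x)      # O(log n) per insert if the tree stays balanced
    return list(in_order(root))

print(tree_sort([5, 3, 8, 1, 4]))   # [1, 3, 4, 5, 8]
```

Since this toy tree is never rebalanced, the O(log n) per-insert bound only holds for reasonably random input; the point is just to show the structural sharing.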

I think modern functional programming implementations (at least Clojure, since that's the only one I know) do have immutable memory, but that doesn't mean that altering lists and the like results in copying the whole original list. As such, I don't believe there are order-of-magnitude computational differences between implementing sorting algorithms with imperative or functional idioms.
More on how immutable lists can be modified without memory copying:
For an example of how this might work, consider this snippet from the Clojure reference at: http://clojure-doc.org/articles/tutorials/introduction.html
...In Clojure, all scalars and core data structures are like this. They
are values. They are immutable.
The map {:name "John" :hit-points 200 :super-power :resourcefulness} is a
value. If you want to "change" John's hit-points, you don't change
anything per se, but rather, you just conjure up a whole new hashmap
value.
But wait: If you've done any imperative style programming in C-like
languages, this sounds crazy wasteful. However, the yin to this
immutability yang is that --- behind the scenes --- Clojure shares
data structures. It keeps track of all their pieces and re-uses them
pervasively. For example, if you have a 1,000,000-item list and want
to tack on one more item, you just tell Clojure, "give me a new one
but with this item added" --- and Clojure dutifully gives you back a
1,000,001-item list in no time flat. Unbeknownst to you, it's re-using
the original list.
So why the fuss about immutability?
I don't know the complete history of functional programming, but it seems to me that the immutability characteristic of functional languages essentially abstracts away the intricacies of shared memory.
While this is cool, without a language-managed shared data structure underpinning the whole mechanism, it would be impractically slow for many use cases.
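A tiny illustration of that sharing, using plain Python tuples as immutable cons cells. This is only a toy model of the idea, not how Clojure actually implements its collections: prepending to a persistent list creates one new cell and reuses the entire old list as its tail.

```python
# A cons cell: (head, tail) where tail is another cell or None.
def cons(head, tail):
    return (head, tail)

def to_python_list(cell):
    out = []
    while cell is not None:
        out.append(cell[0])
        cell = cell[1]
    return out

old = cons(3, cons(2, cons(1, None)))   # the "million-item" list, in miniature
new = cons(4, old)                      # one new cell; 'old' is shared, not copied

print(to_python_list(old))   # [3, 2, 1]
print(to_python_list(new))   # [4, 3, 2, 1]
print(new[1] is old)         # True: the tail is literally the same object
```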

Related

Perfect List Structure?

Is it theoretically possible to have a data-structure that has
O(1) access, insertion, deletion times
and dynamic length?
I'm guessing one hasn't been invented yet, or we would entirely forego the use of arrays and linked lists (separately) and instead opt to use one of these.
Is there a proof that this cannot happen, and therefore some relationship between access time, insertion time and deletion time (like a conservation of energy) which says that if one of them becomes constant, the others have to be linear, or something along those lines?
No such data structure exists on current architectures.
Informal reasoning:
To get better than O(n) time for insertion/deletion, you need a tree data structure of some sort
To get O(1) random access, you can't afford to traverse a tree
The best you can do is get O(log n) for all these operations. That's a fairly good compromise, and there are plenty of data structures that achieve this (e.g. a Skip List).
You can also get "close to O(1)" by using trees with high branching factors. For example, Clojure's persistent data structures use 32-way trees, which gives you O(log32 n) operations. For practical purposes, that's fairly close to O(1) (i.e. for realistic sizes of n that you are likely to encounter in real-world collections).
If you are willing to settle for amortized constant time, it is called a hash table.
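To get a feel for how flat that log32 curve is, here is a quick back-of-the-envelope check in Python (the collection sizes are just illustrative):

```python
import math

# Depth of a 32-way tree versus a binary tree for a few collection sizes.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, math.ceil(math.log(n, 32)), math.ceil(math.log2(n)))
# 1000 -> 2 vs 10,  1000000 -> 4 vs 20,  1000000000 -> 6 vs 30
```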
The closest such data structure is a B+-tree, which can easily answer questions like "what is the kth item", but performs the requisite operations in O(log(n)) time. Notably, iteration (and access of close elements), especially with a cursor implementation, can be very close to array speeds.
Throw in an extra factor, C, as our "block size" (which should be a multiple of a cache line), and we can get something like insertion time ~ log_C(n) + log_2(C) + C. For C = 256 and 32-bit integers, log_C(n) = 3 implies our structure is 64GB. Beyond this point you're probably looking for a hybrid data structure and are more worried about network cache effects than local ones.
Let's enumerate your requirements instead of mentioning a single possible data structure first.
Basically, you want constant operation time for...
Access
If you know exactly where the entity that you're looking for is, this is easily accomplished. A hashed value or an indexed location is something that can be used to uniquely identify entities, and provide constant access time. The chief drawback with this approach is that you will not be able to have truly identical entities placed into the same data structure.
Insertion
If you can insert at the very end of a list without having to traverse it, then you can accomplish constant insertion time. The chief drawback with this approach is that you have to have a reference pointing to the end of your list at all times, which must be modified at update time (which, in theory, should be a constant-time operation as well). If you decide to hash every value for fast access later, then there's a cost for both calculating the hash and adding it to some backing structure for quick indexing.
Deletion Time
The main principle here is that there can't be too many moving parts; I'm deleting from a fixed, well-defined location. Something like a Stack, Queue, or Deque can provide that for the most part, in that they're deleting only one element, either in LIFO or FIFO order. The chief drawback with this approach is that you can't scan the collection to find any elements in it, since that would take O(n) time. If you were going about the route of using a hash, you could probably do it in O(1) time at the cost of some multiple of O(n) storage space (for the hashes).
Dynamic Length
If you're chaining references, then that shouldn't be such a big deal; LinkedList already has an internal Node class. The chief drawback to this approach is that your memory is not infinite. If you were going the hashing route, then the more stuff you have to hash, the higher the probability of a collision (which takes you out of O(1) time and puts you into amortized O(1) time at best).
By this, there's really no single, perfect data structure that gives you absolutely constant runtime performance with dynamic length. I'm also unsure of any value that would be provided by writing a proof for such a thing, since the general use of data structures is to make use of its positives and live with its negatives (in the case of hashed collections: love the access time, no duplicates is an ouchie).
Although, if you were willing to live with some amortized performance, a set is likely your best option.
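As a rough illustration of that last point, a hash-backed structure such as Python's dict already gives you dynamic length with average O(1) access, insertion and deletion by key, at the price of amortized rather than strictly constant behaviour, and no cheap ordered access:

```python
symbols = {}

symbols["x"] = 42        # insert: average O(1), amortized due to occasional resizing
symbols["y"] = 7
print(symbols["x"])      # access by key: average O(1)
del symbols["y"]         # delete by key: average O(1)

# What you give up: finding "the smallest key" or "the k-th element"
# still requires scanning or sorting, i.e. O(n) or worse.
print(min(symbols))      # O(n) scan
```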

Sorting in place

What is meant by to "sort in place"?
The idea of an in-place algorithm isn't unique to sorting, but sorting is probably the most important case, or at least the most well-known. The idea is about space efficiency - using the minimum amount of RAM, hard disk or other storage that you can get away with. This was especially relevant going back a few decades, when hardware was much more limited.
The idea is to produce an output in the same memory space that contains the input by successively transforming that data until the output is produced. This avoids the need to use twice the storage - one area for the input and an equal-sized area for the output.
Sorting is a fairly obvious case for this because sorting can be done by repeatedly exchanging items - sorting only re-arranges items. Exchanges aren't the only approach - the Insertion Sort, for example, uses a slightly different approach which is equivalent to doing a run of exchanges but faster.
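As a small sketch of what "sorting by exchanges" means, here is a selection sort in Python that rearranges the input purely by swapping elements, using only a couple of index variables as working storage (selection sort is chosen for brevity, not speed - it is O(n^2)):

```python
def selection_sort_in_place(a):
    """Sort list a in place using only element swaps and O(1) extra variables."""
    n = len(a)
    for i in range(n):
        # Find the index of the smallest remaining element...
        smallest = i
        for j in range(i + 1, n):
            if a[j] < a[smallest]:
                smallest = j
        # ...and exchange it into position i.
        a[i], a[smallest] = a[smallest], a[i]

data = [5, 2, 9, 1, 5]
selection_sort_in_place(data)
print(data)   # [1, 2, 5, 5, 9]
```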
Another example is matrix transposition - again, this can be implemented by exchanging items. Adding two very large numbers can also be done in-place (the result replacing one of the inputs) by starting at the least significant digit and propagating carries upwards.
Getting back to sorting, the advantages to re-arranging "in place" get even more obvious when you think of stacks of punched cards - it's preferable to avoid copying punched cards just to sort them.
Some algorithms for sorting allow this style of in-place operation whereas others don't.
However, all algorithms require some additional storage for working variables. If the goal is simply to produce the output by successively modifying the input, it's fairly easy to define algorithms that do that by reserving a huge chunk of memory, using that to produce some auxiliary data structure, then using that to guide those modifications. You're still producing the output by transforming the input "in place", but you're defeating the whole point of the exercise - you're not being space-efficient.
For that reason, the normal definition of an in-place algorithm requires that you achieve some standard of space efficiency. It's absolutely not acceptable to use extra space proportional to the input (that is, O(n) extra space) and still call your algorithm "in-place".
The Wikipedia page on in-place algorithms currently claims that an in-place algorithm can only use a constant amount - O(1) - of extra space.
In computer science, an in-place algorithm (or in Latin in situ) is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
There are some technicalities specified in the In Computational Complexity section, but the conclusion is still that e.g. Quicksort requires O(log n) space (true) and therefore is not in-place (which I believe is false).
O(log n) is much smaller than O(n) - for example the base 2 log of 16,777,216 is 24.
Quicksort and heapsort are both normally considered in-place, and heapsort can be implemented with O(1) extra space (I was mistaken about this earlier). Mergesort is more difficult to implement in-place, but the out-of-place version is very cache-friendly - I suspect real-world implementations accept the O(n) space overhead - RAM is cheap but memory bandwidth is a major bottleneck, so trading memory for cache-efficiency and speed is often a good deal.
[EDIT When I wrote the above, I assume I was thinking of in-place merge-sorting of an array. In-place merge-sorting of a linked list is very simple. The key difference is in the merge algorithm - doing a merge of two linked lists with no copying or reallocation is easy, doing the same with two sub-arrays of a larger array (and without O(n) auxiliary storage) AFAIK isn't.]
Quicksort is also cache-efficient, even in-place, but can be disqualified as an in-place algorithm by appealing to its worst-case behaviour. There is a degenerate case (in a non-randomized version, typically when the input is already sorted) where the run-time is O(n^2) rather than the expected O(n log n). In this case the extra space requirement is also increased to O(n). However, for large datasets and with some basic precautions (mainly randomized pivot selection) this worst-case behaviour becomes absurdly unlikely.
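For reference, a minimal in-place quicksort sketch in Python with randomized pivot selection: it partitions by swapping within the array, so the only extra space is the recursion stack, which is O(log n) in expectation. (A production version would also recurse into the smaller half first and loop over the larger one to cap the worst-case stack depth.)

```python
import random

def quicksort_in_place(a, lo=0, hi=None):
    """Sort a[lo:hi+1] in place; expected O(n log n) time, O(log n) stack."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    # Randomized pivot makes the O(n^2) degenerate case extremely unlikely.
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):          # Lomuto partition: everything < pivot goes left
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]        # put the pivot between the two partitions
    quicksort_in_place(a, lo, i - 1)
    quicksort_in_place(a, i + 1, hi)

data = [3, 7, 1, 9, 4, 4, 2]
quicksort_in_place(data)
print(data)   # [1, 2, 3, 4, 4, 7, 9]
```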
My personal view is that O(log n) extra space is acceptable for in-place algorithms - it's not cheating as it doesn't defeat the original point of working in-place.
However, my opinion is of course just my opinion.
One extra note - sometimes, people will call a function in-place simply because it has a single parameter for both the input and the output. It doesn't necessarily follow that the function was space efficient, that the result was produced by transforming the input, or even that the parameter still references the same area of memory. This usage isn't correct (or so the prescriptivists will claim), though it's common enough that it's best to be aware but not get stressed about it.
In-place sorting means sorting without any extra space requirement. According to the Wikipedia article, it says
an in-place algorithm is an algorithm which transforms input using a data structure with a small, constant amount of extra storage space.
Quicksort is one example of In-Place Sorting.
I don't think these terms are closely related:
Sort in place means to sort an existing list by modifying the element order directly within the list. The opposite is leaving the original list as is and create a new list with the elements in order.
Natural ordering is a term that describes how complete objects can somehow be ordered. You can for instance say that 0 is lower than 1 (natural ordering for integers) or that A is before B in alphabetical order (natural ordering for strings). You can hardly say, though, that Bob is greater or lower than Alice in general, as it heavily depends on specific attributes (alphabetically by name, by age, by income, ...). Therefore there is no natural ordering for people.
I'm not sure these concepts are similar enough to compare as suggested. Yes, they both involve sorting, but one is about a sort ordering that is humanly understandable (natural), while the other describes an algorithm that is memory-efficient because it overwrites the existing structure instead of using an additional data structure (like a bubble sort, for instance).
It can be done by using a swap function: instead of making a whole new structure, we rearrange the elements within the existing one. We often implement that algorithm without even knowing its name :D

Complexity of algorithms of different programming paradigms

I know that most programming languages are Turing complete, but I wonder whether a problem can be resolved with an algorithm of the same complexity with any programming language (and in particular with any programming paradigm).
To make my question more explicit with an example: is there any problem which can be solved with an imperative algorithm of complexity x (say, O(n)), but cannot be solved by a functional algorithm with the same complexity (or vice versa)?
Edit: The algorithm itself can be different. The question is about the complexity of solving the problem, using any approach available in the language.
In general, no, not all algorithms can be implemented with the same order of complexity in all languages. This can be trivially proven, for instance, with a hypothetical language that disallows O(1) access to an array. However, there aren't any algorithms (to my knowledge) that cannot be implemented with the optimal order of complexity in a functional language. The complexity analysis of an algorithm's pseudocode makes certain assumptions about what operations are legal, and what operations are O(1). If you break one of those assumptions, you can alter the complexity of the algorithm's implementation even though the language is Turing complete. Turing-completeness makes no guarantees regarding the complexity of any operation.
An algorithm has a stated runtime such as O(n), as you said; implementations of an algorithm must adhere to that same runtime or they do not implement the algorithm. The language or implementation does not, by definition, change the algorithm and thus does not change the asymptotic runtime.
That said, certain languages and technologies might make expressing the algorithm easier and offer constant-factor speedups (or slowdowns) due to how the language gets compiled or executed.
I think your first paragraph is wrong. And I think your edit doesn't change that.
Assuming you are requiring that the observed behaviour of an implementation conforms to the time complexity of the algorithm, then...
When calculating the complexity of an algorithm assumptions are made about what operations are constant time. These assumptions are where you're going to find your clues.
Some of the more common assumptions are things like constant time array access, function calls, and arithmetic operations.
If you cannot provide those operations in a language in constant time you cannot reproduce the algorithm in a way that preserves the time complexity.
Reasonable languages can break those assumptions, and sometimes have to if they want to deal with, say, immutable data structures with shared state, concurrency, etc.
For example, Clojure uses trees to represent Vectors. This means that access is not constant time (I think it's log32 of the size of the array, but that's not constant even though it might as well be).
You can easily imagine a language having to do complicated stuff at runtime when calling a function. For example, deciding which one was meant.
Once upon a time floating point and multi-word integer multiplication and division were sadly not constant time (they were implemented in software). There was a period during which languages transitioned to using hardware when very reasonable language implementations behaved very differently.
I'm also pretty sure you can come up with algorithms that fare very poorly in the world of immutable data structures. I've seen some optimisation algorithms that would be horribly difficult, maybe impossible or effectively so, to implement while dealing with immutability without breaking the time complexity.
For what it's worth, there are algorithms out there that assume set union and intersection are constant time... good luck implementing those algorithms in constant time. There are also algorithms that use an 'oracle' that can answer questions in constant time... good luck with those too.
I think that a language can have a different set of basic operations that cost O(1), for example mathematical operations (+, -, *, /), variable/array access (a[i]), function calls, and so on.
If a language does not have one of these O(1) operations (e.g. Brainfuck, which does not have O(1) array access), it cannot do everything C can do with the same complexity; but if a language has more O(1) operations (for example a language with O(1) array search), it can do more than C.
I think that all "serious" languages have the same basic O(1) operations, so they can solve problems with the same complexity.
If you consider Brainfuck or the Turing machine itself, there is one fundamental operation that takes O(n) time there, although in most other languages it can be done in O(1): indexing an array.
I'm not completely sure about this, but I think you can't have true arrays in functional programming either (i.e. with O(1) "get element at position" and O(1) "set element at position"). Because of immutability, you can either have a structure that can be changed quickly but takes time to access, or you will have to copy large parts of the structure on every change to get fast access. But I guess you could cheat around that using monads.
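A tiny Python illustration of that trade-off, treating a tuple as an immutable "functional array": reads are O(1), but a purely functional "set element at position" has to rebuild the whole tuple, i.e. O(n) per update. Persistent tree-based vectors (as in Clojure) bring the update cost down to O(log n), but then access is no longer strictly constant.

```python
arr = tuple(range(10))

# O(1) read, just like a normal array.
print(arr[3])

def functional_set(t, i, value):
    """Return a new tuple with index i replaced; copies all n elements."""
    return t[:i] + (value,) + t[i + 1:]

arr2 = functional_set(arr, 3, 99)
print(arr[3], arr2[3])   # 3 99  -- the original is untouched
```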
Looking at things like functional versus imperative, I doubt you'll find any real differences.
Looking at individual languages and implementations is a different story though. Ignoring, for the moment, the examples from Brainfuck and such, there are still some pretty decent examples to find.
I still remember one example from many years ago, writing APL (on a mainframe). The task was to find (and eliminate) duplicates in a sorted array of numbers. At the time, most of the programming I'd done was in Fortran, with a few bits and pieces in Pascal (still the latest and greatest thing at the time) or BASIC. I did what seemed obvious: wrote a loop that stepped through the array, comparing array[i] to array[i+1], and keeping track of a couple of indexes, copying each unique element back the appropriate number of places, depending on how many elements had already been eliminated.
While this would have worked quite well in the languages to which I was accustomed, it was barely short of a disaster in APL. The solution that worked a lot better was based more on what was easy in APL than computational complexity. Specifically, what you did was compare the first element of the array with the first element of the array after it had been "rotated" by one element. Then, you either kept the array as it was, or eliminated the last element. Repeat that until you'd gone through the whole array (as I recall, detected when the first element was smaller than the first element in the rotated array).
The difference was fairly simple: like most APL implementations (at least at the time), this one was a pure interpreter. A single operation (even one that was pretty complex) was generally pretty fast, but interpreting the input file took quite a bit of time. The improved version was much shorter and faster to interpret (e.g., APL provides the "rotate the array" thing as a single, primitive operation so that was only a character or two to interpret instead of a loop).
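A rough Python rendering of that APL-style idea (a sketch of the approach as described above, not the original APL code): compare the sorted array element-wise with a rotated copy of itself and keep only the positions where the two differ, so the duplicates are removed by whole-array operations rather than an explicit indexed loop.

```python
def dedup_sorted(a):
    """Remove duplicates from a sorted list by comparing it to a rotation of itself."""
    if not a:
        return []
    rotated = a[1:] + a[:1]                 # "rotate" by one element
    # Keep a[i] whenever it differs from its successor; the last element is
    # compared against the wrapped-around first element, so keep it explicitly.
    return [x for x, y in zip(a[:-1], rotated) if x != y] + [a[-1]]

print(dedup_sorted([1, 1, 2, 3, 3, 3, 7]))   # [1, 2, 3, 7]
```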

Using red black trees for sorting

The worst-case running time of insertion into a red-black tree is O(lg n), and an in-order walk on the tree visits each node, so the total worst-case runtime to print the sorted collection would be O(n lg n).
I am curious why red-black trees are not preferred for sorting over quicksort (whose average-case running time is O(n lg n)).
I see that maybe it's because red-black trees do not sort in place, but I am not sure, so maybe someone could help.
Knowing which sort algorithm performs better really depends on your data and situation.
If you are talking in general/practical terms,
Quicksort (the one where you select the pivot randomly/just pick one fixed, making worst case Omega(n^2)) might be better than Red-Black Trees because (not necessarily in order of importance)
Quicksort is in-place. That keeps your memory footprint low. Say this quicksort routine was part of a program which deals with a lot of data. If you kept using large amounts of memory, your OS could start swapping your process memory and trash your performance.
Quicksort memory accesses are localized. This plays well with the caching/swapping.
Quicksort can be easily parallelized (probably more relevant these days).
If you were to try and optimize binary tree sorting (using binary tree without balancing) by using an array instead, you will end up doing something like Quicksort!
Red-black trees have memory overheads. You have to allocate nodes, possibly multiple times, and your memory requirements with trees are double or triple those of arrays.
After sorting, say you wanted the 1045th (say) element: you would need to maintain order statistics in your tree (an extra memory cost because of this), and you would have O(log n) access time!
Red-black trees have overheads just to access the next element (pointer lookups)
Red-black trees do not play well with the cache and the pointer accesses could induce more swapping.
Rotations in red-black trees will increase the constant factor in the O(n log n).
Perhaps the most important reason (but not valid if you have lib etc available), Quicksort is very simple to understand and implement. Even a school kid can understand it!
I would say you try to measure both implementations and see what happens!
Also, Bob Sedgewick did a thesis on quicksort! Might be worth reading.
There are plenty of sorting algorithms which are worst case O(n log n) - for example, merge sort. The reason quicksort is preferred is because it is faster in practice, even though algorithmically it may not be as good as some other algorithms.
Often in-built sorts use a combination of various methods depending on the values of n.
There are many cases where red-black trees are not bad for sorting. My testing showed, compared to natural merge sort, that red-black trees excel where:
Trees are better for dups:
In all the tests where dups need to be eliminated, the tree algorithm is better. This is not astonishing, since the tree can be kept very small from the beginning, whereas algorithms designed for in-line array sorting might pass around larger segments for a longer time.
Trees are better for random data:
In all the tests with random data, the tree algorithm is better. This is also not astonishing, since in a tree the distance between elements is shorter and shifting is not necessary. So repeatedly inserting into a tree could need less effort than sorting an array.
So we get the impression that natural merge sort only excels in the ascending and descending special cases. Which can't even be said for quicksort.
Gist with the test cases here.
P.S.: it should be noted that using trees for sorting is non-trivial. One has not only to provide an insert routine but also a routine that can linearize the tree back into an array. We are currently using a get_last and a predecessor routine, which doesn't need a stack. But these routines are not O(1), since they contain loops.
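Those get_last/predecessor routines are specific to that implementation, but here is a sketch in Python of the same stack-free idea in its mirror-image form, assuming each node stores a parent pointer: start at the leftmost node and repeatedly step to the in-order successor. Each edge is traversed a constant number of times, so the whole linearization is O(n) even though the individual steps contain loops.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.left = None
        self.right = None
        self.parent = parent

def leftmost(node):
    while node.left is not None:
        node = node.left
    return node

def successor(node):
    """In-order successor using parent pointers, no stack needed."""
    if node.right is not None:
        return leftmost(node.right)
    while node.parent is not None and node.parent.right is node:
        node = node.parent
    return node.parent

def linearize(root):
    out = []
    node = leftmost(root)
    while node is not None:
        out.append(node.key)
        node = successor(node)
    return out

# Tiny hand-built tree:   2
#                        / \
#                       1   3
root = Node(2)
root.left = Node(1, parent=root)
root.right = Node(3, parent=root)
print(linearize(root))   # [1, 2, 3]
```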
Big-O time complexity measures do not usually take into account scalar factors, e.g., O(2n) and O(4n) are usually just reduced to O(n). Time complexity analysis is based on operational steps at an algorithmic level, not at a strict programming level, i.e., no source code or native machine instruction considerations.
Quicksort is generally faster than tree-based sorting since (1) the methods have the same algorithmic average time complexity, and (2) lookup and swapping operations require fewer program commands and data accesses when working with simple arrays than with red-black trees, even if the tree uses an underlying array-based implementation. Maintenance of the red-black tree constraints requires additional operational steps, data field value storage/access (node colors), etc than the simple array partition-exchange steps of a quicksort.
The net result is that red-black trees have higher scalar coefficients than quicksort does that are being obscured by the standard O(n log n) average time complexity analysis result.
Some other practical considerations related to machine architectures are briefly discussed in the Quicksort article on Wikipedia
Generally, representations of O(n lg n) algorithms can be expanded to A*n*lg(n) + B, where A and B are constants. There are many algorithmic proofs that show the coefficients for quicksort are smaller than those of other algorithms. That is in the best case (a naive quicksort performs horribly on already-sorted data).
Hi, the best way to explain the difference between all the sorting routines, in my opinion, is this.
(My answer is for people who are confused about how quicksort is faster in practice than other sorting algorithms.)
Imagine you are running on a very slow computer: one comparison operation takes 1 hour, and one shifting operation takes 2 hours.
(I am using hours just to make people understand how important time is.)
Of all the sorting routines, quicksort performs very few comparisons and very few swaps of elements.
Quicksort is faster for this main reason.

Binary Trees vs. Linked Lists vs. Hash Tables

I'm building a symbol table for a project I'm working on. I was wondering what people's opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching, and the most commonly recommended are binary trees, linked lists, or hash tables. What are the advantages and/or disadvantages of all of the above? (I'm working in C++.)
The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(log N)
lookups are O(log N)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average
Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the hash table is the most suitable structure to use, as it simply generates a hash of your key object and uses that to access the target data - it is O(1). The others are O(N) (a linked list of size N - you have to iterate through the list one item at a time, an average of N/2 times) and O(log N) (a binary tree - you halve the search space with each iteration, but only if the tree is balanced, so this depends on your implementation; an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the hash table for your data (re: Soraz's comment on this post). Most framework implementations (Java, .NET, etc.) will be of a quality such that you won't need to worry about the implementation.
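For a concrete feel of the three options, here is a small Python sketch of a string-keyed symbol table, with a dict standing in for the hash table, a plain unsorted list standing in for the linked list, and binary search over a sorted list standing in for the O(log N) search cost of a balanced tree (the symbol names are made up for illustration):

```python
import bisect

names = ["printf", "main", "argv", "errno", "malloc"]

# Hash table: average O(1) lookup.
table = {name: i for i, name in enumerate(names)}
print("malloc" in table)

# Unsorted (linked-)list style: O(N) scan.
print("malloc" in names)

# Balanced-tree-like O(log N) lookup, approximated here by binary search
# over a sorted list (a real tree would also give O(log N) inserts).
sorted_names = sorted(names)
i = bisect.bisect_left(sorted_names, "malloc")
print(i < len(sorted_names) and sorted_names[i] == "malloc")
```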
Did you do a course on data structures and algorithms at university?
What everybody seems to forget is that for small N, i.e. few symbols in your table, the linked list can be much faster than the hash table, although in theory its asymptotic complexity is indeed higher.
There is a famous quote from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithms for large N are not necessarily good for small N.
It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2 N) for a balanced binary tree. But lookups in a sorted list may be faster than in any of these other structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated node¹, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
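A small sketch of that build-then-sort pattern in Python, using bisect for the O(log2 N) lookups; the class and symbol names are made up for illustration:

```python
import bisect

class SortedSymbolTable:
    """Append all symbols first, sort once, then do binary-search lookups."""

    def __init__(self):
        self._entries = []   # list of (key, value) tuples

    def insert(self, key, value):
        self._entries.append((key, value))     # O(1) append; table not yet searchable

    def freeze(self):
        self._entries.sort()                   # one O(N log N) sort after all inserts

    def lookup(self, key):
        i = bisect.bisect_left(self._entries, (key,))
        if i < len(self._entries) and self._entries[i][0] == key:
            return self._entries[i][1]
        raise KeyError(key)

table = SortedSymbolTable()
for name, addr in [("main", 0x400000), ("printf", 0x401200), ("errno", 0x602010)]:
    table.insert(name, addr)
table.freeze()
print(hex(table.lookup("printf")))   # 0x401200
```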
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
¹ It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.
A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
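A minimal sketch of that growth behaviour: a toy chained hash table in Python that doubles its bucket count once the load factor passes a threshold, so no chain grows without bound. The 0.75 threshold is just a commonly used value, not anything mandated.

```python
class ChainedHashTable:
    """Toy chained hash table that grows when the load factor passes a threshold."""

    def __init__(self, buckets=8, max_load=0.75):
        self._buckets = [[] for _ in range(buckets)]
        self._size = 0
        self._max_load = max_load

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        chain = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)      # overwrite an existing entry
                return
        chain.append((key, value))
        self._size += 1
        if self._size / len(self._buckets) > self._max_load:
            self._grow()                     # keeps chains short: amortized O(1) puts

    def _grow(self):
        new_n = len(self._buckets) * 2
        new_buckets = [[] for _ in range(new_n)]
        for chain in self._buckets:
            for key, value in chain:
                new_buckets[hash(key) % new_n].append((key, value))
        self._buckets = new_buckets

    def get(self, key):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
for i in range(100):
    t.put(f"sym{i}", i)
print(t.get("sym42"), len(t._buckets))   # 42, with many more than the original 8 buckets
```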
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.
Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.
This question goes through the different containers in C#, but they are similar in any language you use.
Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly unbalanced as a consequence, and it'll behave the same as a linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.

Resources