Perfect List Structure? - data-structures

Perfect List Structure? - data-structures

Is it theoretically possible to have a data-structure that has
O(1) access, insertion, deletion times
and dynamic length?
I'm guessing one hasn't yet been invented or we would entirely forego the use of arrays and linked lists (seperately) and instead opt to use one of these.
Is there a proof this cannot happen, and therefore some relationship between access-time, insertion-time and deletion-time (like conservation of energy) that suggests if one of the times becomes constant the other has to be linear or something along that.

No such data structure exists on current architectures.
Informal reasoning:
To get better than O(n) time for insertion/deletion, you need a tree data structure of some sort
To get O(1) random access, you can't afford to traverse a tree
The best you can do is get O(log n) for all these operations. That's a fairly good compromise, and there are plenty of data structures that achieve this (e.g. a Skip List).
You can also get "close to O(1)" by using trees with high branching factors. For example, Clojure's persistent data structure use 32-way trees, which gives you O(log32 n) operations. For practical purposes, that's fairly close to O(1) (i.e. for realistic sizes of n that you are likely to encounter in real-world collections)

If you are willing to settle for amortized constant time, it is called a hash table.

The closest such datastructure is a B+-tree, which can easily answer questions like "what is the kth item", but performs the requisite operations in O(log(n)) time. Notably iteration (and access of close elements), especially with a cursor implementation, can be very close to array speeds.
Throw in an extra factor, C, as our "block size" (which should be a multiple of a cache line), and we can get something like insertion time ~ log_C(n) + log_2(C) + C. For C = 256 and 32-bit integers, log_C(n) = 3 implies our structure is 64GB. Beyond this point you're probably looking for a hybrid datastructure and are more worried about network cache effects than local ones.

Let's enumerate your requirements instead of mentioning a single possible data structure first.
Basically, you want constant operation time for...
Access
If you know exactly where the entity that you're looking for is, this is easily accomplished. A hashed value or an indexed location is something that can be used to uniquely identify entities, and provide constant access time. The chief drawback with this approach is that you will not be able to have truly identical entities placed into the same data structure.
Insertion
If you can insert at the very end of a list without having to traverse it, then you can accomplish constant access time. The chief drawback with this approach is that you have to have a reference pointing to the end of your list at all times, which must be modified at update time (which, in theory, should be a constant time operation as well). If you decide to hash every value for fast access later, then there's a cost for both calculating the hash and adding it to some backing structure for quick indexing.
Deletion Time
The main principle here is that there can't be too many moving parts; I'm deleting from a fixed, well-defined location. Something like a Stack, Queue, or Deque can provide that for the most part, in that they're deleting only one element, either in LIFO or FIFO order. The chief drawback with this approach is that you can't scan the collection to find any elements in it, since that would take O(n) time. If you were going about the route of using a hash, you could probably do it in O(1) time at the cost of some multiple of O(n) storage space (for the hashes).
Dynamic Length
If you're chaining references, then that shouldn't be such a big deal; LinkedList already has an internal Node class. The chief drawback to this approach is that your memory is not infinite. If you were going the approach of hashes, then the more stuff you have to hash, the higher of a probability of a collision (which does take you out of the O(1) time, and put you more into an amortized O(1) time).
By this, there's really no single, perfect data structure that gives you absolutely constant runtime performance with dynamic length. I'm also unsure of any value that would be provided by writing a proof for such a thing, since the general use of data structures is to make use of its positives and live with its negatives (in the case of hashed collections: love the access time, no duplicates is an ouchie).
Although, if you were willing to live with some amortized performance, a set is likely your best option.

Related

Hashtable with chaining efficiency if linked list are sorted

I am currently working on an exercise from the CLRS, here is the problem:
11.2-3
Professor Marley hypothesizes that he can obtain substantial performance gains by
modifying the chaining scheme to keep each list in sorted order. How does the pro-
fessor’s modification affect the running time for successful searches, unsuccessful
searches, insertions, and deletions?
I saw on the internet that the answer is the following:
I do not understand why the result is like this, my answer is that since the linked list are sorted, then we can use a dichotomy to make the search so the excepted search time ( as well as the worst case time ) is θ(log2(a)) ( a being the load factor, n/m, n being the number of keys effectively stored in the table and m it's capacity ).
I am ok that deletion still take θ(1) time ( if lists are double chained ) and I said insertion will now take θ(log2(a)) because you need to determinate the correct place for the element you are adding to the list. Why is this not the correct answer?

A technical point: If you store the buckets as a linked list, then you can't use binary search to look over the items in that linked list in time O(log b), where b is the number of items in the bucket.
But let's suppose that instead of doing this you use dynamic arrays for each bucket. Then you could drop the runtimes down by a log factor, in an asymptotic sense. However, if you were to do that:
You're now using dynamic arrays rather than linked lists for your buckets. If you're storing large elements in your buckets, since most buckets won't be very loaded, the memory overhead of the unused slots in the array will start to add up.
From a practical perspective, you now need some way of comparing the elements you're hashing from lowest to highest. In Theoryland, that's not a problem. In practice, though, this could be a bit of a nuisance.
But more importantly, you might want to ask whether this is worthwhile in the first place. Remember that in a practical hash table the choice of α you'll be using is probably going to be very small (say, α ≤ 5 or something like that). For small α, a binary search might actually be slower than a linear scan, even if in theory for sufficiently large α it's faster.
So generally, you don't see this approach used in practice. If you're looking to speed up a hash table, it's probably better to change hashing strategies (say, use open addressing rather than chaining) or to try to squeeze performance out in other ways.

Why implement a Hashtable with a Binary Search Tree?

When implementing a Hashtable using an array, we inherit the constant time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree since it offers search with O(logn)? Why not just use a Binary Search Tree directly?

If the elements don't have a total order (i.e. the "greater than" and "less than" is not be defined for all pairs or it is not consistent between elements), you can't compare all pairs, thus you can't use a BST directly, but nothing's stopping you from indexing the BST by the hash value - since this is an integral value, it obviously has a total order (although you'd still need to resolve collision, that is have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked-lists for O(n) worst case performance. But we can also achieve O(log n) performance with BST's (as is done in Java's hash table implementation if the number of elements with some hash are above a threshold) - that is, have your hash table array where each element points to a BST where all elements have the same hash value.
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and this would also require the occasional resizing, but this is a simply memory copy as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
But, since I haven't personally ever heard of this ever being done, presumably the benefits are not worth the increased cost of operations from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).

Pros:
Potentially use less space b/c we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the guaranteed O(1) for a chained hash table.

Since the requirements of a Hash Table are O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collision is an issue with the array implementation (well, not likely an issue), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want guaranteed O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure to guarantee logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(logn) insertion/deletion.

I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is a bit shift and addition or subtraction can be used to walk the tree instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover if the value exists in some other store. Or for a hash collision (or partial collision) search. I can't think of many other uses for it but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in cpu cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element
Walk Up (go to parent):
i>>1-(i+1)%2
(Obviously you probably need to test if i is zero)
Walk Left (down and left):
i<<1+2
(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)
Walk Right (down and right):
i<<1+1
As you can see, each walk is a short formula based on the index. A bit shift and addition for going left and right, and a bit shift, addition and modulus for ascending. Two instructions to move down, 4 to move up (in assembler, or as above in C and other HLL operator notation)
edit:
I can see from further commentary that the benefit of slashing the insert time definitely would be of benefit. But I don't think that a conventional vector based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, when it is searched, naturally will travel in a linear fashion through the memory, which should help reduce cache misses and thus reduce the latency of the searches significantly, as well as the fact that there is a latency hit with memory in accessing randomly compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is my current state of a WiP to implement what I am calling a Bifurcation Array Search Tree. For the purpose of a fast insert/delete and not horribly slow search through a sorted collection of hashes, I think that this would be of quite large benefit for cases where there is a lot of data coming and going through the structure, or more to the point, beneficial for more realtime applications.

Is a data structure implementation with O(1) search possible without using arrays?

I am currently taking a university course in data structures, and this topic has been bothering me for a while now (this is not a homework assignment, just a purely theoretical question).
Let's assume you want to implement a dictionary. The dictionary should, of course, have a search function, accepting a key and returning a value.
Right now, I can only imagine 2 very general methods of implementing such a thing:
Using some kind of search tree, which would (always?) give an O(log n) worst case running time for finding the value by the key, or,
Hashing the key, which essentially returns a natural number which corresponds to an index in an array of values, giving an O(1) worst case running time.
Is O(1) worst case running time possible for a search function, without the use of arrays?
Is random access available only through the use of arrays?
Is it possible through the use of a pointer-based data structure (such as linked lists, search trees, etc.)?
Is it possible when making some specific assumptions, for example, the keys being in some order?
In other words, can you think of an implementation (if one is possible) for the search function and the dictionary that will receive any key in the dictionary and return its value in O(1) time, without using arrays for random access?

Here's another answer I made on that general subject.
Essentially, algorithms reach their results by processing a certain number of bits of information. The length of time they take depends on how quickly they can do that.
A decision point having only 2 branches cannot process more than 1 bit of information. However, a decision point having n branches can process up to log(n) bits (base 2).
The only mechanism I'm aware of, in computers, that can process more than 1 bit of information, in a single operation, is indexing, whether it is indexing an array or executing a jump table (which is indexing an array).

It is not the use of an array that makes the lookup O(1), it's the fact that the lookup time is not dependent upon the size of the data storage. Hence any method that accesses data directly, without a search proportional in some way to the data sotrage size, would be O(1).

you could have a hash implemented with a trie tree. The complexity is O(max(length(string))), if you have strings of limited size, then you could say it runs in O(1), it doesn't depend on the number of strings you have in the structure. http://en.wikipedia.org/wiki/Trie

Are there O(1) random access data structures that don't rely on contiguous storage?

The classic O(1) random access data structure is the array. But an array relies on the programming language being used supporting guaranteed continuous memory allocation (since the array relies on being able to take a simple offset of the base to find any element).
This means that the language must have semantics regarding whether or not memory is continuous, rather than leaving this as an implementation detail. Thus it could be desirable to have a data structure that has O(1) random access, yet doesn't rely on continuous storage.
Is there such a thing?

How about a trie where the length of keys is limited to some contant K (for example, 4 bytes so you can use 32-bit integers as indices). Then lookup time will be O(K), i.e. O(1) with non-contiguous memory. Seems reasonable to me.
Recalling our complexity classes, don't forget that every big-O has a constant factor, i.e. O(n) + C, This approach will certainly have a much larger C than a real array.
EDIT: Actually, now that I think about it, it's O(K*A) where A is the size of the "alphabet". Each node has to have a list of up to A child nodes, which will have to be a linked list keep the implementation non-contiguous. But A is still constant, so it's still O(1).

In practice, for small datasets using contiguous storage is not a problem, and for large datasets O(log(n)) is just as good as O(1); the constant factor is rather more important.
In fact, For REALLY large datasets, O(root3(n)) random access is the best you can get in a 3-dimensional physical universe.
Edit:
Assuming log10 and the O(log(n)) algorithm being twice as fast as the O(1) one at a million elements, it will take a trillion elements for them to become even, and a quintillion for the O(1) algorithm to become twice as fast - rather more than even the biggest databases on earth have.
All current and foreseeable storage technologies require a certain physical space (let's call it v) to store each element of data. In a 3-dimensional universe, this means for n elements there is a minimum distance of root3(n*v*3/4/pi) between at least some of the elements and the place that does the lookup, because that's the radius of a sphere of volume n*v. And then, the speed of light gives a physical lower boundary of root3(n*v*3/4/pi)/c for the access time to those elements - and that's O(root3(n)), no matter what fancy algorithm you use.

Apart from a hashtable, you can have a two-level array-of-arrays:
Store the first 10,000 element in the first sub-array
Store the next 10,000 element in the next sub-array
etc.

Thus it could be desirable to have a data structure that has O(1) random access, yet
doesn't rely on continuous storage.
Is there such a thing?
No, there is not. Sketch of proof:
If you have a limit on your continuous block size, then obviously you'll have to use indirection to get to your data items. Fixed depth of indirection with a limited block size gets you only a fixed-sized graph (although its size grows exponentially with the depth), so as your data set grows, the indirection depth will grow (only logarithmically, but not O(1)).

Hashtable?
Edit:
An array is O(1) lookup because a[i] is just syntactic sugar for *(a+i). In other words, to get O(1), you need either a direct pointer or an easily-calculated pointer to every element (along with a good-feeling that the memory you're about to lookup is for your program). In the absence of having a pointer to every element, it's not likely to have an easily-calculated pointer (and know the memory is reserved for you) without contiguous memory.
Of course, it's plausible (if terrible) to have a Hashtable implementation where each lookup's memory address is simply *(a + hash(i)) Not being done in an array, i.e. being dynamically created at the specified memory location if you have that sort of control.. the point is that the most efficient implementation is going to be an underlying array, but it's certainly plausible to take hits elsewhere to do a WTF implementation that still gets you constant-time lookup.
Edit2:
My point is that an array relies on contiguous memory because it's syntactic sugar, but a Hashtable chooses an array because it's the best implementation method, not because it's required. Of course I must be reading the DailyWTF too much, since I'm imagining overloading C++'s array-index operator to also do it without contiguous memory in the same fashion..

Aside from the obvious nested structures to finite depth noted by others, I'm not aware of a data structure with the properties you describe. I share others' opinions that with a well-designed logarithmic data structure, you can have non-contiguous memory with fast access times to any data that will fit in main memory.
I am aware of an interesting and closely related data structure:
Cedar ropes are immutable strings that provide logarithmic rather than constant-time access, but they do provide a constant-time concatenation operation and efficient insertion of characters. The paper is copyrighted but there is a Wikipedia explanation.
This data structure is efficient enough that you can represent the entire contents of a large file using it, and the implementation is clever enough to keep bits on disk unless you need them.

Surely what you're talking about here is not contiguous memory storage as such, but more the ability to index a containing data structure. It is common to internally implement a dynamic array or list as an array of pointers with the actual content of each element elsewhere in memory. There are a number of reasons for doing this - not least that it enables each entry to be a different size. As others have pointed out, most hashtable implementations also rely on indexing too. I can't think of a way to implement an O(1) algorithm that doesn't rely on indexing, but this implies contiguous memory for the index at least.

Distributed hash maps have such a property. Well, actually, not quite, basically a hash function tells you what storage bucket to look in, in there you'll probably need to rely on traditional hash maps. It doesn't completely cover your requirements, as the list containing the storage areas / nodes (in a distributed scenario), again, is usually a hash map (essentially making it a hash table of hash tables), although you could use some other algorithm, e.g. if the number of storage areas is known.
EDIT:
Forgot a little tid-bit, you'd probably want to use different hash functions for the different levels, otherwise you'll end up with a lot of similar hash-values within each storage area.

A bit of a curiosity: the hash trie saves space by interleaving in memory the key-arrays of trie nodes that happen not to collide. That is, if node 1 has keys A,B,D while node 2 has keys C,X,Y,Z, for example, then you can use the same contiguous storage for both nodes at once. It's generalized to different offsets and an arbitrary number of nodes; Knuth used this in his most-common-words program in Literate Programming.
So this gives O(1) access to the keys of any given node, without reserving contiguous storage to it, albeit using contiguous storage for all nodes collectively.

It's possible to allocate a memory block not for the whole data, but only for a reference array to pieces of data. This brings dramatic increase decrease in length of necessary contiguous memory.
Another option, If the elements can be identified with keys and these keys can be uniquely mapped to the available memory locations, it is possible not to place all the objects contiguously, leaving spaces between them. This requires control over the memory allocation so you can still distribute free memory and relocate 2nd-priroty objects to somewhere else when you have to use that memory location for your 1st-priority object. They would still be contiguous in a sub-dimension, though.
Can I name a common data structure which answers your question? No.

Some pseudo O(1) answers-
A VList is O(1) access (on average), and doesn't require that the whole of the data is contiguous, though it does require contiguous storage in small blocks. Other data structures based on numerical representations are also amortized O(1).
A numerical representation applies the same 'cheat' that a radix sort does, yielding an O(k) access structure - if there is another upper bound of the index, such as it being a 64 bit int, then a binary tree where each level correspond to a bit in the index takes a constant time. Of course, that constant k is greater than lnN for any N which can be used with the structure, so it's not likely to be a performance improvement (radix sort can get performance improvements if k is only a little greater than lnN and the implementation of the radix sort runs better exploits the platform).
If you use the same representation of a binary tree that is common in heap implementations, you end up back at an array.

Binary Trees vs. Linked Lists vs. Hash Tables

I'm building a symbol table for a project I'm working on. I was wondering what peoples opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching and the most commonly recommended are binary trees or linked lists or hash tables. What are the advantages and or disadvantages of all of the above? (working in c++)

The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average

Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the HashTable for your data (R.e., Soraz's comment on this post). Most framework implementations (Java, .NET, etc) will be of a quality that you won't need to worry about the implementations.
Did you do a course on data structures and algorithms at university?

What everybody seems to forget is that for small Ns, IE few symbols in your table, the linked list can be much faster than the hash-table, although in theory its asymptotic complexity is indeed higher.
There is a famous qoute from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithm for large N's are not necessarily good for small Ns.

It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2N) for a balanced binary tree. But lookups in a sorted list may be faster than any of these others structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated1 node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.

I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.

A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.

Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!

This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.

This question goes through the different containers in C#, but they are similar in any language you use.

Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio