Related
Are there are any cases where you would prefer O(log n) time complexity to O(1) time complexity? Or O(n) to O(log n)?
Do you have any examples?
There can be many reasons to prefer an algorithm with higher big O time complexity over the lower one:
most of the time, lower big-O complexity is harder to achieve and requires skilled implementation, a lot of knowledge and a lot of testing.
big-O hides the details about a constant: algorithm that performs in 10^5 is better from big-O point of view than 1/10^5 * log(n) (O(1) vs O(log(n)), but for most reasonable n the first one will perform better. For example the best complexity for matrix multiplication is O(n^2.373) but the constant is so high that no (to my knowledge) computational libraries use it.
big-O makes sense when you calculate over something big. If you need to sort array of three numbers, it matters really little whether you use O(n*log(n)) or O(n^2) algorithm.
sometimes the advantage of the lowercase time complexity can be really negligible. For example there is a data structure tango tree which gives a O(log log N) time complexity to find an item, but there is also a binary tree which finds the same in O(log n). Even for huge numbers of n = 10^20 the difference is negligible.
time complexity is not everything. Imagine an algorithm that runs in O(n^2) and requires O(n^2) memory. It might be preferable over O(n^3) time and O(1) space when the n is not really big. The problem is that you can wait for a long time, but highly doubt you can find a RAM big enough to use it with your algorithm
parallelization is a good feature in our distributed world. There are algorithms that are easily parallelizable, and there are some that do not parallelize at all. Sometimes it makes sense to run an algorithm on 1000 commodity machines with a higher complexity than using one machine with a slightly better complexity.
in some places (security) a complexity can be a requirement. No one wants to have a hash algorithm that can hash blazingly fast (because then other people can bruteforce you way faster)
although this is not related to switch of complexity, but some of the security functions should be written in a manner to prevent timing attack. They mostly stay in the same complexity class, but are modified in a way that it always takes worse case to do something. One example is comparing that strings are equal. In most applications it makes sense to break fast if the first bytes are different, but in security you will still wait for the very end to tell the bad news.
somebody patented the lower-complexity algorithm and it is more economical for a company to use higher complexity than to pay money.
some algorithms adapt well to particular situations. Insertion sort, for example, has an average time-complexity of O(n^2), worse than quicksort or mergesort, but as an online algorithm it can efficiently sort a list of values as they are received (as user input) where most other algorithms can only efficiently operate on a complete list of values.
There is always the hidden constant, which can be lower on the O(log n) algorithm. So it can work faster in practice for real-life data.
There are also space concerns (e.g. running on a toaster).
There's also developer time concern - O(log n) may be 1000× easier to implement and verify.
I'm surprised nobody has mentioned memory-bound applications yet.
There may be an algorithm that has less floating point operations either due to its complexity (i.e. O(1) < O(log n)) or because the constant in front of the complexity is smaller (i.e. 2n2 < 6n2). Regardless, you might still prefer the algorithm with more FLOP if the lower FLOP algorithm is more memory-bound.
What I mean by "memory-bound" is that you are often accessing data that is constantly out-of-cache. In order to fetch this data, you have to pull the memory from your actually memory space into your cache before you can perform your operation on it. This fetching step is often quite slow - much slower than your operation itself.
Therefore, if your algorithm requires more operations (yet these operations are performed on data that is already in cache [and therefore no fetching required]), it will still out-perform your algorithm with fewer operations (which must be performed on out-of-cache data [and therefore require a fetch]) in terms of actual wall-time.
In contexts where data security is a concern, a more complex algorithm may be preferable to a less complex algorithm if the more complex algorithm has better resistance to timing attacks.
Alistra nailed it but failed to provide any examples so I will.
You have a list of 10,000 UPC codes for what your store sells. 10 digit UPC, integer for price (price in pennies) and 30 characters of description for the receipt.
O(log N) approach: You have a sorted list. 44 bytes if ASCII, 84 if Unicode. Alternately, treat the UPC as an int64 and you get 42 & 72 bytes. 10,000 records--in the highest case you're looking at a bit under a megabyte of storage.
O(1) approach: Don't store the UPC, instead you use it as an entry into the array. In the lowest case you're looking at almost a third of a terabyte of storage.
Which approach you use depends on your hardware. On most any reasonable modern configuration you're going to use the log N approach. I can picture the second approach being the right answer if for some reason you're running in an environment where RAM is critically short but you have plenty of mass storage. A third of a terabyte on a disk is no big deal, getting your data in one probe of the disk is worth something. The simple binary approach takes 13 on average. (Note, however, that by clustering your keys you can get this down to a guaranteed 3 reads and in practice you would cache the first one.)
Consider a red-black tree. It has access, search, insert, and delete of O(log n). Compare to an array, which has access of O(1) and the rest of the operations are O(n).
So given an application where we insert, delete, or search more often than we access and a choice between only these two structures, we would prefer the red-black tree. In this case, you might say we prefer the red-black tree's more cumbersome O(log n) access time.
Why? Because the access is not our overriding concern. We are making a trade off: the performance of our application is more heavily influenced by factors other than this one. We allow this particular algorithm to suffer performance because we make large gains by optimizing other algorithms.
So the answer to your question is simply this: when the algorithm's growth rate isn't what we want to optimize, when we want to optimize something else. All of the other answers are special cases of this. Sometimes we optimize the run time of other operations. Sometimes we optimize for memory. Sometimes we optimize for security. Sometimes we optimize maintainability. Sometimes we optimize for development time. Even the overriding constant being low enough to matter is optimizing for run time when you know the growth rate of the algorithm isn't the greatest impact on run time. (If your data set was outside this range, you would optimize for the growth rate of the algorithm because it would eventually dominate the constant.) Everything has a cost, and in many cases, we trade the cost of a higher growth rate for the algorithm to optimize something else.
Yes.
In a real case, we ran some tests on doing table lookups with both short and long string keys.
We used a std::map, a std::unordered_map with a hash that samples at most 10 times over the length of the string (our keys tend to be guid-like, so this is decent), and a hash that samples every character (in theory reduced collisions), an unsorted vector where we do a == compare, and (if I remember correctly) an unsorted vector where we also store a hash, first compare the hash, then compare the characters.
These algorithms range from O(1) (unordered_map) to O(n) (linear search).
For modest sized N, quite often the O(n) beat the O(1). We suspect this is because the node-based containers required our computer to jump around in memory more, while the linear-based containers did not.
O(lg n) exists between the two. I don't remember how it did.
The performance difference wasn't that large, and on larger data sets the hash-based one performed much better. So we stuck with the hash-based unordered map.
In practice, for reasonable sized n, O(lg n) is O(1). If your computer only has room for 4 billion entries in your table, then O(lg n) is bounded above by 32. (lg(2^32)=32) (in computer science, lg is short hand for log based 2).
In practice, lg(n) algorithms are slower than O(1) algorithms not because of the logarithmic growth factor, but because the lg(n) portion usually means there is a certain level of complexity to the algorithm, and that complexity adds a larger constant factor than any of the "growth" from the lg(n) term.
However, complex O(1) algorithms (like hash mapping) can easily have a similar or larger constant factor.
The possibility to execute an algorithm in parallel.
I don't know if there is an example for the classes O(log n) and O(1), but for some problems, you choose an algorithm with a higher complexity class when the algorithm is easier to execute in parallel.
Some algorithms cannot be parallelized but have so low complexity class. Consider another algorithm which achieves the same result and can be parallelized easily, but has a higher complexity class. When executed on one machine, the second algorithm is slower, but when executed on multiple machines, the real execution time gets lower and lower while the first algorithm cannot speed up.
Let's say you're implementing a blacklist on an embedded system, where numbers between 0 and 1,000,000 might be blacklisted. That leaves you two possible options:
Use a bitset of 1,000,000 bits
Use a sorted array of the blacklisted integers and use a binary search to access them
Access to the bitset will have guaranteed constant access. In terms of time complexity, it is optimal. Both from a theoretical and from a practical point view (it is O(1) with an extremely low constant overhead).
Still, you might want to prefer the second solution. Especially if you expect the number of blacklisted integers to be very small, as it will be more memory efficient.
And even if you do not develop for an embedded system where memory is scarce, I just can increase the arbitrary limit of 1,000,000 to 1,000,000,000,000 and make the same argument. Then the bitset would require about 125G of memory. Having a guaranteed worst-case complexitity of O(1) might not convince your boss to provide you such a powerful server.
Here, I would strongly prefer a binary search (O(log n)) or binary tree (O(log n)) over the O(1) bitset. And probably, a hash table with its worst-case complexity of O(n) will beat all of them in practice.
My answer here Fast random weighted selection across all rows of a stochastic matrix is an example where an algorithm with complexity O(m) is faster than one with complexity O(log(m)), when m is not too big.
A more general question is if there are situations where one would prefer an O(f(n)) algorithm to an O(g(n)) algorithm even though g(n) << f(n) as n tends to infinity. As others have already mentioned, the answer is clearly "yes" in the case where f(n) = log(n) and g(n) = 1. It is sometimes yes even in the case that f(n) is polynomial but g(n) is exponential. A famous and important example is that of the Simplex Algorithm for solving linear programming problems. In the 1970s it was shown to be O(2^n). Thus, its worse-case behavior is infeasible. But -- its average case behavior is extremely good, even for practical problems with tens of thousands of variables and constraints. In the 1980s, polynomial time algorithms (such a Karmarkar's interior-point algorithm) for linear programming were discovered, but 30 years later the simplex algorithm still seems to be the algorithm of choice (except for certain very large problems). This is for the obvious reason that average-case behavior is often more important than worse-case behavior, but also for a more subtle reason that the simplex algorithm is in some sense more informative (e.g. sensitivity information is easier to extract).
People have already answered your exact question, so I'll tackle a slightly different question that people may actually be thinking of when coming here.
A lot of the "O(1) time" algorithms and data structures actually only take expected O(1) time, meaning that their average running time is O(1), possibly only under certain assumptions.
Common examples: hashtables, expansion of "array lists" (a.k.a. dynamically sized arrays/vectors).
In such scenarios, you may prefer to use data structures or algorithms whose time is guaranteed to be absolutely bounded logarithmically, even though they may perform worse on average.
An example might therefore be a balanced binary search tree, whose running time is worse on average but better in the worst case.
To put my 2 cents in:
Sometimes a worse complexity algorithm is selected in place of a better one, when the algorithm runs on a certain hardware environment. Suppose our O(1) algorithm non-sequentially accesses every element of a very big, fixed-size array to solve our problem. Then put that array on a mechanical hard drive, or a magnetic tape.
In that case, the O(logn) algorithm (suppose it accesses disk sequentially), becomes more favourable.
There is a good use case for using a O(log(n)) algorithm instead of an O(1) algorithm that the numerous other answers have ignored: immutability. Hash maps have O(1) puts and gets, assuming good distribution of hash values, but they require mutable state. Immutable tree maps have O(log(n)) puts and gets, which is asymptotically slower. However, immutability can be valuable enough to make up for worse performance and in the case where multiple versions of the map need to be retained, immutability allows you to avoid having to copy the map, which is O(n), and therefore can improve performance.
Simply: Because the coefficient - the costs associated with setup, storage, and the execution time of that step - can be much, much larger with a smaller big-O problem than with a larger one. Big-O is only a measure of the algorithms scalability.
Consider the following example from the Hacker's Dictionary, proposing a sorting algorithm relying on the Multiple Worlds Interpretation of Quantum Mechanics:
Permute the array randomly using a quantum process,
If the array is not sorted, destroy the universe.
All remaining universes are now sorted [including the one you are in].
(Source: http://catb.org/~esr/jargon/html/B/bogo-sort.html)
Notice that the big-O of this algorithm is O(n), which beats any known sorting algorithm to date on generic items. The coefficient of the linear step is also very low (since it's only a comparison, not a swap, that is done linearly). A similar algorithm could, in fact, be used to solve any problem in both NP and co-NP in polynomial time, since each possible solution (or possible proof that there is no solution) can be generated using the quantum process, then verified in polynomial time.
However, in most cases, we probably don't want to take the risk that Multiple Worlds might not be correct, not to mention that the act of implementing step 2 is still "left as an exercise for the reader".
At any point when n is bounded and the constant multiplier of O(1) algorithm is higher than the bound on log(n). For example, storing values in a hashset is O(1), but may require an expensive computation of a hash function. If the data items can be trivially compared (with respect to some order) and the bound on n is such that log n is significantly less than the hash computation on any one item, then storing in a balanced binary tree may be faster than storing in a hashset.
In a realtime situation where you need a firm upper bound you would select e.g. a heapsort as opposed to a Quicksort, because heapsort's average behaviour is also its worst-case behaviour.
Adding to the already good answers.A practical example would be Hash indexes vs B-tree indexes in postgres database.
Hash indexes form a hash table index to access the data on the disk while btree as the name suggests uses a Btree data structure.
In Big-O time these are O(1) vs O(logN).
Hash indexes are presently discouraged in postgres since in a real life situation particularly in database systems, achieving hashing without collision is very hard(can lead to a O(N) worst case complexity) and because of this, it is even more harder to make them crash safe (called write ahead logging - WAL in postgres).
This tradeoff is made in this situation since O(logN) is good enough for indexes and implementing O(1) is pretty hard and the time difference would not really matter.
When n is small, and O(1) is constantly slow.
When the "1" work unit in O(1) is very high relative to the work unit in O(log n) and the expected set size is small-ish. For example, it's probably slower to compute Dictionary hash codes than iterate an array if there are only two or three items.
or
When the memory or other non-time resource requirements in the O(1) algorithm are exceptionally large relative to the O(log n) algorithm.
when redesigning a program, a procedure is found to be optimized with O(1) instead of O(lgN), but if it's not the bottleneck of this program, and it's hard to understand the O(1) alg. Then you would not have to use O(1) algorithm
when O(1) needs much memory that you cannot supply, while the time of O(lgN) can be accepted.
This is often the case for security applications that we want to design problems whose algorithms are slow on purpose in order to stop someone from obtaining an answer to a problem too quickly.
Here are a couple of examples off the top of my head.
Password hashing is sometimes made arbitrarily slow in order to make it harder to guess passwords by brute-force. This Information Security post has a bullet point about it (and much more).
Bit Coin uses a controllably slow problem for a network of computers to solve in order to "mine" coins. This allows the currency to be mined at a controlled rate by the collective system.
Asymmetric ciphers (like RSA) are designed to make decryption without the keys intentionally slow in order to prevent someone else without the private key to crack the encryption. The algorithms are designed to be cracked in hopefully O(2^n) time where n is the bit-length of the key (this is brute force).
Elsewhere in CS, Quick Sort is O(n^2) in the worst case but in the general case is O(n*log(n)). For this reason, "Big O" analysis sometimes isn't the only thing you care about when analyzing algorithm efficiency.
There are plenty of good answers, a few of which mention the constant factor, the input size and memory constraints, among many other reasons complexity is only a theoretical guideline rather than the end-all determination of real-world fitness for a given purpose or speed.
Here's a simple, concrete example to illustrate these ideas. Let's say we want to figure out whether an array has a duplicate element. The naive quadratic approach is to write a nested loop:
const hasDuplicate = arr => {
for (let i = 0; i < arr.length; i++) {
for (let j = i + 1; j < arr.length; j++) {
if (arr[i] === arr[j]) {
return true;
}
}
}
return false;
};
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
But this can be done in linear time by creating a set data structure (i.e. removing duplicates), then comparing its size to the length of the array:
const hasDuplicate = arr => new Set(arr).size !== arr.length;
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
Big O tells us is that the new Set approach will scale a great deal better from a time complexity standpoint.
However, it turns out that the "naive" quadratic approach has a lot going for it that Big O can't account for:
No additional memory usage
No heap memory allocation (no new)
No garbage collection for the temporary Set
Early bailout; in a case when the duplicate is known to be likely in the front of the array, there's no need to check more than a few elements.
If our use case is on bounded small arrays, we have a resource-constrained environment and/or other known common-case properties allow us to establish through benchmarks that the nested loop is faster on our particular workload, it might be a good idea.
On the other hand, maybe the set can be created one time up-front and used repeatedly, amortizing its overhead cost across all of the lookups.
This leads inevitably to maintainability/readability/elegance and other "soft" costs. In this case, the new Set() approach is probably more readable, but it's just as often (if not more often) that achieving the better complexity comes at great engineering cost.
Creating and maintaining a persistent, stateful Set structure can introduce bugs, memory/cache pressure, code complexity, and all other manner of design tradeoffs. Negotiating these tradeoffs optimally is a big part of software engineering, and time complexity is just one factor to help guide that process.
A few other examples that I don't see mentioned yet:
In real-time environments, for example resource-constrained embedded systems, sometimes complexity sacrifices are made (typically related to caches and memory or scheduling) to avoid incurring occasional worst-case penalties that can't be tolerated because they might cause jitter.
Also in embedded programming, the size of the code itself can cause cache pressure, impacting memory performance. If an algorithm has worse complexity but will result in massive code size savings, that might be a reason to choose it over an algorithm that's theoretically better.
In most implementations of recursive linearithmic algorithms like quicksort, when the array is small enough, a quadratic sorting algorithm like insertion sort is often called because the overhead of recursive function calls on increasingly tiny arrays tends to outweigh the cost of nested loops. Insertion sort is also fast on mostly-sorted arrays as the inner loop won't run much. This answer discusses this in an older version of Chrome's V8 engine before they moved to Timsort.
I read some blog and tutorial on Tries , hashing, Map(stl) and BST. I am very confused in which one is better to use and where. I know that to make such difference between them are nonsense because they are all implementation dependent. Would you please tell me more specific and please do not forgot to mention the complexity (worst , avg, and best case).
Thanks in advance...
BST is Binary Search Tree. It is used for a dictionary. BST has no limitations on structure, and thus a search/insertion/deletion is O(n) worst case.
Map [on stl] is also a dictionary, and is actually a red-black tree [on stl]. it is a special kind of BST, which has limitations on structures, because of it, worst case search/insert/delete is O(logn).
hashing hash table is a different type of dictionary, the advantage of a hash table [with good hash functions] is O(1) average time for search/delete/insert. however, the worst case is O(n), which happens if too much elements collide and/or when a rehash is needed [when Load Balance is too high, we allocate a bigger array, and rehash all the elements to is].
Tries are special for strings. all ops are O(S) where S is the string's length. it's advantage on the others [when dealing with strings] is you need to read the string anyway, so the complexity if a Map for instance, when dealing with strings, is actually O(S*n*logn).
when to use?
a Map [or any other balanced tree] should almost always be a better choice then a regular BST.
hash table is a good choice when you want average short time, but do not care that some times you will have performance loss due to rehash, and in some cases collisions may occur.
Trie is usually a good dictionary for strings.
The worst-case running time of insertion on a red-black tree is O(lg n) and if I perform a in-order walk on the tree, I essentially visit each node, so the total worst-case runtime to print the sorted collection would be O(n lg n)
I am curious, why are red-black trees not preferred for sorting over quick sort (whose average-case running time is O(n lg n).
I see that maybe because red-black trees do not sort in-place, but I am not sure, so maybe someone could help.
Knowing which sort algorithm performs better really depend on your data and situation.
If you are talking in general/practical terms,
Quicksort (the one where you select the pivot randomly/just pick one fixed, making worst case Omega(n^2)) might be better than Red-Black Trees because (not necessarily in order of importance)
Quicksort is in-place. The keeps your memory footprint low. Say this quicksort routine was part of a program which deals with a lot of data. If you kept using large amounts of memory, your OS could start swapping your process memory and trash your perf.
Quicksort memory accesses are localized. This plays well with the caching/swapping.
Quicksort can be easily parallelized (probably more relevant these days).
If you were to try and optimize binary tree sorting (using binary tree without balancing) by using an array instead, you will end up doing something like Quicksort!
Red-Black trees have memory overheads. You have to allocate nodes possibly multiple times, your memory requirements with trees is doubles/triple that using arrays.
After sorting, say you wanted the 1045th (say) element, you will need to maintain order statistics in your tree (extra memory cost because of this) and you will have O(logn) access time!
Red-black trees have overheads just to access the next element (pointer lookups)
Red-black trees do not play well with the cache and the pointer accesses could induce more swapping.
Rotation in red-black trees will increase the constant factor in the O(nlogn).
Perhaps the most important reason (but not valid if you have lib etc available), Quicksort is very simple to understand and implement. Even a school kid can understand it!
I would say you try to measure both implementations and see what happens!
Also, Bob Sedgewick did a thesis on quicksort! Might be worth reading.
There are plenty of sorting algorithms which are worst case O(n log n) - for example, merge sort. The reason quicksort is preferred is because it is faster in practice, even though algorithmically it may not be as good as some other algorithms.
Often in-built sorts use a combination of various methods depending on the values of n.
There are many cases where red-back trees are not bad for sorting. My testing showed, compared to natural merge sort, that red-black trees excel where:
Trees are better for Dups:
All the tests where dups need to be eleminated, tree algorithm is better. This is not astonishing, since the tree can be kept very small from the beginning, whereby algorithms that are designed for inline array sort might pass around larger segments for a longer time.
Trees are better for Random:
All the tests with random, tree algorithm is better. This is also not astonishing, since in a tree distance between elements is shorter and shifting is not necessary. So repeatedly inserting into a tree could need less effort than sorting an array.
So we get the impression that the natural merge sort only excels in ascending and descending special cases. Which cant be even said for quick sort.
Gist with the test cases here.
P.S.: it should be noted that using trees for sorting is non-trivial. One has not only to provide an insert routine but also a routine that can linearize the tree back to an array. We are currently using a get_last and a predecessor routine, which doesn't need a stack. But these routines are not O(1) since they contain loops.
Big-O time complexity measures do not usually take into account scalar factors, e.g., O(2n) and O(4n) are usually just reduced to O(n). Time complexity analysis is based on operational steps at an algorithmic level, not at a strict programming level, i.e., no source code or native machine instruction considerations.
Quicksort is generally faster than tree-based sorting since (1) the methods have the same algorithmic average time complexity, and (2) lookup and swapping operations require fewer program commands and data accesses when working with simple arrays than with red-black trees, even if the tree uses an underlying array-based implementation. Maintenance of the red-black tree constraints requires additional operational steps, data field value storage/access (node colors), etc than the simple array partition-exchange steps of a quicksort.
The net result is that red-black trees have higher scalar coefficients than quicksort does that are being obscured by the standard O(n log n) average time complexity analysis result.
Some other practical considerations related to machine architectures are briefly discussed in the Quicksort article on Wikipedia
Generally, representations of O(nlgn) algorithms can be expanded to A*nlgn + B where A and B are constants. There are many algorithmic proofs that show the coefficients for quicksort are smaller than those of other algorithms. That is in best-case (quick sort performs horribly on sorted data).
Hi the best way to explain the difference between all sorting routine in my opinion is.
(My answer is for people who are confused how quick sort is faster in practice than another sorting algo).
"Think u are running on a very slow computer".
First thing one comparing operation takes 1 hour.
One shifting operation takes 2 hours.
"I am using hour just to make people understand how important time is".
Now from all the sorting operations quick-sort have very very less comparisons and very less swapping for elements.
Quick-sort is faster for this main reason.
People say it takes amortized O(1) to put into a hash table. Therefore, putting n elements must be O(n). That's not true for large n, however, since as an answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."
So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.
For example, if there are (log n) equally spaced collisions, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:
T(n) = T(n/2) + n/2 + n/2
(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
It completely depends on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time, your runtime still approaches O(n). Effectively, you have to specify how inefficient your rehash size calculation is before you can determine the expected order.
People say it takes amortized O(1) to put into a hash table.
From a theoretical standpoint, it is expected amortized O(1).
Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
You can achieve expected amortized O(1) using dynamic perfect hashing:
The naive idea I originally posted was to rehash with a new random hash function on every collision. (See also perfect hash functions) The problem with this is that this requires O(n^2) space, from birthday paradox.
The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That table will have O(\sqrt{n}) elements, so would grow to O(n) size.
In practice you often just use a fixed hash function because you can assume (or don't care if) your input is pathological, much like you often quicksort without prerandomizing the input.
All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.
In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.
In practical terms this means that simple data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience I find trees faster up to ~1k elements (32bit integers), then hash tables take over. But as usual YMMW.
Why not just run a few tests on your system? Maybe if you'll post the source, we can go back and test them on our systems and we could really shape this into a very useful discussion.
It is just not the implementation, but the environment as well that decides how much time the algorithm actually takes. You can however, look if any benchmarking samples are available or not. The problem with me posting my results will be of no use since people have no idea what else is running on my system, how much RAM is free right now and so on. You can only ever have a broad idea. And that is about as good as what the big-O gives you.
I'm building a symbol table for a project I'm working on. I was wondering what peoples opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching and the most commonly recommended are binary trees or linked lists or hash tables. What are the advantages and or disadvantages of all of the above? (working in c++)
The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average
Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the HashTable for your data (R.e., Soraz's comment on this post). Most framework implementations (Java, .NET, etc) will be of a quality that you won't need to worry about the implementations.
Did you do a course on data structures and algorithms at university?
What everybody seems to forget is that for small Ns, IE few symbols in your table, the linked list can be much faster than the hash-table, although in theory its asymptotic complexity is indeed higher.
There is a famous qoute from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post if you will be dealing with a small N or not, but always remember that the best algorithm for large N's are not necessarily good for small Ns.
It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2N) for a balanced binary tree. But lookups in a sorted list may be faster than any of these others structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated1 node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.
A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.
Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.
This question goes through the different containers in C#, but they are similar in any language you use.
Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.