When does Big-O notation fail?

What are some examples where Big-O notation[1] fails in practice?
That is to say: when will the Big-O running time of algorithms predict algorithm A to be faster than algorithm B, yet in practice algorithm B is faster when you run it?
Slightly broader: when do theoretical predictions about algorithm performance mismatch observed running times? A non-Big-O prediction might be based on the average/expected number of rotations in a search tree, or the number of comparisons in a sorting algorithm, expressed as a factor times the number of elements.
Clarification:
Despite what some of the answers say, the Big-O notation is meant to predict algorithm performance. That said, it's a flawed tool: it only speaks about asymptotic performance, and it blurs out the constant factors. It does this for a reason: it's meant to predict algorithmic performance independent of which computer you execute the algorithm on.
What I want to know is this: when do the flaws of this tool show themselves? I've found Big-O notation to be reasonably useful, but far from perfect. What are the pitfalls, the edge cases, the gotchas?
An example of what I'm looking for: running Dijkstra's shortest path algorithm with a Fibonacci heap instead of a binary heap, you get O(m + n log n) time versus O((m+n) log n), for n vertices and m edges. You'd expect a speed increase from the Fibonacci heap sooner or later, yet said speed increase never materialized in my experiments.
(Experimental evidence, without proof, suggests that binary heaps operating on uniformly random edge weights spend O(1) time rather than O(log n) time; that's one big gotcha for the experiments. Another one that's a bitch to count is the expected number of calls to DecreaseKey).
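One way to probe the DecreaseKey count empirically is to instrument a binary-heap Dijkstra. The following is only a minimal sketch, assuming a hypothetical adjacency-list format (`graph[u]` is a list of `(neighbour, weight)` pairs) and using Python's `heapq` with lazy deletion in place of a true DecreaseKey; it is not the code used in the experiments described above.

```python
import heapq

def dijkstra_counting(graph, source):
    """Return (distances, number of times a tentative distance was lowered).

    graph: dict mapping a vertex to a list of (neighbour, weight) pairs,
    with non-negative weights (hypothetical format for this sketch).
    """
    dist = {source: 0}
    decrease_keys = 0
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale entry left behind by an earlier "decrease"
        done.add(u)
        for v, w in graph.get(u, ()):
            nd = d + w
            if v not in dist:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
            elif nd < dist[v]:
                # A real DecreaseKey would happen here; heapq has none,
                # so we push a new entry and skip the old one later.
                dist[v] = nd
                decrease_keys += 1
                heapq.heappush(heap, (nd, v))
    return dist, decrease_keys
```

Running this on random graphs of growing size and plotting the counter against m would show how far the observed number of DecreaseKey calls sits below the worst case.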
[1] Really it isn't the notation that fails, but the concepts the notation stands for, and the theoretical approach to predicting algorithm performance. </anti-pedantry>
On the accepted answer:
I've accepted an answer to highlight the kind of answers I was hoping for. Many different answers which are just as good exist :) What I like about the answer is that it suggests a general rule for when Big-O notation "fails" (when cache misses dominate execution time) which might also increase understanding (in some sense I'm not sure how to best express ATM).

It fails in exactly one case: When people try to use it for something it's not meant for.
It tells you how an algorithm scales. It does not tell you how fast it is.
Big-O notation doesn't tell you which algorithm will be faster in any specific case. It only tells you that for sufficiently large input, one will be faster than the other.

When N is small, the constant factor dominates. Looking up an item in an array of five items is probably faster than looking it up in a hash table.

Short answer: When n is small. The Traveling Salesman Problem is quickly solved when you only have three destinations (however, finding the smallest number in a list of a trillion elements can take a while, although this is O(n)).

The canonical example is Quicksort, which has a worst-case time of O(n^2), while Heapsort's is O(n log n). In practice, however, Quicksort is usually faster than Heapsort. Why? Two reasons:
Each iteration in Quicksort is a lot simpler than in Heapsort. Even more, it's easily optimized by simple cache strategies.
The worst case is very hard to hit.
But IMHO, this doesn't mean 'big O fails' in any way. The first factor (iteration time) is easy to incorporate into your estimates. After all, big O numbers should be multiplied by this almost-constant factor.
The second factor melts away if you get the amortized figures instead of average. They can be harder to estimate, but tell a more complete story.

One area where Big O fails is memory access patterns. Big O only counts operations that need to be performed - it can't keep track if an algorithm results in more cache misses or data that needs to be paged in from disk. For small N, these effects will typically dominate. For instance, a linear search through an array of 100 integers will probably beat out a search through a binary tree of 100 integers due to memory accesses, even though the binary tree will most likely require fewer operations. Each tree node would result in a cache miss, whereas the linear search would mostly hit the cache for each lookup.

Big-O describes the efficiency/complexity of the algorithm and not necessarily the running time of the implementation of a given block of code. This doesn't mean Big-O fails. It just means that it's not meant to predict running time.
Check out the answer to this question for a great definition of Big-O.

For most algorithms there is an "average case" and a "worst case". If your data routinely falls into the "worst case" scenario, it is possible that another algorithm, while theoretically less efficient in the average case, might prove more efficient for your data.
Some algorithms also have best cases that your data can take advantage of. For example, some sorting algorithms have a terrible theoretical efficiency, but are actually very fast if the data is already sorted (or nearly so). Another algorithm, while theoretically faster in the general case, may not take advantage of the fact that the data is already sorted and in practice perform worse.
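To make the best-case point concrete, here is a small sketch that counts comparisons made by a textbook insertion sort (chosen here as an assumed example of such an algorithm; the function name is made up for illustration). On already-sorted input it does one comparison per element, while on reversed input it does the full quadratic amount.

```python
def insertion_sort_comparisons(values):
    """Sort a copy of values with insertion sort; return (sorted_list, comparisons)."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:
                break          # key is already in place relative to a[j]
            a[j + 1] = a[j]    # shift the larger element one slot right
            j -= 1
        a[j + 1] = key
    return a, comparisons

n = 1000
_, sorted_cost = insertion_sort_comparisons(range(n))           # n - 1 comparisons
_, reversed_cost = insertion_sort_comparisons(range(n, 0, -1))  # n * (n - 1) / 2 comparisons
```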
For very small data sets sometimes an algorithm that has a better theoretical efficiency may actually be less efficient because of a large "k" value.

One example (that I'm not an expert on) is that simplex algorithms for linear programming have exponential worst-case complexity on arbitrary inputs, even though they perform well in practice. An interesting solution to this is considering "smoothed complexity", which blends worst-case and average-case performance by looking at small random perturbations of arbitrary inputs.
Spielman and Teng (2004) were able to show that the shadow-vertex simplex algorithm has polynomial smoothed complexity.

Big O does not say e.g. that algorithm A runs faster than algorithm B. It can say that the time or space used by algorithm A grows at a different rate than algorithm B, when the input grows. However, for any specific input size, big O notation does not say anything about the performance of one algorithm relative to another.
For example, A may be slower per operation, but have a better big-O than B. B is more performant for smaller input, but if the data size increases, there will be some cut-off point where A becomes faster. Big-O in itself does not say anything about where that cut-off point is.
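As a rough illustration of the cut-off point, the sketch below compares two made-up cost models: A costs 100 * n * log2(n) steps and B costs n^2 steps. The constants are assumptions picked for the example, not measurements of any real algorithm.

```python
import math

def cost_a(n):
    # Hypothetical algorithm A: better big-O, large constant factor.
    return 100 * n * math.log2(n)

def cost_b(n):
    # Hypothetical algorithm B: worse big-O, small constant factor.
    return n * n

def crossover(start=2, limit=10**6):
    """Return the first n >= start at which A becomes cheaper than B, or None."""
    for n in range(start, limit):
        if cost_a(n) < cost_b(n):
            return n
    return None

cutoff = crossover()
```

With these constants the cut-off lands at roughly a thousand elements; change the 100 to 10,000 and it moves far beyond that. Big-O alone says nothing about which situation you are in.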

The general answer is that Big-O allows you to be really sloppy by hiding the constant factors. As mentioned in the question, the use of Fibonacci Heaps is one example. Fibonacci Heaps do have great asymptotic runtimes, but in practice the constant factors are way too large to be useful for the sizes of data sets encountered in real life.
Fibonacci Heaps are often used in proving good asymptotic upper bounds for graph-related algorithms.
Another similar example is the Coppersmith-Winograd algorithm for matrix multiplication. It is currently the algorithm with the fastest known asymptotic running time for matrix multiplication, O(n^2.376). However, its constant factor is far too large to be useful in practice. Like Fibonacci Heaps, it's frequently used as a building block in other algorithms to prove theoretical time bounds.

This somewhat depends on what the Big-O is measuring - when it's worst case scenarios, it will usually "fail" in that the runtime performance will be much better than the Big-O suggests. If it's average case, then it may be much worse.
Big-O notation typically "fails" if the input data to the algorithm has some prior information. Often, the Big-O notation refers to the worst case complexity - which will often happen if the data is either completely random or completely non-random.
As an example, if you feed data to an algorithm that's profiled and the Big-O is based on randomized data, but your data has a very well-defined structure, your result times may be much faster than expected. By the same token, if you're measuring average complexity, and you feed data that is horribly randomized, the algorithm may perform much worse than expected.

Small N - And for today's computers, 100 is likely too small to worry about.
Hidden Multipliers - e.g. merge vs quick sort.
Pathological Cases - Again, merge vs quick

One broad area where Big-Oh notation fails is when the amount of data exceeds the available amount of RAM.
Using sorting as an example, the amount of time it takes to sort is not dominated by the number of comparisons or swaps (of which there are O(n log n) and O(n), respectively, in the optimal case). The amount of time is dominated by the number of disk operations: block writes and block reads.
To better analyze algorithms which handle data in excess of available RAM, the I/O-model was born, where you count the number of disk reads. In that, you consider three parameters:
The number of elements, N;
The amount of memory (RAM), M (the number of elements that can be in memory); and
The size of a disk block, B (the number of elements per block).
Notably absent is the amount of disk space; this is treated as if it were infinite. A typical extra assumption is that M > B^2.
Continuing the sorting example, you typically favor merge sort in the I/O case: divide the elements into chunks of size θ(M) and sort them in memory (with, say, quicksort). Then, merge θ(M/B) of them by reading the first block from each chunk into memory, stuff all the elements into a heap, and repeatedly pick the smallest element until you have picked B of them. Write this new merge block out and continue. If you ever deplete one of the blocks you read into memory, read a new block from the same chunk and put it into the heap.
(All expressions should be read as being big θ). You form N/M sorted chunks which you then merge. You merge log (base M/B) of N/M times; each time you read and write all the N/B blocks, so it takes you N/B * (log base M/B of N/M) time.
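The pass count above can be turned into a back-of-the-envelope calculator. This is only a sketch of the I/O-model estimate: the function name is made up, the fan-in is taken to be exactly M/B, and the run-formation pass is counted as one extra read and write of all N/B blocks.

```python
def io_merge_sort_cost(n, m, b):
    """Estimate (merge_passes, block_transfers) for external merge sort.

    n: number of elements, m: elements that fit in RAM, b: elements per block.
    """
    def ceil_div(x, y):
        return -(-x // y)

    blocks = ceil_div(n, b)       # N/B blocks on disk
    runs = ceil_div(n, m)         # N/M sorted chunks after the in-memory phase
    fan_in = max(2, m // b)       # chunks merged at once, assumed to be M/B
    passes = 0
    while runs > 1:               # about log base M/B of N/M iterations
        runs = ceil_div(runs, fan_in)
        passes += 1
    # Every pass (including run formation) reads and writes each block once.
    transfers = 2 * blocks * (1 + passes)
    return passes, transfers
```

For example, with N = 10^9 elements, M = 10^6 and B = 10^3, a single merge pass suffices, which is why the I/O-efficient variant stays usable long after the in-memory sorts have become unbearable.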
You can analyze in-memory sorting algorithms (suitably modified to include block reads and block writes) and see that they're much less efficient than the merge sort I've presented.
This knowledge is courtesy of my I/O-algorithms course, by Arge and Brodal (http://daimi.au.dk/~large/ioS08/); I also conducted experiments which validate the theory: heap sort takes "almost infinite" time once you exceed memory. Quick sort becomes unbearably slow, merge sort barely bearably slow, I/O-efficient merge sort performs well (the best of the bunch).

I've seen a few cases where, as the data set grew, the algorithmic complexity became less important than the memory access pattern. Navigating a large data structure with a smart algorithm can, in some cases, cause far more page faults or cache misses, than an algorithm with a worse big-O.
For small n, two algorithms may be comparable. As n increases, the smarter algorithm outperforms. But, at some point, n grows big enough that the system succumbs to memory pressure, in which case the "worse" algorithm may actually perform better because the constants are essentially reset.
This isn't particularly interesting, though. By the time you reach this inversion point, the performance of both algorithms is usually unacceptable, and you have to find a new algorithm that has a friendlier memory access pattern AND a better big-O complexity.

This question is like asking, "When does a person's IQ fail in practice?" It's clear that having a high IQ does not mean you'll be successful in life and having a low IQ does not mean you'll perish. Yet, we measure IQ as a means of assessing potential, even if it's not an absolute.
In algorithms, the Big-Oh notation gives you the algorithm's IQ. It doesn't necessarily mean that the algorithm will perform best for your particular situation, but there's some mathematical basis that says this algorithm has some good potential. If Big-Oh notation were enough to measure performance you would see a lot more of it and less runtime testing.
Think of Big-Oh as a range instead of a specific measure of better-or-worse. There's best case scenarios and worst case scenarios and a huge set of scenarios in between. Choose your algorithms by how well they fit within the Big-Oh range, but don't rely on the notation as an absolute for measuring performance.

When your data doesn't fit the model, big-o notation will still work, but you're going to see an overlap from best and worst case scenarios.
Also, some operations are tuned for linear data access vs. random data access, so one algorithm, while superior in terms of cycles, might be doggedly slow if the method of calling it changes from the design. Similarly, if an algorithm causes page/cache misses due to the way it accesses memory, Big-O isn't going to give an accurate estimate of the cost of running a process.
Apparently, as I've forgotten, also when N is small :)

The short answer: always on modern hardware when you start using a lot of memory. The textbooks assume memory access is uniform, and it is no longer. You can of course do Big O analysis for a non-uniform access model, but that is somewhat more complex.
The small n cases are obvious but not interesting: fast enough is fast enough.
In practice I've had problems using the standard collections in Delphi, Java, C# and Smalltalk with a few million objects. And with smaller ones where the dominant factor proved to be the hash function or the compare.

Robert Sedgewick talks about shortcomings of the big-O notation in his Coursera course on Analysis of Algorithms. He calls particularly egregious examples galactic algorithms because while they may have a better complexity class than their predecessors, it would take inputs of astronomical sizes for it to show in practice.
https://www.cs.princeton.edu/~rs/talks/AlgsMasses.pdf

Big O and its brothers are used to compare asymptotic mathematical function growth. I would like to emphasize the mathematical part. It's entirely about being able to reduce your problem to a function where the input grows, a.k.a. scales. It gives you a nice plot where your input (x-axis) is related to the number of operations performed (y-axis). This is purely based on the mathematical function and as such requires us to accurately model the algorithm used as a polynomial of sorts. Then comes the assumption of scaling.
Big O immediately loses its relevance when the data is finite, fixed and of constant size. Which is why nearly all embedded programmers don't even bother with big O. Mathematically this will always come out to O(1), but we know that we need to optimize our code for space and a MHz timing budget at a level where big O simply doesn't work. This optimization is on the order where the individual components matter, due to their direct performance dependence on the system.
Big O's other failure is in its assumption that hardware differences do not matter. A CPU that has a MAC unit, an MMU and/or low-latency bit-shift math operations will do better on some tasks that may be falsely identified as higher order in the asymptotic notation. This is simply because of the limitation of the model itself.
Another common case where big O becomes absolutely irrelevant is where we falsely identify the nature of the problem to be solved and end up with a binary tree when in reality the solution is actually a state machine. The entire algorithm regimen often overlooks finite state machine problems. This is because a state machine complexity grows based on the number of states and not the number of inputs or data which in most cases are constant.
The other aspect here is the memory access itself, which is an extension of the problem of being disconnected from the hardware and execution environment. Many times memory optimization gives performance optimization and vice versa. They are not necessarily mutually exclusive. These relations cannot be easily modeled as simple polynomials. A theoretically bad algorithm running on heap (the region of memory, not the algorithmic heap) data will usually outperform a theoretically good algorithm running on data in the stack. This is because there is a time and space complexity to memory access and storage efficiency that is not part of the mathematical model in most cases, and even when someone attempts to model it, it often gets ignored as lower order terms. These show up as a long series of lower order terms, which together can have a much larger impact when there are sufficiently many of them, yet they are ignored by the model.
Imagine n^3 + 86n^2 + 5*10^6 n^2 + 10^9 n
It's clear that the lower order terms with high multipliers will likely together have larger significance than the highest order term, yet the big O model would have us ignore everything other than n^3. The term "sufficiently large n" is completely abused to imagine unrealistic scenarios to justify the algorithm. For this case, n has to be so large that you will run out of physical memory long before you have to worry about the algorithm itself. The algorithm doesn't matter if you can't even store the data. When memory access is modeled in, the lower order terms may end up looking like the above polynomial, with over 100 highly scaled lower order terms. However, for all practical purposes these terms are never even part of the equation that the algorithm is trying to define.
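To put a number on "sufficiently large n", the sketch below reads the polynomial above as n^3 + 86n^2 + 5*10^6 n^2 + 10^9 n (that reading of the exponents is an assumption) and searches for the first n at which the cubic term exceeds the sum of the lower order terms. The function names are made up for illustration.

```python
def cubic_term(n):
    return n ** 3

def lower_order_terms(n):
    # 86n^2 + 5*10^6 n^2 + 10^9 n, using exact integer arithmetic.
    return 86 * n ** 2 + 5 * 10 ** 6 * n ** 2 + 10 ** 9 * n

def first_n_where_cubic_dominates(limit=10 ** 12):
    """Binary search for the smallest n with n^3 > lower_order_terms(n).

    Valid because the comparison is false below the crossover and true above it.
    """
    lo, hi = 1, limit
    while lo < hi:
        mid = (lo + hi) // 2
        if cubic_term(mid) > lower_order_terms(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

threshold = first_n_where_cubic_dominates()
```

Under this reading the threshold comes out at a little over five million; below that, the term big O keeps is not even the largest one.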
Most scientific notations are generally the description of mathematical functions and used to model something. They are tools. As such the utility of the tool is constrained and only as good as the model itself. If the model cannot describe or is an ill fit to the problem at hand, then the model simply doesn't serve the purpose. This is when a different model needs to be used and when that doesn't work, a direct approach may serve your purpose well.
In addition, many of the original algorithms were modeled on Turing machines, which have a completely different working mechanism, while all computing today follows RASP models. Before you go into big O or any other model, ask yourself this question first: "Am I choosing the right model for the task at hand, and do I have the most practically accurate mathematical function?" If the answer is "No", then go with your experience and intuition and ignore the fancy stuff.

Related

Are there any cases where you would prefer a higher big-O time complexity algorithm over the lower one?

Are there any cases where you would prefer O(log n) time complexity to O(1) time complexity? Or O(n) to O(log n)?
Do you have any examples?

There can be many reasons to prefer an algorithm with higher big O time complexity over the lower one:
most of the time, lower big-O complexity is harder to achieve and requires skilled implementation, a lot of knowledge and a lot of testing.
big-O hides the details about a constant: an algorithm that performs in 10^5 is better from a big-O point of view than one that performs in 1/10^5 * log(n) (O(1) vs O(log(n))), but for most reasonable n the second one will perform better. For example, the best complexity for matrix multiplication is O(n^2.373), but the constant is so high that no (to my knowledge) computational libraries use it.
big-O makes sense when you calculate over something big. If you need to sort an array of three numbers, it matters very little whether you use an O(n*log(n)) or an O(n^2) algorithm.
sometimes the advantage of the lower time complexity can be really negligible. For example, there is a data structure called the tango tree which gives an O(log log N) time complexity to find an item, but there is also a binary tree which finds the same in O(log n). Even for huge numbers such as n = 10^20 the difference is negligible.
time complexity is not everything. Imagine an algorithm that runs in O(n^2) and requires O(n^2) memory. It might be preferable over an algorithm with O(n^3) time and O(1) space when n is not really big. For large n the problem is that you can wait for a long time, but I highly doubt you can find RAM big enough to use with the first algorithm.
parallelization is a good feature in our distributed world. There are algorithms that are easily parallelizable, and there are some that do not parallelize at all. Sometimes it makes sense to run an algorithm on 1000 commodity machines with a higher complexity than using one machine with a slightly better complexity.
in some places (security) a complexity can be a requirement. No one wants to have a hash algorithm that can hash blazingly fast (because then other people can bruteforce you way faster)
although this is not related to a switch of complexity, some security functions should be written in a manner that prevents timing attacks. They mostly stay in the same complexity class, but are modified so that they always take the worst case to do something. One example is comparing strings for equality. In most applications it makes sense to break fast if the first bytes are different, but in security you will still wait for the very end to tell the bad news.
somebody patented the lower-complexity algorithm and it is more economical for a company to use higher complexity than to pay money.
some algorithms adapt well to particular situations. Insertion sort, for example, has an average time-complexity of O(n^2), worse than quicksort or mergesort, but as an online algorithm it can efficiently sort a list of values as they are received (as user input) where most other algorithms can only efficiently operate on a complete list of values.
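The string-comparison point in the list above can be sketched as follows. Both functions are O(n) in the worst case; the difference is that the second always does the worst-case amount of work. The function names are made up for illustration, and in real Python code the standard library's `hmac.compare_digest` is the safer choice, since a pure-Python loop gives no hard timing guarantees.

```python
import hmac

def early_exit_equal(a: bytes, b: bytes) -> bool:
    """Typical comparison: stops at the first differing byte (leaks timing)."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Always scans every byte, accumulating differences before answering."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y   # stays 0 only if every byte pair matches
    return diff == 0

# Standard-library equivalent intended for secrets such as MACs or tokens.
same = hmac.compare_digest(b"secret-token", b"secret-token")
```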

There is always the hidden constant, which can be lower on the O(log n) algorithm. So it can work faster in practice for real-life data.
There are also space concerns (e.g. running on a toaster).
There's also developer time concern - O(log n) may be 1000× easier to implement and verify.

I'm surprised nobody has mentioned memory-bound applications yet.
There may be an algorithm that has fewer floating point operations either due to its complexity (i.e. O(1) < O(log n)) or because the constant in front of the complexity is smaller (i.e. 2n^2 < 6n^2). Regardless, you might still prefer the algorithm with more FLOP if the lower FLOP algorithm is more memory-bound.
What I mean by "memory-bound" is that you are often accessing data that is constantly out-of-cache. In order to fetch this data, you have to pull the memory from your actual memory space into your cache before you can perform your operation on it. This fetching step is often quite slow - much slower than your operation itself.
Therefore, if your algorithm requires more operations (yet these operations are performed on data that is already in cache [and therefore no fetching required]), it will still out-perform your algorithm with fewer operations (which must be performed on out-of-cache data [and therefore require a fetch]) in terms of actual wall-time.

In contexts where data security is a concern, a more complex algorithm may be preferable to a less complex algorithm if the more complex algorithm has better resistance to timing attacks.

Alistra nailed it but failed to provide any examples so I will.
You have a list of 10,000 UPC codes for what your store sells. 10 digit UPC, integer for price (price in pennies) and 30 characters of description for the receipt.
O(log N) approach: You have a sorted list. 44 bytes if ASCII, 84 if Unicode. Alternately, treat the UPC as an int64 and you get 42 & 72 bytes. 10,000 records--in the highest case you're looking at a bit under a megabyte of storage.
O(1) approach: Don't store the UPC, instead you use it as an entry into the array. In the lowest case you're looking at almost a third of a terabyte of storage.
Which approach you use depends on your hardware. On most any reasonable modern configuration you're going to use the log N approach. I can picture the second approach being the right answer if for some reason you're running in an environment where RAM is critically short but you have plenty of mass storage. A third of a terabyte on a disk is no big deal, getting your data in one probe of the disk is worth something. The simple binary approach takes 13 on average. (Note, however, that by clustering your keys you can get this down to a guaranteed 3 reads and in practice you would cache the first one.)
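The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes a 4-byte integer price and 2 bytes per character for "Unicode", which is what the 44/84 and 42/72 byte record sizes imply; the constant and function names are made up for illustration.

```python
import math

RECORDS = 10_000
UPC_DIGITS = 10
PRICE_BYTES = 4        # assumption: 32-bit integer price in pennies
DESCRIPTION_CHARS = 30

def record_size(bytes_per_char, upc_as_int64):
    """Size in bytes of one (UPC, price, description) record."""
    upc = 8 if upc_as_int64 else UPC_DIGITS * bytes_per_char
    return upc + PRICE_BYTES + DESCRIPTION_CHARS * bytes_per_char

# O(log N) approach: sorted list of 10,000 records, largest layout.
sorted_list_bytes = RECORDS * record_size(2, upc_as_int64=False)

# O(1) approach: one slot per possible 10-digit UPC, smallest layout
# (no UPC stored, ASCII description).
direct_array_bytes = 10 ** UPC_DIGITS * (PRICE_BYTES + DESCRIPTION_CHARS)

# Probes needed by a plain binary search over the sorted list.
binary_search_probes = math.log2(RECORDS)
```

That gives 840,000 bytes for the sorted list against 340,000,000,000 bytes for the directly indexed array, and about 13 probes for the binary search.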

Consider a red-black tree. It has access, search, insert, and delete of O(log n). Compare to an array, which has access of O(1) and the rest of the operations are O(n).
So given an application where we insert, delete, or search more often than we access and a choice between only these two structures, we would prefer the red-black tree. In this case, you might say we prefer the red-black tree's more cumbersome O(log n) access time.
Why? Because the access is not our overriding concern. We are making a trade off: the performance of our application is more heavily influenced by factors other than this one. We allow this particular algorithm to suffer performance because we make large gains by optimizing other algorithms.
So the answer to your question is simply this: when the algorithm's growth rate isn't what we want to optimize, when we want to optimize something else. All of the other answers are special cases of this. Sometimes we optimize the run time of other operations. Sometimes we optimize for memory. Sometimes we optimize for security. Sometimes we optimize maintainability. Sometimes we optimize for development time. Even the overriding constant being low enough to matter is optimizing for run time when you know the growth rate of the algorithm isn't the greatest impact on run time. (If your data set was outside this range, you would optimize for the growth rate of the algorithm because it would eventually dominate the constant.) Everything has a cost, and in many cases, we trade the cost of a higher growth rate for the algorithm to optimize something else.

Yes.
In a real case, we ran some tests on doing table lookups with both short and long string keys.
We used a std::map, a std::unordered_map with a hash that samples at most 10 times over the length of the string (our keys tend to be guid-like, so this is decent), and a hash that samples every character (in theory reduced collisions), an unsorted vector where we do a == compare, and (if I remember correctly) an unsorted vector where we also store a hash, first compare the hash, then compare the characters.
These algorithms range from O(1) (unordered_map) to O(n) (linear search).
For modest sized N, quite often the O(n) beat the O(1). We suspect this is because the node-based containers required our computer to jump around in memory more, while the linear-based containers did not.
O(lg n) exists between the two. I don't remember how it did.
The performance difference wasn't that large, and on larger data sets the hash-based one performed much better. So we stuck with the hash-based unordered map.
In practice, for reasonable sized n, O(lg n) is O(1). If your computer only has room for 4 billion entries in your table, then O(lg n) is bounded above by 32. (lg(2^32)=32) (in computer science, lg is short hand for log based 2).
In practice, lg(n) algorithms are slower than O(1) algorithms not because of the logarithmic growth factor, but because the lg(n) portion usually means there is a certain level of complexity to the algorithm, and that complexity adds a larger constant factor than any of the "growth" from the lg(n) term.
However, complex O(1) algorithms (like hash mapping) can easily have a similar or larger constant factor.

The possibility to execute an algorithm in parallel.
I don't know if there is an example for the classes O(log n) and O(1), but for some problems, you choose an algorithm with a higher complexity class when the algorithm is easier to execute in parallel.
Some algorithms cannot be parallelized but have a low complexity class. Consider another algorithm which achieves the same result and can be parallelized easily, but has a higher complexity class. When executed on one machine, the second algorithm is slower, but when executed on multiple machines, the real execution time gets lower and lower while the first algorithm cannot speed up.

Let's say you're implementing a blacklist on an embedded system, where numbers between 0 and 1,000,000 might be blacklisted. That leaves you two possible options:
Use a bitset of 1,000,000 bits
Use a sorted array of the blacklisted integers and use a binary search to access them
Access to the bitset will have guaranteed constant access. In terms of time complexity, it is optimal, both from a theoretical and from a practical point of view (it is O(1) with an extremely low constant overhead).
Still, you might want to prefer the second solution. Especially if you expect the number of blacklisted integers to be very small, as it will be more memory efficient.
And even if you do not develop for an embedded system where memory is scarce, I can just increase the arbitrary limit of 1,000,000 to 1,000,000,000,000 and make the same argument. Then the bitset would require about 125G of memory. Having a guaranteed worst-case complexity of O(1) might not convince your boss to provide you such a powerful server.
Here, I would strongly prefer a binary search (O(log n)) or binary tree (O(log n)) over the O(1) bitset. And probably, a hash table with its worst-case complexity of O(n) will beat all of them in practice.
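The two options can be sketched side by side. This is a minimal illustration with made-up class names: a bitset backed by a `bytearray`, and a sorted list searched with the standard `bisect` module.

```python
from bisect import bisect_left

class BitsetBlacklist:
    """O(1) lookup; memory grows with the size of the number range."""
    def __init__(self, limit):
        self.bits = bytearray((limit + 8) // 8)  # one bit per value in 0..limit

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

class SortedBlacklist:
    """O(log k) lookup; memory grows with the number of blacklisted values k."""
    def __init__(self, numbers):
        self.numbers = sorted(set(numbers))

    def __contains__(self, n):
        i = bisect_left(self.numbers, n)
        return i < len(self.numbers) and self.numbers[i] == n

blocked = [17, 4242, 999_999]
bitset = BitsetBlacklist(1_000_000)
for n in blocked:
    bitset.add(n)
sorted_list = SortedBlacklist(blocked)
```

With three blacklisted numbers the bitset still occupies about 125 KB, while the sorted list holds three integers. Scaling the limit to 10^12 multiplies only the bitset's footprint.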

My answer to "Fast random weighted selection across all rows of a stochastic matrix" is an example where an algorithm with complexity O(m) is faster than one with complexity O(log(m)), when m is not too big.

A more general question is if there are situations where one would prefer an O(f(n)) algorithm to an O(g(n)) algorithm even though g(n) << f(n) as n tends to infinity. As others have already mentioned, the answer is clearly "yes" in the case where f(n) = log(n) and g(n) = 1. It is sometimes yes even in the case that f(n) is exponential but g(n) is polynomial. A famous and important example is that of the Simplex Algorithm for solving linear programming problems. In the 1970s it was shown to be O(2^n). Thus, its worst-case behavior is infeasible. But its average-case behavior is extremely good, even for practical problems with tens of thousands of variables and constraints. In the 1980s, polynomial time algorithms (such as Karmarkar's interior-point algorithm) for linear programming were discovered, but 30 years later the simplex algorithm still seems to be the algorithm of choice (except for certain very large problems). This is for the obvious reason that average-case behavior is often more important than worst-case behavior, but also for a more subtle reason that the simplex algorithm is in some sense more informative (e.g. sensitivity information is easier to extract).

People have already answered your exact question, so I'll tackle a slightly different question that people may actually be thinking of when coming here.
A lot of the "O(1) time" algorithms and data structures actually only take expected O(1) time, meaning that their average running time is O(1), possibly only under certain assumptions.
Common examples: hashtables, expansion of "array lists" (a.k.a. dynamically sized arrays/vectors).
In such scenarios, you may prefer to use data structures or algorithms whose time is guaranteed to be absolutely bounded logarithmically, even though they may perform worse on average.
An example might therefore be a balanced binary search tree, whose running time is worse on average but better in the worst case.

To put my 2 cents in:
Sometimes a worse complexity algorithm is selected in place of a better one, when the algorithm runs on a certain hardware environment. Suppose our O(1) algorithm non-sequentially accesses every element of a very big, fixed-size array to solve our problem. Then put that array on a mechanical hard drive, or a magnetic tape.
In that case, the O(logn) algorithm (suppose it accesses disk sequentially), becomes more favourable.

There is a good use case for using a O(log(n)) algorithm instead of an O(1) algorithm that the numerous other answers have ignored: immutability. Hash maps have O(1) puts and gets, assuming good distribution of hash values, but they require mutable state. Immutable tree maps have O(log(n)) puts and gets, which is asymptotically slower. However, immutability can be valuable enough to make up for worse performance and in the case where multiple versions of the map need to be retained, immutability allows you to avoid having to copy the map, which is O(n), and therefore can improve performance.
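A minimal sketch of the idea, assuming a plain (unbalanced) binary search tree with path copying; a production persistent map would also rebalance, so the O(log(n)) bound only holds here for well-mixed insertion orders. The `insert` and `get` helpers are illustrative names, not a library API.

```javascript
// Persistent (immutable) binary search tree using path copying.
// Assumption: no rebalancing, so depth is logarithmic only for well-mixed keys.
const insert = (node, key, value) => {
  if (node === null) return { key, value, left: null, right: null };
  if (key < node.key) return { ...node, left: insert(node.left, key, value) };
  if (key > node.key) return { ...node, right: insert(node.right, key, value) };
  return { ...node, value }; // same key: new node carrying the updated value
};

const get = (node, key) => {
  while (node !== null) {
    if (key === node.key) return node.value;
    node = key < node.key ? node.left : node.right;
  }
  return undefined;
};

let v1 = null;
for (const [k, v] of [[5, "e"], [2, "b"], [8, "h"]]) v1 = insert(v1, k, v);
const v2 = insert(v1, 9, "i"); // v1 is untouched; only the path 5 -> 8 is copied

console.log(get(v1, 9));          // undefined
console.log(get(v2, 9));          // "i"
console.log(v1.left === v2.left); // true: the untouched subtree is shared
```

Both versions stay usable after the update, and the new version costs only one copied path rather than an O(n) copy of the whole map.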

Simply: Because the coefficient - the costs associated with setup, storage, and the execution time of that step - can be much, much larger with a smaller big-O problem than with a larger one. Big-O is only a measure of the algorithm's scalability.
Consider the following example from the Hacker's Dictionary, proposing a sorting algorithm relying on the Multiple Worlds Interpretation of Quantum Mechanics:
Permute the array randomly using a quantum process,
If the array is not sorted, destroy the universe.
All remaining universes are now sorted [including the one you are in].
(Source: http://catb.org/~esr/jargon/html/B/bogo-sort.html)
Notice that the big-O of this algorithm is O(n), which beats any known sorting algorithm to date on generic items. The coefficient of the linear step is also very low (since it's only a comparison, not a swap, that is done linearly). A similar algorithm could, in fact, be used to solve any problem in both NP and co-NP in polynomial time, since each possible solution (or possible proof that there is no solution) can be generated using the quantum process, then verified in polynomial time.
However, in most cases, we probably don't want to take the risk that Multiple Worlds might not be correct, not to mention that the act of implementing step 2 is still "left as an exercise for the reader".

At any point when n is bounded and the constant multiplier of O(1) algorithm is higher than the bound on log(n). For example, storing values in a hashset is O(1), but may require an expensive computation of a hash function. If the data items can be trivially compared (with respect to some order) and the bound on n is such that log n is significantly less than the hash computation on any one item, then storing in a balanced binary tree may be faster than storing in a hashset.
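The break-even point can be estimated with a rough cost model. The sketch below assumes a perfectly balanced tree and made-up cost units for one hash computation and one comparison; all names and numbers are hypothetical and would have to be replaced by measurements for real keys.

```javascript
// Rough cost model in arbitrary units. hashCost and compareCost are assumed
// inputs that would have to be measured for your own key type.

// Depth of a perfectly balanced binary tree holding n keys.
const treeDepth = n => {
  let depth = 0;
  while ((1 << depth) - 1 < n) depth++;
  return depth;
};

const treeLookupCost = (n, compareCost) => compareCost * treeDepth(n);
// One hash computation plus one equality check on the candidate key.
const hashLookupCost = (hashCost, compareCost) => hashCost + compareCost;

// Largest n (capped at `limit`) for which the tree lookup is still cheaper.
const largestTreeWin = (hashCost, compareCost, limit = 1 << 20) => {
  let n = 0;
  while (
    n < limit &&
    treeLookupCost(n + 1, compareCost) < hashLookupCost(hashCost, compareCost)
  ) {
    n++;
  }
  return n;
};

console.log(largestTreeWin(3, 1));  // 7
console.log(largestTreeWin(10, 1)); // 1023
```

Under this model, a hash that costs ten comparisons keeps the tree ahead up to roughly a thousand keys, which is why the bound on n matters.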

In a realtime situation where you need a firm upper bound you would select e.g. a heapsort as opposed to a Quicksort, because heapsort's average behaviour is also its worst-case behaviour.
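The gap between quicksort's typical and worst-case behaviour is easy to observe by counting comparisons. This is a sketch of a deliberately naive quicksort that always picks the first element as pivot; real library implementations choose pivots more carefully, and the function name is made up for this example.

```javascript
// Naive quicksort (first element as pivot) that counts key comparisons.
function quicksortComparisons(input) {
  let comparisons = 0;
  const sort = a => {
    if (a.length <= 1) return a;
    const [pivot, ...rest] = a;
    const less = [];
    const more = [];
    for (const x of rest) {
      comparisons++;
      (x < pivot ? less : more).push(x);
    }
    return [...sort(less), pivot, ...sort(more)];
  };
  return { sorted: sort(input), comparisons };
}

const sortedInput = Array.from({ length: 100 }, (_, i) => i + 1);
// A fixed permutation of 1..100 (multiplication by 37 modulo the prime 101).
const mixedInput = sortedInput.map(i => (i * 37) % 101);

// Already-sorted input is the worst case here: n * (n - 1) / 2 comparisons.
console.log(quicksortComparisons(sortedInput).comparisons); // 4950
// The scrambled input needs far fewer comparisons.
console.log(quicksortComparisons(mixedInput).comparisons);
```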

Adding to the already good answers: a practical example is hash indexes vs. B-tree indexes in the PostgreSQL database.
Hash indexes use a hash table to access the data on disk, while a B-tree index, as the name suggests, uses a B-tree data structure.
In Big-O terms these are O(1) vs. O(log N).
Hash indexes are presently discouraged in Postgres, since in real-life situations, particularly in database systems, hashing without collisions is very hard to achieve (collisions can lead to O(N) worst-case complexity), and because of this it is even harder to make them crash safe (via write-ahead logging, WAL in Postgres).
This tradeoff is made because O(log N) is good enough for indexes, implementing O(1) is pretty hard, and the time difference would not really matter.

When n is small, and O(1) is constantly slow.

When the "1" work unit in O(1) is very high relative to the work unit in O(log n) and the expected set size is small-ish. For example, it's probably slower to compute Dictionary hash codes than iterate an array if there are only two or three items.
or
When the memory or other non-time resource requirements in the O(1) algorithm are exceptionally large relative to the O(log n) algorithm.

When redesigning a program, you may find that a procedure could be optimized to O(1) instead of O(lg N). But if it is not the program's bottleneck and the O(1) algorithm is hard to understand, you do not have to use the O(1) algorithm.
When the O(1) algorithm needs more memory than you can supply, while the running time of the O(lg N) algorithm is acceptable.

This is often the case in security applications, where we deliberately design problems whose algorithms are slow in order to stop someone from obtaining an answer too quickly.
Here are a couple of examples off the top of my head.
Password hashing is sometimes made arbitrarily slow in order to make it harder to guess passwords by brute-force. This Information Security post has a bullet point about it (and much more).
Bitcoin uses a controllably slow problem for a network of computers to solve in order to "mine" coins. This allows the currency to be mined at a controlled rate by the collective system.
Asymmetric ciphers (like RSA) are designed to make decryption without the keys intentionally slow, in order to prevent anyone without the private key from cracking the encryption. The algorithms are designed to be cracked in hopefully O(2^n) time, where n is the bit length of the key (this is brute force).
Elsewhere in CS, Quick Sort is O(n^2) in the worst case but in the general case is O(n*log(n)). For this reason, "Big O" analysis sometimes isn't the only thing you care about when analyzing algorithm efficiency.

There are plenty of good answers, a few of which mention the constant factor, the input size and memory constraints, among many other reasons complexity is only a theoretical guideline rather than the end-all determination of real-world fitness for a given purpose or speed.
Here's a simple, concrete example to illustrate these ideas. Let's say we want to figure out whether an array has a duplicate element. The naive quadratic approach is to write a nested loop:
const hasDuplicate = arr => {
  for (let i = 0; i < arr.length; i++) {
    for (let j = i + 1; j < arr.length; j++) {
      if (arr[i] === arr[j]) {
        return true;
      }
    }
  }
  return false;
};
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
But this can be done in linear time by creating a set data structure (i.e. removing duplicates), then comparing its size to the length of the array:
const hasDuplicate = arr => new Set(arr).size !== arr.length;
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
Big O tells us that the new Set approach will scale a great deal better from a time complexity standpoint.
However, it turns out that the "naive" quadratic approach has a lot going for it that Big O can't account for:
No additional memory usage
No heap memory allocation (no new)
No garbage collection for the temporary Set
Early bailout; in a case when the duplicate is known to be likely in the front of the array, there's no need to check more than a few elements.
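The early-bailout point can be made visible by counting comparisons. This is an instrumented sketch of the nested loop from above; `countDuplicateChecks` and the two sample arrays are names and data invented for the illustration.

```javascript
// Instrumented nested-loop check: reports how many comparisons were made.
const countDuplicateChecks = arr => {
  let comparisons = 0;
  for (let i = 0; i < arr.length; i++) {
    for (let j = i + 1; j < arr.length; j++) {
      comparisons++;
      if (arr[i] === arr[j]) return { found: true, comparisons };
    }
  }
  return { found: false, comparisons };
};

// 1000 elements with a duplicate right at the front.
const early = [7, 7, ...Array.from({ length: 998 }, (_, i) => i + 100)];
// 1000 distinct elements: the loop has to check every pair.
const none = Array.from({ length: 1000 }, (_, i) => i);

console.log(countDuplicateChecks(early)); // { found: true, comparisons: 1 }
console.log(countDuplicateChecks(none));  // { found: false, comparisons: 499500 }
```

The Set version always touches all 1000 elements and allocates, so on inputs like the first array the quadratic loop does far less work.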
If our use case is on bounded small arrays, we have a resource-constrained environment and/or other known common-case properties allow us to establish through benchmarks that the nested loop is faster on our particular workload, it might be a good idea.
On the other hand, maybe the set can be created one time up-front and used repeatedly, amortizing its overhead cost across all of the lookups.
This leads inevitably to maintainability/readability/elegance and other "soft" costs. In this case, the new Set() approach is probably more readable, but it's just as often (if not more often) that achieving the better complexity comes at great engineering cost.
Creating and maintaining a persistent, stateful Set structure can introduce bugs, memory/cache pressure, code complexity, and all other manner of design tradeoffs. Negotiating these tradeoffs optimally is a big part of software engineering, and time complexity is just one factor to help guide that process.
A few other examples that I don't see mentioned yet:
In real-time environments, for example resource-constrained embedded systems, sometimes complexity sacrifices are made (typically related to caches and memory or scheduling) to avoid incurring occasional worst-case penalties that can't be tolerated because they might cause jitter.
Also in embedded programming, the size of the code itself can cause cache pressure, impacting memory performance. If an algorithm has worse complexity but will result in massive code size savings, that might be a reason to choose it over an algorithm that's theoretically better.
In most implementations of recursive linearithmic algorithms like quicksort, when the array is small enough, a quadratic sorting algorithm like insertion sort is often called because the overhead of recursive function calls on increasingly tiny arrays tends to outweigh the cost of nested loops. Insertion sort is also fast on mostly-sorted arrays as the inner loop won't run much. This answer discusses this in an older version of Chrome's V8 engine before they moved to Timsort.
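A minimal sketch of such a hybrid, assuming a cutoff of 10 elements (an arbitrary value chosen for illustration; real implementations tune it by benchmarking) and a simple middle-element pivot. It is not the code of any particular engine.

```javascript
// Quicksort that hands small ranges to insertion sort.
// CUTOFF = 10 is an assumed value, not taken from any real library.
const CUTOFF = 10;

function insertionSort(a, lo, hi) {
  for (let i = lo + 1; i <= hi; i++) {
    const x = a[i];
    let j = i - 1;
    while (j >= lo && a[j] > x) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = x;
  }
}

function hybridSort(a, lo = 0, hi = a.length - 1) {
  if (hi - lo + 1 <= CUTOFF) {
    insertionSort(a, lo, hi); // small range: quadratic but low overhead
    return a;
  }
  // Hoare-style partition around the middle element.
  const pivot = a[lo + ((hi - lo) >> 1)];
  let i = lo;
  let j = hi;
  while (i <= j) {
    while (a[i] < pivot) i++;
    while (a[j] > pivot) j--;
    if (i <= j) {
      [a[i], a[j]] = [a[j], a[i]];
      i++;
      j--;
    }
  }
  hybridSort(a, lo, j);
  hybridSort(a, i, hi);
  return a;
}

// A fixed scrambled sample of 50 distinct numbers.
const sample = Array.from({ length: 50 }, (_, i) => ((i + 1) * 37) % 101);
const result = hybridSort([...sample]);
console.log(result.every((x, i) => i === 0 || result[i - 1] <= x)); // true
```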

Right way to discuss computational complexity for small n

When discussing computational complexity, it seems everyone generally goes straight to Big O.
Let's say for example I have a hybrid algorithm such as merge sort which uses insertion sort for smaller subarrays (I believe this is called tiled merge sort). It's still ultimately merge sort with O(n log n), but I want to discuss the behaviour/characteristics of the algorithm for small n, in cases where no merging actually takes place.
For all intents and purposes the tiled merge sort is insertion sort, executing exactly the same instructions for the domain of my small n. However, Big O deals with the large and asymptotic cases and discussing Big O for small n is pretty much an oxymoron. People have yelled at me for even thinking the words "behaves like an O(n^2) algorithm in such cases". What is the correct way to describe the algorithm's behaviour in cases of small n within the context of formal theoretical computational analysis? To clarify, not just in the case where n is small, but in the case where n is never big.
One might say that for such small n it doesn't matter but I'm interested in the cases where it does, for example with a large constant such as being executed many times, and where in practice it would show a clear trend and be the dominant factor. For example the initial quadratic growth seen in the graph below. I'm not dismissing Big O, more asking for a way to properly tell both sides of the story.
[EDIT]
If for "small n", constants can easily remove all trace of a growth rate then either
only the asymptotic case is discussed, in which case there is less relevance to any practical application, or
there must be a threshold at which we agree n is no longer "small".
What about the cases where n is not "small" (n is sufficiently big that the growth rate will not be affected significantly by any practical constant), but not yet big enough to show the final asymptotic growth rate, so only sub growth rates are visible (for example the shape in the image above)?
Are there no practical algorithms that exhibit this behaviour? Even if there aren't, theoretical discussion should still be possible. Do we measure instead of discussing the theory purely because that's "what one should do"? If some behaviour is observed in all practical cases, why can't there be theory that's meaningful?
Let me turn the question around the other way. I have a graph that shows segmented super-linear steps. It sounds like many people would say "this is a pure coincidence, it could be any shape imaginable" (at the extreme of course) and wouldn't bat an eyelid if it were a sine wave instead. I know in many cases the shape could be hidden by constants, but here it's quite obvious. How can I give a formal explanation of why the graph produces this shape?
I particularly like @Sneftel's words "imprecise but useful guidance".
I know Big O and asymptotic analysis isn't applicable. What is? How far can I take it?

For small n, computational complexity - how things change as n increases towards infinity - isn't meaningful, as other effects dominate.
Papers I've seen which discuss behaviour for small values of n do so by measuring the algorithms on real systems, and discuss how the algorithms perform in practice rather than from a theoretical viewpoint. For example, for the graph you've added to your post I would say 'this graph demonstrates an O(N) asymptotic behaviour overall, but the growth within each tile is bounded quadratic'.
I don't know of a situation where a discussion of such behaviour from a theoretical viewpoint would be meaningful - it is well known that for small n the practical effects outweigh the effects of scaling.

It's important to remember that asymptotic analysis is an analytic simplification, not a mandate for analyzing algorithms. Take selection sort, for instance. Yes, it executes in O(n^2) time. But it is also true that it performs precisely n*(n-1)/2 comparisons, and n-1-k swaps, where k is the number of elements (other than the maximum) which start in the correct position. Asymptotic analysis is a tool for simplifying the (otherwise generally impractical) task of performance analysis, and one we can put aside if we're not interested in the "really big n" segment.
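The exact comparison count is easy to confirm by instrumenting the algorithm. The sketch below is one common formulation of selection sort (it selects the minimum and skips swaps that would be no-ops, which is an assumption about the variant); the swap count depends on the input and on that choice, so only the comparison count is checked against n*(n-1)/2.

```javascript
// Selection sort instrumented to count comparisons and swaps.
function selectionSortCounts(input) {
  const a = [...input];
  let comparisons = 0;
  let swaps = 0;
  for (let i = 0; i < a.length - 1; i++) {
    let min = i;
    for (let j = i + 1; j < a.length; j++) {
      comparisons++;
      if (a[j] < a[min]) min = j;
    }
    if (min !== i) {
      [a[i], a[min]] = [a[min], a[i]];
      swaps++;
    }
  }
  return { sorted: a, comparisons, swaps };
}

console.log(selectionSortCounts([3, 1, 2]).comparisons);       // 3  = 3 * 2 / 2
console.log(selectionSortCounts([5, 4, 3, 2, 1]).comparisons); // 10 = 5 * 4 / 2
console.log(selectionSortCounts([1, 2, 3, 4, 5]).comparisons); // 10: input order does not matter
```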
And you can choose how you express your bounds. Say a function requires precisely n + floor(0.01*2^n) operations. That's exponential time, of course. But one can also say "for data sizes up to 10 elements, this algorithm requires between n and 2*n operations." The latter describes not the shape of the curve, but an envelope around that curve, giving imprecise but useful guidance about the practicalities of the algorithm within a particular range of use cases.
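The envelope claim can be checked directly. A short sketch, using the operation count from the paragraph above and testing sizes 0 through 11 (the helper names are made up for the example):

```javascript
// Operation count from the text: n + floor(0.01 * 2^n).
const ops = n => n + Math.floor(0.01 * 2 ** n);

// Is the count inside the envelope [n, 2n]?
const withinEnvelope = n => ops(n) >= n && ops(n) <= 2 * n;

const sizes = Array.from({ length: 12 }, (_, n) => n); // 0..11
console.log(sizes.filter(n => !withinEnvelope(n))); // [ 11 ]
console.log(ops(10), ops(11)); // 20 31
```

The bound holds exactly up to n = 10 and first breaks at n = 11, where the exponential term takes over.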

You are right.
For small n, i.e. when only insertion sort is performed, the asymptotic behavior is quadratic O(n^2).
And for larger n, when tiled merge sort enters into play, the behavior switches to O(n log n).
There is no contradiction if you remember that every behavior has its domain of validity, before the switching threshold (call it N) and after it.
In practice there will be a smooth blend between the curves around N. But in practice too, that value of N is so small that the quadratic behavior does not have enough "room" to manifest itself.
Another way to deal with this analysis is to say that, N being a constant, the insertion sorts take constant time. But I would not say that this view is mandatory.

Let's unpack things a bit. Big-O is a tool for describing the growth rate of a function. One of the functions to which it is commonly applied is the worst-case running time of an algorithm on inputs of length n, executing on a particular abstract machine. We often forget about the last part because there is a large class of machines with random-access memory that can emulate one another with only constant-factor slowdown, and the class of problems solvable within a particular big-O running-time bound is equivalent across these machines.
If you want to talk about complexity on small inputs, then you need to measure constant factors. One way is to measure running times of actual implementations. People usually do this on physical machines, but if you're hardcore like Knuth, you invent your own assembly language complete with instruction timings. Another way is to measure something that's readily identifiable but also a useful proxy for the other work performed. For comparison sorts, this could be comparisons. For numerical algorithms, this is often floating-point operations.
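Counting comparisons as a proxy needs very little machinery. In this sketch, `withCounter` is a hypothetical helper that wraps any comparator; the exact count printed depends on the engine's sorting algorithm, so no specific number is claimed.

```javascript
// Wrap a comparator so that every call is counted.
const withCounter = compare => {
  const counted = (a, b) => {
    counted.calls++;
    return compare(a, b);
  };
  counted.calls = 0;
  return counted;
};

const byValue = withCounter((a, b) => a - b);
const data = [9, 3, 7, 1, 8, 2];
const sortedData = [...data].sort(byValue);

// The count varies between JavaScript engines and versions.
console.log(sortedData, byValue.calls);
```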

Complexity is not about execution time for one n on one machine, so there is no need to consider it even if the constant is large. Complexity tells you how the size of the input affects execution time. For small n, you can treat execution time as constant. This is one side.
On the other side, you are saying that:
You have a hybrid algorithm working in O(n log n) for n larger than some k and O(n^2) for n smaller than k.
The constant k is so large that algorithm works slowly.
There is no sense in such algorithm, because you could easily improve it.
Let's take a red-black tree. Operations on this tree run in O(log n) time, but there is a large constant. So, on normal machines, it could work slowly (i.e. slower than simpler solutions) in some cases. There is no need to consider this when analyzing complexity. You need to consider it when you are implementing it in your system: you need to check if it's the best choice considering the actual machine(s) on which it will be working and what problems it will be solving.

Read Knuth's "The Art of Computer Programming" series, starting with "Volume 1. Fundamental Algorithms", section "1.2.10: Analysis of an Algorithm". There (and in all the rest of his seminal work) he shows how exact analysis can be conducted for arbitrary problem sizes, using a suitable reference machine, by taking a detailed census of every processor instruction.
Such analyses have to take into account not only the problem size, but also any relevant aspect of the input data distribution which will influence the running time. For simplification, the analyses are often limited to the study of the worst case, the expected case or the output-sensitive case, rather than a general statistical characterization. And for further simplification, asymptotic analysis is used.
Quite apart from the fact that, except for trivial algorithms, the detailed approach is mathematically highly challenging, it has become unrealistic on modern machines. Indeed, it relies on a processor behavior similar to the so-called RAM model, which assumes constant time per instruction and per memory access (http://en.wikipedia.org/wiki/Random-access_machine). Except maybe for special hardware combined with a hard real-time OS, these assumptions are nowadays completely wrong.

Suppose you have one algorithm with time complexity O(n^2) and another with time complexity O(n). From these two complexities alone you can't conclude that the latter algorithm is faster than the former for all input values. You can only say that the latter is asymptotically faster (that is, for sufficiently large input values). Keep in mind that asymptotic notation generally ignores constant factors to make the time complexity of the algorithm easier to understand. For example, merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time. But because the hidden constant factors in insertion sort are smaller than those of merge sort, in practice insertion sort can be faster than merge sort for small problem sizes on many machines.
Asymptotic notation does not describe the actual running time of an algorithm. Actual running time depends on the machine, since different machines have different architectures and different instruction cycle times. Asymptotic notation just describes how fast an algorithm is asymptotically relative to other algorithms. It does not describe the behavior of the algorithm for small input values (n <= n0). The value of the threshold n0 depends on the hidden constant factors and lower-order terms, and the hidden constant factors depend on the machine on which the algorithm is executed.

What is the purpose of Big-O notation in computer science if it doesn't give all the information needed?

What is the use of Big-O notation in computer science if it doesn't give all the information needed?
For example, if one algorithm runs at 1000n and one at n, it is true that they are both O(n). But I still may make a foolish choice based on this information, since one algorithm takes 1000 times as long as the other for any given input.
I still need to know all the parts of the equation, including the constant, to make an informed choice, so what is the importance of this "intermediate" comparison? I end up losing important information when it gets reduced to this form, and what do I gain?

What does that constant factor represent? You can't say with certainty, for example, that an algorithm that is O(1000n) will be slower than an algorithm that's O(5n). It might be that the 1000n algorithm loads all data into memory and makes 1,000 passes over that data, and the 5n algorithm makes five passes over a file that's stored on a slow I/O device. The 1000n algorithm will run faster even though its "constant" is much larger.
In addition, some computers perform some operations more quickly than other computers do. It's quite common, given two O(n) algorithms (call them A and B), for A to execute faster on one computer and B to execute faster on the other computer. Or two different implementations of the same algorithm can have widely varying runtimes on the same computer.
Asymptotic analysis, as others have said, gives you an indication of how an algorithm's running time varies with the size of the input. It's useful for giving you a good starting place in algorithm selection. Quick reference will tell you that a particular algorithm is O(n) or O(n log n) or whatever, but it's very easy to find more detailed information on most common algorithms. Still, that more detailed analysis will only give you a constant number without saying how that number relates to real running time.
In the end, the only way you can determine which algorithm is right for you is to study it yourself and then test it against your expected data.
In short, I think you're expecting too much from asymptotic analysis. It's a useful "first line" filter. But when you get beyond that you have to look for more information.

As you correctly noted, it does not give you information on the exact running time of an algorithm. It is mainly used to indicate the complexity of an algorithm, to indicate if it is linear in the input size, quadratic, exponential, etc. This is important when choosing between algorithms if you know that your input size is large, since even a 1000n algorithm will beat a 1.23 exp(n) algorithm for large enough n.
In real world algorithms, the hidden 'scaling factor' is of course important. It is therefore not uncommon to use an algorithm with a 'worse' complexity if it has a lower scaling factor. Many practical implementations of sorting algorithms are for example 'hybrid' and will resort to some 'bad' algorithm like insertion sort (which is O(n^2) but very simple to implement) for n < 10, while changing to quicksort (which is O(n log(n)) but more complex) for n >= 10.

Big-O tells you how the runtime or memory consumption of a process changes as the size of its input changes. O(n) and O(1000n) are both still O(n) -- if you double the size of the input, then for all practical purposes the runtime doubles too.
Now, we can have an O(n) algorithm and an O(n^2) algorithm where the coefficient of n is 1000000 and the coefficient of n^2 is 1, in which case the O(n^2) algorithm would outperform the O(n) for smaller n values. This doesn't change the fact, however, that the second algorithm's runtime grows more rapidly than the first's, and this is the information that big-O tells us. There will be some input size at which the O(n) algorithm begins to outperform the O(n^2) algorithm.
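For these particular coefficients the crossover point can be found with a direct search. A small sketch, treating the two cost functions as exact operation counts (an idealisation; the function names are invented here):

```javascript
// Cost models from the text: 1000000 * n versus 1 * n^2.
const linearCost = n => 1000000 * n;
const quadraticCost = n => n * n;

// Smallest n at which the linear algorithm is strictly cheaper.
function firstLinearWin(limit = 10000000) {
  for (let n = 1; n <= limit; n++) {
    if (linearCost(n) < quadraticCost(n)) return n;
  }
  return null; // no crossover found below the limit
}

console.log(firstLinearWin()); // 1000001
```

Below a million elements the quadratic algorithm is the cheaper one under this model; whether real inputs ever get that large is exactly the question Big-O cannot answer.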

In addition to the hidden impact of the constant term, complexity notation also only considers the worst case instance of a problem.
Case in point, the simplex method (linear programming) has exponential complexity for all known implementations. However, the simplex method works much faster in practice than the provably polynomial-time interior point methods.
Complexity notation has much value for theoretical problem classification. If you want some more information on practical consequences check out "Smoothed Analysis" by Spielman: http://www.cs.yale.edu/homes/spielman
This is what you are looking for.

Its main purpose is rough comparison of logic. The difference between O(n) and O(1000n) is large for n ~ 1000 (n roughly equal to 1000) and n < 1000, but when you compare it to values where n >> 1000 (n much larger than 1000) the difference is minuscule.
You are right in saying they both scale linearly, and knowing the coefficient helps in a detailed analysis, but generally in computing the difference between linear (O(cn)) and polynomial (O(cn^x)) performance is more important to note than the difference between two linear times. There is more value in comparing runtimes of different orders, where the performance difference grows with the input size instead of staying a constant factor.
The overall purpose of Big O notation is to give a sense of relative performance time in order to compare and further optimize algorithms.

You're right that it doesn't give you all information, but there's no single metric in any field that does that.
Big-O notation tells you how quickly the performance gets worse, as your dataset gets larger. In other words, it describes the type of performance curve, but not the absolute performance.
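One way to see the "type of curve" independently of the constant is a doubling test: compare the cost at n and at 2n. The sketch below uses idealised operation-count functions (hypothetical, with arbitrary constants) rather than measured times.

```javascript
// Estimate the growth exponent k (cost ~ c * n^k) from the cost at n and 2n.
const growthExponent = (cost, n) => Math.log2(cost(2 * n) / cost(n));

const linearOps = n => 1000 * n;     // large constant, linear growth
const quadraticOps = n => 3 * n * n; // small constant, quadratic growth

console.log(growthExponent(linearOps, 1024));    // close to 1
console.log(growthExponent(quadraticOps, 1024)); // close to 2
```

The constants 1000 and 3 cancel out of the ratio, which is the sense in which Big-O describes the shape of the curve and not the absolute performance.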

Generally, Big-O notation is useful to express an algorithm's scaling performance as it falls into one of three basic categories:
Linear
Logarithmic (or the related "linearithmic", n log n)
Exponential
It is possible to do deep analysis of an algorithm for very accurate performance measurements, but it is time consuming and not really necessary to get a broad indication of performance.

Analyzing algorithms - Why only time complexity?

I was learning about algorithms and time complexity, and this question just sprung into my mind.
Why do we only analyze an algorithm's time complexity?
My question is, shouldn't there be another metric for analyzing an algorithm? Say I have two algos A and B.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
So this means that the time for A and B both grow slower than cn grows for two separate constants c=c1 and c=c2. But in my very limited experience with algorithms, we've always ignored this constant term and just focused on the growth. But isn't it very important while choosing between my given example of A and B? Over here c1<<c2 so Algo A is much better than Algo B.
Or am I overthinking at an early stage and proper analysis will come later on? What is it called?
OR is my whole concept of time complexity wrong and in my given example both can't have O(n) time?

We worry about the order of growth because it provides a useful abstraction to the behaviour of the algorithm as the input size goes to infinity.
The constants "hidden" by the O notation are important, but they're also difficult to calculate because they depend on factors such as:
the particular programming language that is being used to implement the algorithm
the specific compiler that is being used
the underlying CPU architecture
We can try to estimate these, but in general it's a lost cause unless we make some simplifying assumptions and work on some well defined model of computation, like the RAM model.
But then, we're back into the world of abstractions, precisely where we started!

We measure lots of other types of complexity.
Space (Memory usage)
Circuit Depth / Size
Network Traffic / Amount of Interaction
IO / Cache Hits
But I guess you're talking more about a "don't the constants matter?" approach. Yes, they do. The reason it's useful to ignore the constants is that they keep changing. Different machines perform different operations at different speeds. You have to walk the line between useful in general and useful on your specific machine.

It's not always time. There's also space.
As for the asymptotic time cost/complexity, which O() gives you, if you have a lot of data, then, for example, an O(n^2) algorithm costing n^2 is going to be worse than an O(n) algorithm costing 100*n for n > 100. For smaller n you should prefer this O(n^2) algorithm.
And, obviously, an O(n) algorithm costing 100*n is always worse than an O(n) algorithm costing 10*n.
The details of your problem should contribute to your decision between several possible solutions (choices of algorithms) to it.

A takes 5s for 100 elements, B takes 1000s for 100 elements. But both
have O(n) time.
Why is that?
O(N) is an asymptotic measurement of the number of steps required to execute a program in relation to the program's input.
This means that for really large values of N the complexity of the algorithm grows linearly.
We don't compare X and Y seconds. We analyze how the algorithm behaves as the input grows larger and larger.

O(n) gives you an idea how much slower the same algorithm will be for a different n, not for comparing algorithms.
On the other hand there is also space complexity - how memory usage grows as a function of input n.

How do you show that one algorithm is more efficient than another algorithm?

I'm no professional programmer and I don't study it. I'm an aerospace student and did a numeric method for my diploma thesis and also coded a program to prove that it works.
I did several methods and implemented several algorithms and tried to show the proofs why different situations needed their own algorithm to solve the task.
I did this proof with a mathematical approach, but some algorithm was so specific that I do know what they do and they do it right, but it was very hard to find a mathematical function or something to show how many iterations or loops it has to do until it finishes.
So, I would like to know how you do this comparison. Do you also present a mathematical function, or do you just do a speedtest of both algorithms, and if you do it mathematically, how do you do that? Do you learn this during your university studies, or how?
Thank you in advance, Andreas

The standard way of comparing different algorithms is by comparing their complexity using Big O notation. In practice you would of course also benchmark the algorithms.
As an example, the sorting algorithms bubble sort and heap sort have complexity O(n^2) and O(n log n) respectively.
As a final note it's very hard to construct representative benchmarks, see this interesting post from Christer Ericsson on the subject.

While big-O notation can provide you with a way of distinguishing an awful algorithm from a reasonable algorithm, it only tells you about a particular definition of computational complexity. In the real world, this won't actually allow you to choose between two algorithms, since:
1) Two algorithms at the same order of complexity, let's call them f and g, both with O(N^2) complexity might differ in runtime by several orders of magnitude. Big-O notation does not measure the number of individual steps associated with each iteration, so f might take 100 steps while g takes 10.
In addition, different compilers or programming languages might generate more or less instructions for each iteration of the algorithm, and subtle choices in the description of the algorithm can make cache or CPU hardware perform 10s to 1000s of times worse, without changing either the big-O order, or the number of steps!
2) An O(N) algorithm might outperform an O(log(N)) algorithm
Big-O notation does not measure the number of individual steps associated with each iteration, so if O(N) takes 100 steps, but O(log(N)) takes 1000 steps for each iteration, then for data sets up to a certain size O(N) will be better.
The same issues apply to compilers as above.
The solution is to do an initial mathematical analysis of Big-O notation, followed by a benchmark-driven performance tuning cycle, using time and hardware performance counter data, as well as a good dollop of experience.

Firstly one would need to define what more efficient means, does it mean quicker, uses less system resources (such as memory) etc... (these factors are sometimes mutually exclusive)
In terms of standard definitions of efficiency one would often utilize Big-O notation, however in the "real world" outside academia one would normally profile/benchmark both algorithms and then compare the results.
It's often difficult to make general assumptions about Big-O notation, as this is primarily concerned with looping and assumes a fixed cost for the code within a loop, so benchmarking would be the better way to go.
One caveat to watch out for is that sometimes the result can vary significantly based on the dataset size you're working with - for small N in a loop one will sometimes not find much difference

You might get off easy when there is a significant difference in the asymptotic Big-O complexity class for the worst case or for the expected case. Even then you'll need to show that the hidden constant factors don't make the "better" (from the asymptotic perspective) algorithm slower for reasonably sized inputs.
If the difference isn't large, then given the complexity of today's computers, benchmarking with various datasets is the only correct way. You cannot even begin to take into account all of the convoluted interplay that comes from branch prediction accuracy, data and code cache hit rates, lock contention and so on.

Running speed tests is not going to provide you with as good an answer as mathematics will. I think your outline approach is correct, but perhaps your experience and breadth of knowledge let you down when analysing one of your algorithms. I recommend the book 'Concrete Mathematics' by Knuth and others, but there are a lot of other good (and even more not so good) books covering the topic of analysing algorithms. Yes, I learned this during my university studies.
Having written all that, most algorithmic complexity is analysed in terms of worst-case execution time (so called big-O), and it is possible that your data sets do not approach the worst case, in which case the speed tests you run may illuminate your actual performance rather than the algorithm's theoretical performance. So tests are not without their value. I'd say, though, that the value is secondary to that of the mathematics, which shouldn't cause you any undue headaches.
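As an illustration of data that never approaches the worst case, here is a small sketch that counts key comparisons in a textbook insertion sort. The choice of insertion sort and the function name are mine, not something from the question. It is only meant to show how far apart the worst case and a friendly input can be for the same algorithm.

```python
import random


def insertion_sort_comparisons(values):
    """Sort a copy with insertion sort and return the number of key comparisons."""
    data = list(values)
    comparisons = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        # Shift larger elements right until the slot for key is found.
        while j >= 0:
            comparisons += 1
            if data[j] <= key:
                break
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = key
    return comparisons


if __name__ == "__main__":
    n = 1000
    print("already sorted:", insertion_sort_comparisons(range(n)))
    print("reversed:      ", insertion_sort_comparisons(range(n, 0, -1)))
    print("shuffled:      ", insertion_sort_comparisons(random.sample(range(n), n)))
```

For n = 1000 the already sorted input needs n - 1 = 999 comparisons, while the reversed input needs n(n - 1)/2 = 499500. The worst-case bound O(n^2) is correct, but it says nothing about which of these your data resembles.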

That depends. At university you do learn to compare algorithms by calculating the number of operations they execute depending on the size / value of their arguments. (See analysis of algorithms and big O notation.) I would require every decent programmer to at least understand the basics of that.
However, in practice this is useful only for small algorithms, or small parts of larger algorithms. You will have trouble calculating this for, say, a parsing algorithm for an XML document. But knowing the basics often keeps you from making braindead errors; see, for instance, Joel Spolsky's amusing blog entry "Back to Basics".
If you have a larger system you will usually either compare algorithms by educated guessing or by making time measurements, or find the troublesome spots in your system by using a profiling tool. In my experience this is rarely that important; fighting to reduce the complexity of the system helps more.

To answer your question: "Do you also present a mathematical function, or do you just do a speedtest of both algorithms?"
Yes to both - let's summarize.
The "Big O" method discussed above refers to the worst case performance as Mark mentioned above. The "speedtest" you mention would be a way to estimate "average case performance". In practice, there may be a BIG difference between worst case performance and average case performance. This is why your question is interesting and useful.
Worst case performance has been the classical way of defining and classifying algorithm performance. More recently, research has been more concerned with average case performance or, more precisely, with performance bounds like: 99% of the problems will require fewer than N operations. You can imagine why the second case is far more practical for most problems.
Depending on the application, you might have very different requirements. One application may require response time to be less than 3 seconds 95% of the time - this would lead to defining performance bounds. Another might require performance to NEVER exceed 5 seconds - this would lead to analyzing worst case performance.
In both cases this is taught at the university or grad school level. Anyone developing new algorithms used in real-time applications should learn about the difference between average and worst case performance and should also be prepared to develop simulations and analysis of algorithm performance as part of an implementation process.
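A rough sketch of the two kinds of requirement, checked against measured response times. The helper names and the sample data are invented for illustration, and the percentile uses the simple nearest-rank definition. Note the limitation: the largest observed sample is only a lower bound on the true worst case, so a "NEVER exceed" requirement still needs analysis and not just measurement.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_percentile_bound(samples, pct, limit_seconds):
    """True if pct% of the observed response times are within the limit."""
    return percentile(samples, pct) <= limit_seconds


def meets_worst_case_bound(samples, limit_seconds):
    """True if no observed response time exceeds the limit."""
    return max(samples) <= limit_seconds


if __name__ == "__main__":
    # Made-up measurements: 19 fast responses and one slow outlier.
    response_times = [0.5] * 19 + [6.0]
    print(meets_percentile_bound(response_times, 95, 3.0))  # 95% under 3 seconds?
    print(meets_worst_case_bound(response_times, 5.0))      # never over 5 seconds?
```

With these made-up numbers the first check passes and the second fails: a single outlier is invisible to the 95% bound but breaks the worst-case one.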
Hope this helps.

Big O notation gives you the complexity of an algorithm in the worst case, and is mainly useful for knowing how the algorithm's execution time will grow as the amount of data it has to process grows. For example (C-style syntax; the exact language is not important):
List<int> ls = new List<int>();          (1) O(1)
for (int i = 0; i < ls.count; i++)       (2) O(1)
    foo(i);                              (3) O(log n) (efficient function)
Cost analysis:
(1) cost: O(1), constant cost
(2) cost: O(1), (int i = 0;)
          O(1), (i < ls.count)
          O(1), (i++)
          ---- total: O(1) (constant cost), but it repeats n times (ls.count)
(3) cost: O(log n) (assume it, just an example),
          but it repeats n times as it is inside the loop
So, in asymptotic notation, it will have a cost of O(n log n) (not as efficient), which in this example is a reasonable result. But take this example:
List<int> ls = new List<int>();          (1) O(1)
for (int i = 0; i < ls.count; i++)       (2) O(1)
    if ((i % 2) == 0)                    (*) O(1) (constant cost)
        foo(i);                          (3) O(log n)
It is the same algorithm, with one new line adding a condition. In this case asymptotic notation will take the worst case and reach the same conclusion as above, O(n log n), even though it is easy to see that step (3) will execute only half as many times (the constant factor of 1/2 is simply dropped).
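The two loops above can be checked by just counting how often foo would be called. This is only a sketch in Python, not the C-style code from the example: foo itself is replaced by a counter, and the function names are made up.

```python
def calls_without_condition(n):
    """Number of foo(i) calls made by the first loop for a list of n items."""
    calls = 0
    for i in range(n):
        calls += 1  # stands in for foo(i)
    return calls


def calls_with_condition(n):
    """Number of foo(i) calls made by the second loop (even i only)."""
    calls = 0
    for i in range(n):
        if i % 2 == 0:
            calls += 1  # stands in for foo(i)
    return calls


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, calls_without_condition(n), calls_with_condition(n))
```

The second count is always about half of the first, yet both loops are O(n log n) once the O(log n) cost of foo is included. Big O hides exactly that factor of two.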
The data and so on are only examples and may not be exact; I am just trying to illustrate the behaviour of Big O notation. It mainly tells you how your algorithm behaves as the data grows (whether your algorithm is linear, exponential, logarithmic, ...), but this is not what everybody means by "efficiency", or at least it is not the only meaning of "efficiency".
However, this method can detect intractable algorithms, that is, algorithms that need a gigantic amount of time even for fairly small inputs (think of factorials, for example, or very big matrices).
If you want a real-world efficiency study, you may prefer to collect some real-world data and run a real-world benchmark of how your algorithm behaves on that data. It is not a mathematical approach, but it will be more precise in the majority of cases (but not in the worst case! ;) ).
Hope this helps.

Assuming speed (not memory) is your primary concern, and assuming you want an empirical (not theoretical) way to compare algorithms, I would suggest you prepare several datasets differing in size by a wide margin, like 3 orders of magnitude. Then run each algorithm against every dataset, clock them, and plot the results. The shape of each algorithm's time vs. dataset size curve will give a good idea of its big-O performance.
Now, if the size of your datasets in practice is pretty well known, an algorithm with better big-O performance is not necessarily faster. To determine which algorithm is faster for a given dataset size, you need to tune each one's performance until it is "as fast as possible" and then see which one wins. Performance tuning requires profiling, or single-stepping at the instruction level, or my favorite technique, stackshots.
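Here is a minimal sketch of the first step, using Python's standard timeit module. The helper names time_algorithm and loglog_slope are made up for this example. Instead of eyeballing a plot, it fits a straight line to log(time) against log(size); a slope near 1 suggests roughly linear growth, and a slope near 2 suggests roughly quadratic growth. Timings are noisy, so treat the slope as a hint and not a proof.

```python
import math
import timeit


def time_algorithm(func, make_input, sizes, repeats=3):
    """Return {size: best wall-clock seconds} for func on inputs of each size."""
    results = {}
    for n in sizes:
        data = make_input(n)
        # Each run gets a fresh copy so in-place algorithms see unsorted input.
        # The copy itself is included in the measured time.
        timer = timeit.Timer(lambda: func(list(data)))
        results[n] = min(timer.repeat(repeat=repeats, number=1))
    return results


def loglog_slope(results):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in results]
    ys = [math.log(t) for t in results.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    import random

    sizes = [1_000, 10_000, 100_000, 1_000_000]  # three orders of magnitude
    timings = time_algorithm(
        sorted, lambda n: [random.random() for _ in range(n)], sizes
    )
    for n, seconds in timings.items():
        print(f"{n:>9} items: {seconds:.6f} s")
    print("estimated growth exponent:", round(loglog_slope(timings), 2))
```

Taking the minimum of several repeats is a common way to reduce the influence of other processes on the machine, but it does not remove cache or allocator effects, which is partly the point of the second paragraph above.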

As others have rightly pointed out, a common way is to use Big O notation.
But Big O is only good as long as you are considering the processing performance of algorithms that are clearly defined and scoped (such as a bubble sort).
It's when other hardware resources, or other software running in parallel, come into play that the part called engineering kicks in. The hardware has its constraints. Memory and disk are limited resources. Disk performance even depends on the mechanics involved.
An operating system scheduler will, for instance, distinguish between I/O-bound and CPU-bound work to improve the total performance for a given application. A DBMS will take into account disk reads and writes, memory and CPU usage, and even networking in the case of clusters.
These things are hard to prove mathematically but are often easily benchmarked against a set of usage patterns.
So I guess the answer is that developers use both theoretical methods such as Big O and benchmarking to determine the speed of algorithms and their implementations.

This is usually expressed with big O notation. Basically you pick a simple function (like n^2 where n is the number of elements) that dominates the actual number of iterations.
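A small sketch of what "dominates" means here, using a loop over all pairs of elements as an assumed example. The exact iteration count is n(n - 1)/2, which is messier than n^2 but never larger than it, so the loop is O(n^2).

```python
def pair_iterations(n):
    """Count the iterations of a nested loop over all pairs i < j."""
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            count += 1
    return count


if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = pair_iterations(n)
        print(n, exact, n * n, exact <= n * n)
```

For n = 10 the loop body runs 45 times, against a bound of 100.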
