I am just curious to know: what is the way to measure the complexity of an algorithm implemented with two different SIMD approaches, say SSE3 and CUDA? Normally we compare algorithm complexity with Big-O notation. Is there a similar way to compare the runtime improvement with SIMD?
If somebody asks how much an algorithm will improve if you run it on a GPU, can you measure it theoretically, without running the benchmark on both CPU and GPU?
Note: I understand what Big-O is. So all I want to know is how SSE3 performs compared to CUDA or a CPU-based implementation of the same algorithm, without raw benchmarking.
Big-O notation is, for the most part, inapplicable to CPU instructions.
Instructions belong to the realm of the lowly hardware, the impure flesh of the computer. Computer science does not concern itself with such crude notions.
(Actually, the term "Computer Science" is a misnomer. There is this widely circulated quote that "Computer science is no more about computers than astronomy is about telescopes." It is misattributed to Edsger Dijkstra, but actually it originates from Michael R. Fellows, read about it here: https://en.wikiquote.org/wiki/Computer_science)
In any case, if you insist on thinking of algorithms as being executed by CPU instructions, and if you also insist on reasoning about the run-time complexity of instructions, then you have to think of memory accesses.
You need to first come up with some "unit of work" that is comparable between SSE3 and CUDA, and then you need to set up some mechanism so that you can measure
how the number of memory accesses of SSE3 increases in relation to the amount of work to do, and
how the number of memory accesses of CUDA increases in relation to the same amount of work to do.
That would be quite hard to accomplish, and my guess is that the results would be quite linear: no matter which one makes more or fewer memory accesses, the number of accesses would change linearly with the amount of work to do in both cases.
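A minimal sketch of that counting exercise (the vector widths are real, but the per-element access counts are an assumed, simplified cost model, not a measurement):

#include <cstdio>

int main() {
    // Assumed cost model: one load and one store per element for a streaming
    // element-wise kernel. SSE3 moves 4 floats per 128-bit access; the CUDA
    // line assumes one coalesced access per 32-thread warp. All three counts
    // grow linearly in N -- only the constant factor differs.
    for (long long n = 1'000'000; n <= 1'000'000'000; n *= 10) {
        long long scalar = 2 * n;
        long long sse3   = 2 * (n / 4);
        long long cuda   = 2 * (n / 32);
        std::printf("n=%lld  scalar=%lld  sse3=%lld  cuda=%lld\n",
                    n, scalar, sse3, cuda);
    }
}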
Big-O generally speaks about how an algorithm scales as the number of items considered, N, grows.
So, for an O(N) algorithm, if you have 10 times the data, you expect roughly 10 times the run-time. If you have an O(n log₂ n) algorithm, 10 times the data gives you somewhat more than 10 times the work (the extra factor is log₂(10n)/log₂(n), which shrinks toward 1 as n grows).
Roughly speaking, CUDA and GPUs parallelize operations across p cores. The total work done is then given by W=pt, where p is the number of cores and t the time complexity of each core's operation. You could, for instance, sort N items using √N processors each doing O(√N log N) work. The total time complexity of the algorithm is still O(N log N), but the algorithm is thought of as "work-efficient" because its total time complexity is the same (or less than) the best known serial algorithm. So Big-O analysis can tell you something about how algorithms run in parallel environments.
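As a quick check of the work bound above: W = p · t = √N · O(√N log N) = O(N log N), which matches the best known serial comparison sort.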
However, it is not the tool you are looking for. Big-O is for talking about how algorithms scale: it can't tell us how an algorithm will perform on different hardware.
For a theoretical approach to performance, you'll want to look into Amdahl's law and Gustafson's law, which provide simple metrics of how much improvement we can expect in programs when their resources are increased. They both boil down to acknowledging that the total increase in speed is a function of the portion of the program we can speed up and the factor by which we can speed it up.
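A minimal sketch of what the two laws predict, assuming (purely for illustration) a program whose parallelizable fraction is 90%:

#include <cstdio>

int main() {
    const double p = 0.9;   // assumed parallelizable fraction (illustration only)
    for (int cores = 1; cores <= 1024; cores *= 4) {
        double amdahl    = 1.0 / ((1.0 - p) + p / cores);  // fixed problem size
        double gustafson = (1.0 - p) + p * cores;          // problem grows with cores
        std::printf("cores=%4d  Amdahl=%6.2fx  Gustafson=%7.2fx\n",
                    cores, amdahl, gustafson);
    }
}

Amdahl's law caps the speedup at 1/(1 - p) = 10x here no matter how many cores you add; Gustafson's law keeps improving because it assumes the problem size grows with the core count.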
Can you do better than that? Yes. There are more sophisticated models you can use to determine how your code might perform in different environments; looking into the roofline model might get you started.
Beyond this, your question gets too specific to be answered well here. Running the benchmarks is likely to be your personal best option.
Some of the answers (and maybe your question) reflect basic misconceptions about Big-O. O(f(n)) is just a classification of mathematical functions. It has nothing to do a priori with run time or anything else. The definition of big-O makes this pretty clear.
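For reference, the definition being alluded to is the standard one: f(n) ∈ O(g(n)) if and only if there exist constants K > 0 and n₀ such that |f(n)| ≤ K·|g(n)| for all n ≥ n₀.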
With the right mathematical machinery, sure, what you ask is certainly possible.
To use big-O in the context of algorithms, you must first describe the function that you're classifying. A classical example is sorting algorithms. Here a function frequently used is the number of comparisons f(n) needed to sort a list of given length n. When we say (rather imprecisely) that a sorting algorithm is O(n log n), we're usually saying that when n is big enough, the number of comparisons needed to sort is capped by K n log(n) where K is some positive constant.
You can also use big-O in the context of run time, but you must also stipulate a formal machine model. This is a mathematically precise abstraction of a real machine. There are many of these. For many purposes that don't involve parallel processing, the Word RAM or Real RAM is used, but there are other options.
When we say "an algorithm is O(g(n))", what we really mean, most of the time, is "the number of Word RAM clock cycles needed to run the algorithm on input of size n is capped by K g(n) for some constant K, given that n is large enough."
That is, with an abstract machine model in hand, the function classified by big-O is just the number of clock cycles of the abstract machine as a function of input size.
For SIMD computation, the PRAM is one commonly used abstract machine model. Like all abstract machine models, it makes simplifying assumptions. For example, both memory size and number of processors are unlimited. It also allows for different strategies to resolve memory contention among processors. Just as for the Word RAM, the big-O performance of an algorithm running on a PRAM is a statement about the number of clock cycles needed with respect to input size.
There has been some work on abstract models of GPUs that allow big O to approximate useful functions describing algorithm performance. It remains a topic of research.
So comparing sequential vs. parallel performance, or parallel performance on different architectures, is a matter of choosing the right abstract models, describing a program for each that implements the algorithm you're interested in, and then comparing big-O descriptions of the number of clock cycles each requires.
Using SIMD/SSE3 does not change the Big-O complexity, because it executes only a constant number of operations (up to 16) at a time.
O(Function(N)/Const) === O(Function(N))
The same is true for CUDA, despite thousands of cores working in parallel (if we ignore the different memory access model and assume the CUDA implementation does not use a different algorithm).
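To make the SSE half of that concrete, here is a minimal sketch (assuming an x86 compiler with SSE enabled): the vectorized loop runs roughly N/4 iterations instead of N, a constant factor that Big-O discards.

#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#include <cstddef>

// Scalar version: N iterations.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SSE version: about N/4 iterations -- still O(N).
void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)  // scalar tail for the remaining n % 4 elements
        out[i] = a[i] + b[i];
}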
I learnt that growth rate is often used to gauge the runtime and efficiency of an algorithm. My question is: why use growth rate instead of the exact (or approximate) relation between runtime and input size?
Edit:
Thanks for the responses. I would like to clarify what I meant by "relation between the runtime and input size" as it is a little vague.
From what I understand, growth rate is the gradient of the runtime against input size. So a growth rate of n^2 would give an equation of the form t = k(n^2) + constant. Given that the equation is more informative (as it includes constants) and shows a direct relation to the time needed, I thought it would be preferred.
I do understand that as n increases, constants soon become irrelevant, and that depending on the computing configuration, k will be different. Perhaps that is why it is sufficient to just work with the growth rate.
The algorithm isn't the only factor affecting actual running time
Things like programming language, optimizations, branch prediction, I/O speed, paging, processing speed, etc. all come into play.
One language / machine / whatever may certainly have advantages over another, so every algorithm needs to be executed under the exact same conditions.
Beyond that, one algorithm may outperform another in C when considering input and output residing in RAM, but the other may outperform the first in Python when considering input and output residing on disk.
There will no doubt be little to no chance of agreement on the exact conditions that should be used to perform all the benchmarking, and even if such agreement could be reached, it would certainly be irresponsible to use 5-year-old benchmarking results in today's computing world. So these results would need to be recreated for all algorithms on a regular basis - a massive, very time-consuming task.
Algorithms have varying constant factors
In the extreme case, the constant factors of certain algorithms are so high that other asymptotically slower algorithms outperform it on all reasonable inputs in the modern day. If we merely go by running time, the fact that these algorithms would outperform the others on larger inputs may be lost.
In the less extreme case, we'll get results that will be different at other input sizes because of the constant factors involved - we may see one algorithm as faster in all our tests, but as soon as we hit some input size, the other may become faster.
The running times of some algorithms depend greatly on the input
Basic quicksort on already sorted data, for example, takes O(n^2), while it takes O(n log n) on average.
One can certainly determine the best and worst cases and run the algorithm for those, but the average case is something that can only be determined through mathematical analysis - you can't run it for 'the average case'. You could run it a bunch of times on random input and take the average of that, but that's fairly imprecise.
So a rough estimate is sufficient
Because of the above reasons, it makes sense to just say an algorithm is, for example, O(n^2), which very roughly means that, if we're dealing with a large enough input size, it would take 4 times longer if the input size doubles. If you've been paying attention, you'll know that the actual time taken could be quite different from 4 times longer, but it at least gives us some idea - we won't expect it to take twice as long, nor 10 times longer (although it might under extreme circumstances). We can also reasonably expect, for example, an O(n log n) algorithm to outperform an O(n^2) algorithm for large n, which is a useful comparison, and may make it easier to see what's going on than some more exact representation would.
You can use both types of measures. In practice, it can be useful to measure performance with specific inputs that you are likely to work with (benchmarking), but it is also quite useful to know the asymptotic behavior of algorithms, as that tells us the (space/time) cost in the general case of "very large inputs" (technically, n->infinity).
Remember that in many cases, the main term of the runtime often far outweighs the importance of lower-order terms, especially as n takes on large values. Therefore, we can summarize or abstract away information by giving a "growth rate" or bound on the algorithm's performance, instead of working with the "exact" runtime or space requirements. Exact in quotes because the constants for the various terms of your runtime may very much vary between runs, between machines - basically different conditions will produce different "constants". In summary, we are interested in asymptotic algorithm behavior because it is still very useful and machine-agnostic.
Growth rate is a relation between the run time of the algorithm and the size of its input. However, this measure is not expressed in units of time, because the technology quickly makes these units obsolete. Only 20 years ago, a microsecond wasn't a lot of time; if you work with embedded systems, it is still not all that much. On the other hand, on mainstream computers with clock speeds of over a gigahertz a microsecond is a lot of time.
An algorithm does not become faster if you run it on faster hardware. If you say, for example, that an algorithm takes eight milliseconds for an input of size 100, the information is meaningless until you say on what computer you ran your computations: it could be a slow algorithm running on fast hardware, a fast algorithm running on slow hardware, or anything in between.
If you also say that it takes, say, 32 milliseconds for an input of size 200, it would be more meaningful, because the reader would be able to derive the growth rate: the reader would know that doubling the input size quadruples the time, which is a nice thing to know. However, you might as well simply specify that your algorithm is O(n^2).
I was learning about algorithms and time complexity, and this question just sprang to mind.
Why do we only analyze an algorithm's time complexity?
My question is, shouldn't there be another metric for analyzing an algorithm? Say I have two algos A and B.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
So this means that the times for A and B both grow no faster than c·n, for two separate constants c=c1 and c=c2. But in my very limited experience with algorithms, we've always ignored this constant term and just focused on the growth. But isn't it very important when choosing between my given examples A and B? Here c1 << c2, so algorithm A is much better than algorithm B.
Or am I overthinking at an early stage and proper analysis will come later on? What is it called?
OR is my whole concept of time complexity wrong and in my given example both can't have O(n) time?
We worry about the order of growth because it provides a useful abstraction to the behaviour of the algorithm as the input size goes to infinity.
The constants "hidden" by the O notation are important, but they're also difficult to calculate because they depend on factors such as:
the particular programming language that is being used to implement the algorithm
the specific compiler that is being used
the underlying CPU architecture
We can try to estimate these, but in general it's a lost cause unless we make some simplifying assumptions and work on some well defined model of computation, like the RAM model.
But then, we're back into the world of abstractions, precisely where we started!
We measure lots of other types of complexity.
Space (Memory usage)
Circuit Depth / Size
Network Traffic / Amount of Interaction
IO / Cache Hits
But I guess you're talking more about a "don't the constants matter?" approach. Yes, they do. The reason it's useful to ignore the constants is that they keep changing. Different machines perform different operations at different speeds. You have to walk the line between useful in general and useful on your specific machine.
It's not always time. There's also space.
As for the asymptotic time cost/complexity, which O() gives you: if you have a lot of data, then, for example, an algorithm whose running time is n^2 is going to be worse than one whose running time is 100*n once n > 100; for smaller n you should prefer the n^2 one, even though it is O(n^2) versus O(n).
And, obviously, a running time of 100*n is always worse than 10*n, even though both are O(n).
The details of your problem should contribute to your decision between several possible solutions (choices of algorithms) to it.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
Why is that?
O(N) is an asymptotic measure of the number of steps required to execute a program in relation to the program's input.
This means that for really large values of N the complexity of the algorithm is linear growth.
We don't compare X and Y seconds. We analyze how the algorithm behaves as the input grows larger and larger.
O(n) gives you an idea of how much slower the same algorithm will be for a different n; it is not meant for comparing algorithms.
On the other hand there is also space complexity - how memory usage grows as a function of input n.
I've just read an article about a breakthrough on matrix multiplication; an algorithm that is O(n^2.373). But I guess matrix multiplication is something that can be parallelized. So, if we ever start producing thousandth-cores processors, will this become irrelevant? How would things change?
Parallel execution doesn't change the basics of the complexity for a particular algorithm -- at best, you're just taking the time for some given size, and dividing by the number of cores. This may reduce time for a given size by a constant factor, but has no effect on the algorithm's complexity.
At the same time, parallel execution does sometimes change which algorithm(s) you want to use for particular tasks. Some algorithms that work well in serial code just don't split up into parallel tasks very well. Others that have higher complexity might be faster for practical-sized problems because they run better in parallel.
For an extremely large number of cores, the complexity of the calculation itself may become secondary to simply getting the necessary data to/from all the cores to do the calculation. Most computations of big-O don't take these effects into account for a serial calculation, but they can become quite important for parallel calculations, especially for some models of parallel machines that don't give uniform access to all nodes.
If quantum computing comes to something practical some day, then yes, complexity of algorithms will change.
In the meantime, parallelizing an algorithm, with a fixed number of processors, just divides its runtime proportionally (and that, in the best case, when there are no dependencies between the tasks performed at every processor). That means, dividing the runtime by a constant, and so the complexity remains the same.
By Amdahl's law, for the same size of problem, parallelization will come to a point of diminishing return with the increase in the number of cores (theoretically). In reality, from a certain degree of parallelization, the overhead of parallelization will actually decrease the performance of the program.
However, by Gustafson's law, the increase in the number of cores does help as the size of the problem increases. That is the motivation behind cluster computing: as we have more computing power, we can tackle problems at a larger scale or with better precision with the help of parallelization.
Algorithms that we learn as-is may or may not be parallelizable. Sometimes a separate algorithm must be used to efficiently execute the same task in parallel. Either way, the Big-O analysis must be redone for the parallel case to take into consideration the effect of parallelization on the time complexity of the algorithm.
I'm no professional programmer and I don't study it. I'm an aerospace student and did a numeric method for my diploma thesis and also coded a program to prove that it works.
I did several methods and implemented several algorithms and tried to show the proofs why different situations needed their own algorithm to solve the task.
I did this proof with a mathematical approach, but some algorithms were so specific that I know what they do and that they do it right, yet it was very hard to find a mathematical function or something to show how many iterations or loops they have to do until they finish.
So, I would like to know how you do this comparison. Do you also present a mathematical function, or do you just do a speed test of both algorithms? And if you do it mathematically, how do you do that? Do you learn this during your university studies, or where?
Thank you in advance, Andreas
The standard way of comparing different algorithms is by comparing their complexity using Big O notation. In practice you would of course also benchmark the algorithms.
As an example, the sorting algorithms bubble sort and heap sort have complexities O(n^2) and O(n log n) respectively.
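For a sense of the gap: at n = 1,000,000, n^2 is 10^12 while n log₂ n is roughly 2×10^7, so the asymptotically better algorithm does on the order of 50,000 times less work.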
As a final note it's very hard to construct representative benchmarks, see this interesting post from Christer Ericsson on the subject.
While big-O notation can provide you with a way of distinguishing an awful algorithm from a reasonable algorithm, it only tells you about a particular definition of computational complexity. In the real world, this won't actually allow you to choose between two algorithms, since:
1) Two algorithms of the same order of complexity - let's call them f and g, both with O(N^2) complexity - might differ in runtime by several orders of magnitude. Big-O notation does not measure the number of individual steps associated with each iteration, so f might take 100 steps while g takes 10.
In addition, different compilers or programming languages might generate more or less instructions for each iteration of the algorithm, and subtle choices in the description of the algorithm can make cache or CPU hardware perform 10s to 1000s of times worse, without changing either the big-O order, or the number of steps!
2) An O(N) algorithm might outperform an O(log(N)) algorithm
Big-O notation does not measure the number of individual steps associated with each iteration, so if O(N) takes 100 steps, but O(log(N)) takes 1000 steps for each iteration, then for data sets up to a certain size O(N) will be better.
The same issues apply to compilers as above.
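A quick sketch of where that crossover lands, using the hypothetical per-iteration step counts above (100 steps for the O(N) algorithm, 1000 for the O(log N) one):

#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical costs from the text: 100 steps per item for the O(N)
    // algorithm, 1000 steps per level for the O(log N) algorithm.
    for (long long n = 2; n <= 1'000'000; ++n) {
        double linear      = 100.0 * n;
        double logarithmic = 1000.0 * std::log2(static_cast<double>(n));
        if (linear > logarithmic) {
            // With these constants the O(log N) algorithm pulls ahead near N = 59.
            std::printf("crossover at N = %lld\n", n);
            break;
        }
    }
}

Below that point the "worse" O(N) algorithm is the faster one, which is exactly the situation described above.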
The solution is to do an initial mathematical analysis of Big-O notation, followed by a benchmark-driven performance tuning cycle, using time and hardware performance counter data, as well as a good dollop of experience.
Firstly one would need to define what "more efficient" means: does it mean quicker, using less system resources (such as memory), etc.? (These factors are sometimes mutually exclusive.)
In terms of standard definitions of efficiency one would often use Big-O notation; however, in the "real world" outside academia one would normally profile/benchmark both implementations and then compare the results.
It's often difficult to make general assumptions about Big-O notation, as it is primarily concerned with looping and assumes a fixed cost for the code within a loop, so benchmarking would be the better way to go.
One caveat to watch out for is that the result can sometimes vary significantly based on the dataset size you're working with - for small N in a loop one will sometimes not find much difference.
You might get off easy when there is a significant difference in the asymptotic Big-O complexity class for the worst case or for the expected case. Even then you'll need to show that the hidden constant factors don't make the "better" (from the asymptotic perspective) algorithm slower for reasonably sized inputs.
If the difference isn't large, then given the complexity of today's computers, benchmarking with various datasets is the only correct way. You cannot even begin to take into account all of the convoluted interplay that comes from branch prediction accuracy, data and code cache hit rates, lock contention and so on.
Running speed tests is not going to provide you with as good quality an answer as mathematics will. I think your outlined approach is correct -- but perhaps your experience and breadth of knowledge let you down when analysing one of your algorithms. I recommend the book 'Concrete Mathematics' by Knuth and others, but there are a lot of other good (and even more not-so-good) books covering the topic of analysing algorithms. Yes, I learned this during my university studies.
Having written all that, most algorithmic complexity is analysed in terms of worst-case execution time (so-called big-O), and it is possible that your data sets do not approach worst cases, in which case the speed tests you run may illuminate your actual performance rather than the algorithm's theoretical performance. So tests are not without their value. I'd say, though, that the value is secondary to that of the mathematics, which shouldn't cause you any undue headaches.
That depends. At the university you do learn to compare algorithms by calculating the number of operations it executes depending on the size / value of its arguments. (Compare analysis of algorithms and big O notation). I would require of every decent programmer to at least understand the basics of that.
However, in practice this is useful only for small algorithms, or small parts of larger algorithms. You will have trouble calculating this for, say, a parsing algorithm for an XML document. But knowing the basics often keeps you from making braindead errors - see, for instance, Joel Spolsky's amusing blog entry "Back to the Basics".
If you have a larger system you will usually either compare algorithms by educated guessing, making time measurements, or find the troublesome spots in your system by using a profiling tool. In my experience this is rarely that important - fighting to reduce the complexity of the system helps more.
To answer your question: " Do you also present a mathematical function, or do you just do a speedtest of both algorithms."
Yes to both - let's summarize.
The "Big O" method discussed above refers to the worst case performance as Mark mentioned above. The "speedtest" you mention would be a way to estimate "average case performance". In practice, there may be a BIG difference between worst case performance and average case performance. This is why your question is interesting and useful.
Worst case performance has been the classical way of defining and classifying algorithm performance. More recently, research has been more concerned with average case performance or more precisely performance bounds like: 99% of the problems will require less than N operations. You can imagine why the second case is far more practical for most problems.
Depending on the application, you might have very different requirements. One application may require response time to be less than 3 seconds 95% of the time - this would lead to defining performance bounds. Another might require performance to NEVER exceed 5 seconds - this would lead to analyzing worst case performance.
In both cases this is taught at the university or grad school level. Anyone developing new algorithms used in real-time applications should learn about the difference between average and worst case performance and should also be prepared to develop simulations and analysis of algorithm performance as part of an implementation process.
Hope this helps.
Big O notation gives you the complexity of an algorithm in the worst case, and is mainly useful for knowing how the algorithm's execution time will grow as the amount of data it has to process grows. For example (C-style syntax, this is not important):
List<int> ls = new List<int>(); (1) O(1)
for (int i = 0; i < ls.count; i++) (2) O(1)
foo(i); (3) O(log n) (efficient function)
Cost analysis:
(1) cost: O(1), constant cost
(2) cost: O(1), (int i = 0;)
O(1), (i < ls.count)
O(1), (i++)
---- total: O(1) (constant cost), but it repeats n times (ls.count)
(3) cost: O(log n) (assume it, just an example),
but it repeats n times as it is inside the loop
So, in asymptotic notation, it will have a cost of O(n log n) (not as efficient), which in this example is a reasonable result, but take this example:
List<int> ls = new List<int>(); (1) O(1)
for (int i = 0; i < ls.count; i++) (2) O(1)
if ((i mod 2) == 0)                  (*) O(1) (constant cost)
foo(i); (3) O(log n)
Same algorithm, but with one new line adding a condition. In this case the asymptotic analysis takes the worst case and concludes the same result as above, O(n log n), even though it is easy to see that step (3) will execute only half the time.
The data and so on are only examples and may not be exact; I am just trying to illustrate the behaviour of Big O notation. It mainly tells you how your algorithm behaves when the data grows (whether your algorithm is linear, exponential, logarithmic, ...), but this is not what everybody means by "efficiency" - or rather, it is not the only meaning of "efficiency".
However, this method can detect intractable algorithms, that is, algorithms that will need a gigantic amount of time even in their early steps (think of factorials, for example, or very big matrices).
If you want a real-world efficiency study, maybe you would prefer to gather some real-world data and do a real-world benchmark of the behaviour of your algorithm on that data. It is not a mathematical approach, but it will be more precise in the majority of cases (though not in the worst case! ;) ).
Hope this helps.
Assuming speed (not memory) is your primary concern, and assuming you want an empirical (not theoretical) way to compare algorithms, I would suggest you prepare several datasets differing in size by a wide margin, like 3 orders of magnitude. Then run each algorithm against every dataset, clock them, and plot the results. The shape of each algorithm's time vs. dataset size curve will give a good idea of its big-O performance.
Now, if the size of your datasets in practice is pretty well known, an algorithm with better big-O performance is not necessarily faster. To determine which algorithm is faster for a given dataset size, you need to tune each one's performance until it is "as fast as possible" and then see which one wins. Performance tuning requires profiling, or single-stepping at the instruction level, or my favorite technique, stackshots.
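A minimal sketch of that procedure (the algorithm under test here is a placeholder std::sort call, and the sizes are arbitrary):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Placeholder for whichever algorithm you actually want to compare.
void algorithm_under_test(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

int main() {
    std::mt19937 rng(42);
    // Dataset sizes spanning about three orders of magnitude, as suggested above.
    for (std::size_t n : {1'000u, 10'000u, 100'000u, 1'000'000u}) {
        std::vector<int> data(n);
        for (int& x : data) x = static_cast<int>(rng());

        auto t0 = std::chrono::steady_clock::now();
        algorithm_under_test(data);
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("n=%zu  time=%.3f ms\n", n, ms);
    }
}

Plotting time against n on a log-log scale makes the growth rate easy to read off: a straight line of slope 1 looks linear, slope 2 looks quadratic, and so on.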
As others have rightfully pointed out, a common way is to use Big O notation.
But Big O is only good as long as you consider the processing performance of algorithms that are clearly defined and scoped (such as a bubble sort).
It's when other hardware resources or other software running in parallel come into play that the part called engineering kicks in. The hardware has its constraints. Memory and disk are limited resources. Disk performance even depends on the mechanics involved.
An operating system scheduler will, for instance, differentiate between I/O-bound and CPU-bound work to improve the total performance for a given application. A DBMS will take into account disk reads and writes, memory and CPU usage, and even networking in the case of clusters.
These things are hard to prove mathematically but are often easily benchmarked against a set of usage patterns.
So I guess the answer is that developers use both theoretical methods, such as Big O, and benchmarking to determine the speed of algorithms and their implementations.
This is usually expressed with big O notation. Basically you pick a simple function (like n^2, where n is the number of elements) that dominates the actual number of iterations.
What are some examples where Big-O notation[1] fails in practice?
That is to say: when will the Big-O running time of algorithms predict algorithm A to be faster than algorithm B, yet in practice algorithm B is faster when you run it?
Slightly broader: when do theoretical predictions about algorithm performance mismatch observed running times? A non-Big-O prediction might be based on the average/expected number of rotations in a search tree, or the number of comparisons in a sorting algorithm, expressed as a factor times the number of elements.
Clarification:
Despite what some of the answers say, the Big-O notation is meant to predict algorithm performance. That said, it's a flawed tool: it only speaks about asymptotic performance, and it blurs out the constant factors. It does this for a reason: it's meant to predict algorithmic performance independent of which computer you execute the algorithm on.
What I want to know is this: when do the flaws of this tool show themselves? I've found Big-O notation to be reasonably useful, but far from perfect. What are the pitfalls, the edge cases, the gotchas?
An example of what I'm looking for: running Dijkstra's shortest path algorithm with a Fibonacci heap instead of a binary heap, you get O(m + n log n) time versus O((m+n) log n), for n vertices and m edges. You'd expect a speed increase from the Fibonacci heap sooner or later, yet said speed increase never materialized in my experiments.
(Experimental evidence, without proof, suggests that binary heaps operating on uniformly random edge weights spend O(1) time rather than O(log n) time; that's one big gotcha for the experiments. Another one that's a bitch to count is the expected number of calls to DecreaseKey).
[1] Really it isn't the notation that fails, but the concepts the notation stands for, and the theoretical approach to predicting algorithm performance. </anti-pedantry>
On the accepted answer:
I've accepted an answer to highlight the kind of answers I was hoping for. Many different answers which are just as good exist :) What I like about the answer is that it suggests a general rule for when Big-O notation "fails" (when cache misses dominate execution time) which might also increase understanding (in some sense I'm not sure how to best express ATM).
It fails in exactly one case: When people try to use it for something it's not meant for.
It tells you how an algorithm scales. It does not tell you how fast it is.
Big-O notation doesn't tell you which algorithm will be faster in any specific case. It only tells you that for sufficiently large input, one will be faster than the other.
When N is small, the constant factor dominates. Looking up an item in an array of five items is probably faster than looking it up in a hash table.
Short answer: When n is small. The Traveling Salesman Problem is quickly solved when you only have three destinations (however, finding the smallest number in a list of a trillion elements can last a while, although this is O(n). )
The canonical example is quicksort, which has a worst-case time of O(n^2), while heapsort's is O(n log n). In practice, however, quicksort is usually faster than heapsort. Why? Two reasons:
Each iteration in quicksort is a lot simpler than in heapsort. Even more, it's easily optimized by simple cache strategies.
The worst case is very hard to hit.
But IMHO, this doesn't mean 'big O fails' in any way. The first factor (iteration time) is easy to incorporate into your estimates; after all, big O numbers should be multiplied by this almost-constant factor.
The second factor melts away if you get the amortized figures instead of the average. They can be harder to estimate, but they tell a more complete story.
One area where Big O fails is memory access patterns. Big O only counts operations that need to be performed - it can't keep track if an algorithm results in more cache misses or data that needs to be paged in from disk. For small N, these effects will typically dominate. For instance, a linear search through an array of 100 integers will probably beat out a search through a binary tree of 100 integers due to memory accesses, even though the binary tree will most likely require fewer operations. Each tree node would result in a cache miss, whereas the linear search would mostly hit the cache for each lookup.
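A small sketch of that comparison (absolute timings will vary by machine and compiler; the point is only that the contiguous array is cache-friendly while the node-based tree chases pointers):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <set>
#include <vector>

int main() {
    const int n = 100;
    std::vector<int> arr(n);
    std::iota(arr.begin(), arr.end(), 0);        // contiguous: linear search
    std::set<int> tree(arr.begin(), arr.end());  // node-based tree: O(log n) compares

    const int reps = 1'000'000;
    long long sink = 0;  // accumulate results so the loops aren't optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        sink += *std::find(arr.begin(), arr.end(), r % n);
    auto t1 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        sink += *tree.find(r % n);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("linear array: %.1f ms, tree: %.1f ms (checksum %lld)\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count(),
        std::chrono::duration<double, std::milli>(t2 - t1).count(), sink);
}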
Big-O describes the efficiency/complexity of the algorithm and not necessarily the running time of the implementation of a given block of code. This doesn't mean Big-O fails. It just means that it's not meant to predict running time.
Check out the answer to this question for a great definition of Big-O.
For most algorithms there is an "average case" and a "worst case". If your data routinely falls into the "worst case" scenario, it is possible that another algorithm, while theoretically less efficient in the average case, might prove more efficient for your data.
Some algorithms also have best cases that your data can take advantage of. For example, some sorting algorithms have a terrible theoretical efficiency, but are actually very fast if the data is already sorted (or nearly so). Another algorithm, while theoretically faster in the general case, may not take advantage of the fact that the data is already sorted and in practice perform worse.
For very small data sets sometimes an algorithm that has a better theoretical efficiency may actually be less efficient because of a large "k" value.
One example (that I'm not an expert on) is that simplex algorithms for linear programming have exponential worst-case complexity on arbitrary inputs, even though they perform well in practice. An interesting solution to this is considering "smoothed complexity", which blends worst-case and average-case performance by looking at small random perturbations of arbitrary inputs.
Spielman and Teng (2004) were able to show that the shadow-vertex simplex algorithm has polynomial smoothed complexity.
Big O does not say e.g. that algorithm A runs faster than algorithm B. It can say that the time or space used by algorithm A grows at a different rate than algorithm B, when the input grows. However, for any specific input size, big O notation does not say anything about the performance of one algorithm relative to another.
For example, A may be slower per operation, but have a better big-O than B. B is more performant for smaller input, but if the data size increases, there will be some cut-off point where A becomes faster. Big-O in itself does not say anything about where that cut-off point is.
The general answer is that Big-O allows you to be really sloppy by hiding the constant factors. As mentioned in the question, the use of Fibonacci heaps is one example. Fibonacci heaps do have great asymptotic runtimes, but in practice the constant factors are way too large to be useful for the sizes of data sets encountered in real life.
Fibonacci Heaps are often used in proving a good lower bound for asymptotic complexity of graph-related algorithms.
Another similar example is the Coppersmith-Winograd algorithm for matrix multiplication. It is currently the algorithm with the fastest known asymptotic running time for matrix multiplication, O(n^2.376). However, its constant factor is far too large to be useful in practice. Like Fibonacci heaps, it's frequently used as a building block in other algorithms to prove theoretical time bounds.
This somewhat depends on what the Big-O is measuring - when it's worst case scenarios, it will usually "fail" in that the runtime performance will be much better than the Big-O suggests. If it's average case, then it may be much worse.
Big-O notation typically "fails" if the input data to the algorithm has some prior information. Often, the Big-O notation refers to the worst case complexity - which will often happen if the data is either completely random or completely non-random.
As an example, if you feed data to an algorithm that's profiled and the big-o is based on randomized data, but your data has a very well-defined structure, your result times may be much faster than expected. On the same token, if you're measuring average complexity, and you feed data that is horribly randomized, the algorithm may perform much worse than expected.
Small N - for today's computers, 100 is likely too small to worry about.
Hidden multipliers - e.g. merge sort vs. quicksort.
Pathological cases - again, merge sort vs. quicksort.
One broad area where Big-Oh notation fails is when the amount of data exceeds the available amount of RAM.
Using sorting as an example, the amount of time it takes to sort is not dominated by the number of comparisons or swaps (of which there are O(n log n) and O(n), respectively, in the optimal case). The amount of time is dominated by the number of disk operations: block writes and block reads.
To better analyze algorithms which handle data in excess of available RAM, the I/O-model was born, where you count the number of disk reads. In that, you consider three parameters:
The number of elements, N;
The amount of memory (RAM), M (the number of elements that can be in memory); and
The size of a disk block, B (the number of elements per block).
Notably absent is the amount of disk space; this is treated as if it were infinite. A typical extra assumption is that M > B^2.
Continuing the sorting example, you typically favor merge sort in the I/O case: divide the elements into chunks of size θ(M) and sort them in memory (with, say, quicksort). Then, merge θ(M/B) of them by reading the first block from each chunk into memory, stuff all the elements into a heap, and repeatedly pick the smallest element until you have picked B of them. Write this new merge block out and continue. If you ever deplete one of the blocks you read into memory, read a new block from the same chunk and put it into the heap.
(All expressions should be read as being big θ.) You form N/M sorted chunks, which you then merge. You merge log_{M/B}(N/M) times; each time you read and write all N/B blocks, so it takes you (N/B) * log_{M/B}(N/M) time.
You can analyze in-memory sorting algorithms (suitably modified to include block reads and block writes) and see that they're much less efficient than the merge sort I've presented.
This knowledge is courtesy of my I/O-algorithms course, by Arge and Brodal (http://daimi.au.dk/~large/ioS08/); I also conducted experiments which validate the theory: heap sort takes "almost infinite" time once you exceed memory. Quick sort becomes unbearably slow, merge sort barely bearably slow, I/O-efficient merge sort performs well (the best of the bunch).
I've seen a few cases where, as the data set grew, the algorithmic complexity became less important than the memory access pattern. Navigating a large data structure with a smart algorithm can, in some cases, cause far more page faults or cache misses, than an algorithm with a worse big-O.
For small n, two algorithms may be comparable. As n increases, the smarter algorithm outperforms. But, at some point, n grows big enough that the system succumbs to memory pressure, in which case the "worse" algorithm may actually perform better because the constants are essentially reset.
This isn't particularly interesting, though. By the time you reach this inversion point, the performance of both algorithms is usually unacceptable, and you have to find a new algorithm that has a friendlier memory access pattern AND a better big-O complexity.
This question is like asking, "When does a person's IQ fail in practice?" It's clear that having a high IQ does not mean you'll be successful in life, and having a low IQ does not mean you'll perish. Yet we measure IQ as a means of assessing potential, even if it's not an absolute.
In algorithms, the Big-Oh notation gives you the algorithm's IQ. It doesn't necessarily mean that the algorithm will perform best for your particular situation, but there's some mathematical basis that says this algorithm has some good potential. If Big-Oh notation were enough to measure performance you would see a lot more of it and less runtime testing.
Think of Big-Oh as a range instead of a specific measure of better-or-worse. There's best case scenarios and worst case scenarios and a huge set of scenarios in between. Choose your algorithms by how well they fit within the Big-Oh range, but don't rely on the notation as an absolute for measuring performance.
When your data doesn't fit the model, big-o notation will still work, but you're going to see an overlap from best and worst case scenarios.
Also, some operations are tuned for linear data access vs. random data access, so one algorithm, while superior in terms of cycles, might be doggedly slow if the way it is called changes from the design. Similarly, if an algorithm causes page/cache misses due to the way it accesses memory, Big-O isn't going to give an accurate estimate of the cost of running a process.
Apparently, as I've forgotten, also when N is small :)
The short answer: always on modern hardware when you start using a lot of memory. The textbooks assume memory access is uniform, and it is no longer. You can of course do Big O analysis for a non-uniform access model, but that is somewhat more complex.
The small n cases are obvious but not interesting: fast enough is fast enough.
In practice I've had problems using the standard collections in Delphi, Java, C# and Smalltalk with a few million objects, and with smaller ones where the dominant factor proved to be the hash function or the compare function.
Robert Sedgewick talks about shortcomings of the big-O notation in his Coursera course on Analysis of Algorithms. He calls particularly egregious examples galactic algorithms because while they may have a better complexity class than their predecessors, it would take inputs of astronomical sizes for it to show in practice.
https://www.cs.princeton.edu/~rs/talks/AlgsMasses.pdf
Big O and its brothers are used to compare the asymptotic growth of mathematical functions. I would like to emphasize the mathematical part. It's entirely about being able to reduce your problem to a function whose input grows, a.k.a. scales. It gives you a nice plot where your input (x-axis) is related to the number of operations performed (y-axis). This is purely based on the mathematical function and as such requires us to accurately model the algorithm as a polynomial of sorts, and then to assume scaling.
Big O immediately loses its relevance when the data is of finite, fixed, constant size, which is why nearly all embedded programmers don't even bother with big O. Mathematically everything comes out to O(1), but we know that we need to optimize our code for space and for a MHz timing budget at a level where big O simply doesn't work. This optimization happens at a level where the individual components matter, because of their direct performance dependence on the system.
Big O's other failure is in its assumption that hardware differences do not matter. A CPU that has a MAC unit, an MMU, and/or low-latency bit-shift and math operations will outperform on some tasks that may be falsely identified as higher order in the asymptotic notation. This is simply a limitation of the model itself.
Another common case where big O becomes absolutely irrelevant is when we falsely identify the nature of the problem to be solved and end up with a binary tree when in reality the solution is a state machine. The entire algorithm regimen often overlooks finite state machine problems. This is because a state machine's complexity grows with the number of states, not with the number of inputs or the amount of data, which in most cases are constant.
The other aspect here is memory access itself, which is an extension of the problem of being disconnected from the hardware and execution environment. Many times a memory optimization gives a performance optimization and vice versa; they are not necessarily mutually exclusive. These relations cannot be easily modeled with simple polynomials. A theoretically bad algorithm running on heap data (a region of memory, not the heap data structure) will often outperform a theoretically good algorithm running on data on the stack. This is because there is a time and space cost to memory access and storage efficiency that is not part of the mathematical model in most cases, and even when it is modeled it is often dropped as lower-order terms - terms that can have a large impact when there are enough of them.
Imagine n^3 + 86n^2 + 5*10^6 n^2 + 10^9 n
It's clear that the lower-order terms with high coefficients will likely, taken together, have larger significance than the highest-order term, which is all the big O model keeps. It would have us ignore everything other than n^3. The phrase "sufficiently large n" is completely abused to imagine unrealistic scenarios that justify the algorithm. For this case, n has to be so large that you will run out of physical memory long before you have to worry about the algorithm itself. The algorithm doesn't matter if you can't even store the data. When memory access is modeled in, the lower-order terms may end up looking like the polynomial above, with over 100 highly scaled lower-order terms. However, for all practical purposes these terms are never even part of the equation that the algorithm is trying to define.
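To put rough numbers on that: with the coefficients above, the n^3 term only overtakes 5*10^6 n^2 once n exceeds 5*10^6, and only overtakes 10^9 n once n exceeds √(10^9) ≈ 31,600, so across a wide range of realistic input sizes the "negligible" lower-order terms are what you actually pay for.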
Most scientific notations are generally the description of mathematical functions and used to model something. They are tools. As such the utility of the tool is constrained and only as good as the model itself. If the model cannot describe or is an ill fit to the problem at hand, then the model simply doesn't serve the purpose. This is when a different model needs to be used and when that doesn't work, a direct approach may serve your purpose well.
In addition, many of the original algorithms were models of the Turing machine, which has a completely different working mechanism, whereas all computing today follows the RASP model. Before you go into big O or any other model, ask yourself this question first: "Am I choosing the right model for the task at hand, and do I have the most practically accurate mathematical function?" If the answer is 'No', then go with your experience and intuition and ignore the fancy stuff.