Analysis with Parallel Algorithms

Analysis with Parallel Algorithms - algorithm

Everyone knows that bubblesort is O(n^2), but this is based on the number of comparisons needed to sort this. I have a question in which, if I didn't care about the number of comparisons, but the output time, then how do you do analysis of this? Is there a way to do analysis on output time instead of comparisons?
For example, if you could have bubble sort and have parallel comparisons happening at for all pairs (even then odd comparisons), then the throughput time would be something like 2n-1 throughput time. The number of comparisons would be high, but I don't care as the final throughput time is quick.
So in essence, is there a common analysis for overall parallel performance time? If so, just give me some key terms and I'll learn the rest from google.

Parallel programming is a bit of red herring here. Making assumptions about run time only on big O notation can be misleading. To compare run times of algorithms you need the full equation not just the big O notation.
The problem is that big O notation tells you the dominating term as n goes to infinity. But the run time is on finite ranges of n. This is easy to understand from continuous mathematics (my background).
Consider y=Ax and y=Bx^2 Big O notation would tell you that y=Bx^2 is slower. However, between (0,A/B) it's less than y=Ax. In this case it could be faster to use the O(x^2) algorithm than the O(x) algorithm for x<A/B.
In fact I have heard of sorting algorithms which start off with a O(nlogn) algorithm and then switch to a O(n^2) logarithm when n is sufficiently small.
The best example is matrix multiplication. The naïve algorithm is O(n^3) but there are algorithms that get that down to O(n^2.3727). However, every algorithm I have seen has such a large constant that the naïve O(n^3) is still the fastest algorithm for all particle values of n and that does not look likely to change any time soon.
So really what you need is the full equation to compare. Something like An^3 (let's ignore lower order terms) and Bn^2.3727. In this case B is so incredibly large that the O(n^3) method always wins.
Parallel programming usually just lowers the constant. For example when I do matrix multiplication using four cores my time goes from An^3 to A/4 n^3. The same thing will happen with your parallel bubble sort. You will decrease the constant. So it's possible that for some range of values of n that your parallel bubble sort will beat a non-parallel (or possibly even parallel) merge sort. Though, unlike matrix multiplication, I think the range will be pretty small.

Algorithm analysis is not meant to give actual run times. That's what benchmarks are for. Analysis tells you how much relative complexity is in a program, but the actual run time for that program depends on so many other factors that strict analysis can't guarantee real-world times. For example, what happens if your OS decides to suspend your program to install updates? Your run time will be shot. Even running the same program over and over yields different results due to the complexity of computer systems (memory page faults, virtual machine garbage collection, IO interrupts, etc). Analysis can't take these into account.
This is why parallel processing doesn't usually come under consideration during analysis. The mechanism for "parallelizing" a program's components is usually external to your code, and usually based on a probabilistic algorithm for scheduling. I don't know of a good way to do static analysis on that. Once again, you can run a bunch of benchmarks and that will give you an average run time.

The time efficiency we get by parallel steps can be measured by round complexity. Where each round consists of parallel steps occurring at the same time. By doing so, we can see how effective the throughput time is, in similar analysis that we are used to.

Related

Right way to discuss computational complexity for small n

When discussing computational complexity, it seems everyone generally goes straight to Big O.
Lets say for example I have a hybrid algorithm such as merge sort which uses insertion sort for smaller subarrays (I believe this is called tiled merge sort). It's still ultimately merge sort with O(n log n), but I want to discuss the behaviour/characteristics of the algorithm for small n, in cases where no merging actually takes place.
For all intents and purposes the tiled merge sort is insertion sort, executing exactly the same instructions for the domain of my small n. However, Big O deals with the large and asymptotic cases and discussing Big O for small n is pretty much an oxymoron. People have yelled at me for even thinking the words "behaves like an O(n^2) algorithm in such cases". What is the correct way to describe the algorithm's behaviour in cases of small n within the context of formal theoretical computational analysis? To clarify, not just in the case where n is small, but in the case where n is never big.
One might say that for such small n it doesn't matter but I'm interested in the cases where it does, for example with a large constant such as being executed many times, and where in practice it would show a clear trend and be the dominant factor. For example the initial quadratic growth seen in the graph below. I'm not dismissing Big O, more asking for a way to properly tell both sides of the story.
[EDIT]
If for "small n", constants can easily remove all trace of a growth rate then either
only the asymptotic case is discussed, in which case there is less relevance to any practical application, or
there must be a threshold at which we agree n is no longer "small".
What about the cases where n is not "small" (n is sufficiently big that the growth rate will not to affected significantly by any practical constant), but not yet big enough to show the final asymptotic growth rate so only sub growth rates are visible (for example the shape in the image above)?
Are there no practical algorithms that exhibit this behaviour? Even if there aren't, theoretical discussion should still be possible. Do we measure instead of discussing the theory purely because that's "what one should do"? If some behaviour is observed in all practical cases, why can't there be theory that's meaningful?
Let me turn the question around the other way. I have a graph that shows segmented super-linear steps. It sounds like many people would say "this is a pure coincidence, it could be any shape imaginable" (at the extreme of course) and wouldn't bat an eyelid if it were a sine wave instead. I know in many cases the shape could be hidden by constants, but here it's quite obvious. How can I give a formal explanation of why the graph produces this shape?
I particularly like #Sneftel's words "imprecise but useful guidance".
I know Big O and asymptotic analysis isn't applicable. What is? How far can I take it?
Discuss in chat

For small n, computation complexity - how things change as n increases towards infinity - isn't meaningful as other effects dominate.
Papers I've seen which discuss behaviour for small values of n do so by measuring the algorithms on real systems, and discuss how the algorithms perform in practice rather than from a theoretical viewpoint. For example, for the graph you've added to your post I would say 'this graph demonstrates an O(N) asymptotic behaviour overall, but the growth within each tile is bounded quadratic'.
I don't know of a situation where a discussion of such behaviour from a theoretical viewpoint would be meaningful - it is well known that for small n the practical effects outweigh the effects of scaling.

It's important to remember that asymptotic analysis is an analytic simplification, not a mandate for analyzing algorithms. Take selection sort, for instance. Yes, it executes in O(n^2) time. But it is also true that it performs precisely n*(n-1)/2 comparisons, and n-1-k swaps, where k is the number of elements (other than the maximum) which start in the correct position. Asymptotic analysis is a tool for simplifying the (otherwise generally impractical) task of performance analysis, and one we can put aside if we're not interested in the "really big n" segment.
And you can choose how you express your bounds. Say a function requires precisely n + floor(0.01*2^n) operations. That's exponential time, of course. But one can also say "for data sizes up to 10 elements, this algorithm requires between n and 2*n operations." The latter describes not the shape of the curve, but an envelope around that curve, giving imprecise but useful guidance about the practicalities of the algorithm within a particular range of use cases.

You are right.
For small n, i.e. when only insertion sort is performed, the asymptotic behavior is quadratic O(n^2).
And for larger n, when tiled merge sort enters into play, the behavior switches to O(n.Log(n)).
There is no contradiction if you remember that every behavior has its domain of validity, before the switching threshold, let N, and after it.
In practice there will be a smooth blend between the curves around N. But in practice too, that value of N is so small that the quadratic behavior does not have enough "room" to manifest itself.
Another way to deal with this analysis is to say that N being a constant, the insertion sorts take constant time. But I would disagree to say that this is a must.

Let's unpack things a bit. Big-O is a tool for describing the growth rate of a function. One of the functions to which it is commonly applied is the worst-case running time of an algorithm on inputs of length n, executing on a particular abstract machine. We often forget about the last part because there is a large class of machines with random-access memory that can emulate one another with only constant-factor slowdown, and the class of problems solvable within a particular big-O running-time bound is equivalent across these machines.
If you want to talk about complexity on small inputs, then you need to measure constant factors. One way is to measure running times of actual implementations. People usually do this on physical machines, but if you're hardcore like Knuth, you invent your own assembly language complete with instruction timings. Another way is to measure something that's readily identifiable but also a useful proxy for the other work performed. For comparison sorts, this could be comparisons. For numerical algorithms, this is often floating-point operations.

Complexity is not about execution time for one n on one machine, so there is no need to consider it even if constant is large. Complexity tells you how the size of the input affects execution time. For small n, you can treat execution time as constant. This is the one side.
From the second side you are saying that:
You have a hybrid algorithm working in O(n log n) for n larger than some k and O(n^2) for n smaller than k.
The constant k is so large that algorithm works slowly.
There is no sense in such algorithm, because you could easily improve it.
Lets take Red-black tree. Operations on this tree are performed in O(n log n) time complexity, but there is a large constant. So, on normal machines, it could work slowly (i.e. slower than simpler solutions) in some cases. There is no need to consider it in analyzing complexity. You need to consider it when you are implementing it in your system: you need to check if it's the best choice considering the actual machine(s) on which it will be working and what problems it will be solving.

Read Knuth's "The Art of Computer Programming series", starting with "Volume 1. Fundamental Algorithms", section "1.2.10: Analysis of an Algorithm". There he shows (and in all the rest of his seminal work) how exact analysis can be conducted for arbitrary problem sizes, using a suitable reference machine, by taking a detailed census of every processor instruction.
Such analyses have to take into account not only the problem size, but also any relevant aspect of the input data distribution which will influence the running time. For simplification, the analysis are often limited to the study of the worst case, the expected case or the output-sensitive case, rather than a general statistical characterization. And for further simplification, asymptotic analysis is used.
Not counting the fact that except for trivial algorithms the detailed approach is mathematically highly challenging, it has become unrealistic on modern machines. Indeed, it relies on a processor behavior similar to the so-called RAM model, which assumes constant time per instruction and per memory access (http://en.wikipedia.org/wiki/Random-access_machine). Except maybe for special hardware combined to a hard real-time OS, these assumptions are nowadays completely wrong.

When you have an algorithm with a time complexity say O(n^2).And you also have an another algorithm with a time complexity, say O(n).Then from these two time complexity you can't conclude that the latter algorithm is faster than the former one for all input values.You can only say latter algorithm is asymptotically(means for sufficiently large input values)faster than the former one.Here you have to keep in mind the fact that in case of asymptotic notations constant factors are generally ignored to increase the understand-ability of the time complexity of the algorithm.As example: marge sort runs in O(nlogn) worst-case time and insertion sort runs in O(n^2) worst case time.But as the hidden constant factors in insertion sort is smaller than that of marge sort, in practice insertion sort can be faster than marge sort for small problem sizes on many machines.
Asymptotic notation does not describe the actual running-time of an algorithm.Actual running time is dependent on machine as different machine has different architecture and different Instruction Cycle Execution time.Asymptotic notation just describes asymptotically how fast an algorithm is with respect to other algorithms.But it does not describe the behavior of the algorithm in case of small input values(n<=no).The value of no (threshold) is dependent on the hidden constant factors and lower order terms.And hidden constant factors are dependent on the machine on which it will be executed.

Analyzing algorithms - Why only time complexity?

I was learning about algorithms and time complexity, and this quesiton just sprung into my mind.
Why do we only analyze an algorithm's time complexity?
My question is, shouldn't there be another metric for analyzing an algorithm? Say I have two algos A and B.
A takes 5s for 100 elements, B takes 1000s for 100 elements. But both have O(n) time.
So this means that the time for A and B both grow slower than cn grows for two separate constants c=c1 and c=c2. But in my very limited experience with algorithms, we've always ignored this constant term and just focused on the growth. But isn't it very important while choosing between my given example of A and B? Over here c1<<c2 so Algo A is much better than Algo B.
Or am I overthinking at an early stage and proper analysis will come later on? What is it called?
OR is my whole concept of time complexity wrong and in my given example both can't have O(n) time?

We worry about the order of growth because it provides a useful abstraction to the behaviour of the algorithm as the input size goes to infinity.
The constants "hidden" by the O notation are important, but they're also difficult to calculate because they depend on factors such as:
the particular programming language that is being used to implement the algorithm
the specific compiler that is being used
the underlying CPU architecture
We can try to estimate these, but in general it's a lost cause unless we make some simplifying assumptions and work on some well defined model of computation, like the RAM model.
But then, we're back into the world of abstractions, precisely where we started!

We measure lots of other types of complexity.
Space (Memory usage)
Circuit Depth / Size
Network Traffic / Amount of Interaction
IO / Cache Hits
But I guess you're talking more about a "don't the constants matter?" approach. Yes, they do. The reason it's useful to ignore the constants is that they keep changing. Different machines perform different operations at different speeds. You have to walk the line between useful in general and useful on your specific machine.

It's not always time. There's also space.
As for the asymptotic time cost/complexity, which O() gives you, if you have a lot of data, then, for example, O(n2)=n2 is going to be worse than O(n)=100*n for n>100. For smaller n you should prefer this O(n2).
And, obviously, O(n)=100*n is always worse than O(n)=10*n.
The details of your problem should contribute to your decision between several possible solutions (choices of algorithms) to it.

A takes 5s for 100 elements, B takes 1000s for 100 elements. But both
have O(n) time.
Why is that?
O(N) is an asymptotic measurement on the number of steps required to execute a program in relation to the programs input.
This means that for really large values of N the complexity of the algorithm is linear growth.
We don't compare X and Y seconds. We analyze how the algorithm behaves as the input goes larger and larger

O(n) gives you an idea how much slower the same algorithm will be for a different n, not for comparing algorithms.
On the other hand there is also space complexity - how memory usage grows as a function of input n.

Estimation of program execution time from complexity

I want to know, how i can estimate the time that my program will take to execute on my machine (for example a 2.5 Ghz machine), if i have an estimation of its worst case time complexity?
For Example : - If I have a program which is O(n^2), in worst case, and n<100000, how can i know /estimate before writing the actual program/procedure, the time that it will take to execute in seconds?
Wouldn't it be good to know how a program actually performs, and it will also save writing code which eventually turns out to be inefficient!
Help greatly appreciated.

Since big O complexity ignores linear coefficients and smaller terms, it is impossible to estimate the performance of an algorithm given only its big o complexity.
In fact, for any specific N, you cannot predict which of two given algorithms will execute faster.
For example, O(N) is not always faster than O(N*N) since an algorithm that takes 100000000*n steps is O(N) is slower than an algorithm than takes N*N steps for many small values of N.
These linear coefficients and asymptotically smaller terms vary from platform to platform and even amongst algorithms of the same equivalence class (in terms of big O measure). 3
The problem you are trying to use big O notation for is not the one it is designed to solve.

Instead of dealing with complecity, you might want to have a look at Worst Case Execution Time (WCET). This area of research most likely corresponds to what you are looking for.
http://en.wikipedia.org/wiki/Worst-case_execution_time

Multiply N^2 by the time You spend in an iteration of the innermost loop, and You have a ballpark estimate.

How do you show that one algorithm is more efficient than another algorithm?

I'm no professional programmer and I don't study it. I'm an aerospace student and did a numeric method for my diploma thesis and also coded a program to prove that it works.
I did several methods and implemented several algorithms and tried to show the proofs why different situations needed their own algorithm to solve the task.
I did this proof with a mathematical approach, but some algorithm was so specific that I do know what they do and they do it right, but it was very hard to find a mathematical function or something to show how many iterations or loops it has to do until it finishes.
So, I would like to know how you do this comparison. Do you also present a mathematical function, or do you just do a speedtest of both algorithms, and if you do it mathematically, how do you do that? Do you learn this during your university studies, or how?
Thank you in advance, Andreas

The standard way of comparing different algorithms is by comparing their complexity using Big O notation. In practice you would of course also benchmark the algorithms.
As an example the sorting algorithms bubble sort and heap sort has complexity O(n2) and O(n log n) respective.
As a final note it's very hard to construct representative benchmarks, see this interesting post from Christer Ericsson on the subject.

While big-O notation can provide you with a way of distinguishing an awful algorithm from a reasonable algorithm, it only tells you about a particular definition of computational complexity. In the real world, this won't actually allow you to choose between two algorithms, since:
1) Two algorithms at the same order of complexity, let's call them f and g, both with O(N^2) complexity might differ in runtime by several orders of magnitude. Big-O notation does not measure the number of individual steps associated with each iteration, so f might take 100 steps while g takes 10.
In addition, different compilers or programming languages might generate more or less instructions for each iteration of the algorithm, and subtle choices in the description of the algorithm can make cache or CPU hardware perform 10s to 1000s of times worse, without changing either the big-O order, or the number of steps!
2) An O(N) algorithm might outperform an O(log(N)) algorithm
Big-O notation does not measure the number of individual steps associated with each iteration, so if O(N) takes 100 steps, but O(log(N)) takes 1000 steps for each iteration, then for data sets up to a certain size O(N) will be better.
The same issues apply to compilers as above.
The solution is to do an initial mathematical analysis of Big-O notation, followed by a benchmark-driven performance tuning cycle, using time and hardware performance counter data, as well as a good dollop of experience.

Firstly one would need to define what more efficient means, does it mean quicker, uses less system resources (such as memory) etc... (these factors are sometimes mutually exclusive)
In terms of standard definitions of efficiency one would often utilize Big-0 Notation, however in the "real world" outside academia normally one would profile/benchmark both equations and then compare the results
It's often difficult to make general assumptions about Big-0 notation as this is primarily concerned with looping and assumes a fixed cost for the code within a loop so benchmarking would be the better way to go
One caveat to watch out for is that sometimes the result can vary significantly based on the dataset size you're working with - for small N in a loop one will sometimes not find much difference

You might get off easy when there is a significant difference in the asymptotic Big-O complexity class for the worst case or for the expected case. Even then you'll need to show that the hidden constant factors don't make the "better" (from the asymptotic perspective) algorithm slower for reasonably sized inputs.
If difference isn't large, then given the complexity of todays computers, benchmarking with various datasets is the only correct way. You cannot even begin to take into account all of the convoluted interplay that comes from branch prediction accuracy, data and code cache hit rates, lock contention and so on.

Running speed tests is not going to provide you with as good quality an answer as mathematics will. I think your outline approach is correct -- but perhaps your experience and breadth of knowledge let you down when analysing on of your algorithms. I recommend the book 'Concrete Mathematics' by Knuth and others, but there are a lot of other good (and even more not good) books covering the topic of analysing algorithms. Yes, I learned this during my university studies.
Having written all that, most algoritmic complexity is analysed in terms of worst-case execution time (so called big-O) and it is possible that your data sets do not approach worst-cases, in which case the speed tests you run may illuminate your actual performance rather than the algorithm's theoretical performance. So tests are not without their value. I'd say, though, that the value is secondary to that of the mathematics, which shouldn't cause you any undue headaches.

That depends. At the university you do learn to compare algorithms by calculating the number of operations it executes depending on the size / value of its arguments. (Compare analysis of algorithms and big O notation). I would require of every decent programmer to at least understand the basics of that.
However in practice this is useful only for small algorithms, or small parts of larger algorithms. You will have trouble to calculate this for, say, a parsing algorithm for an XML Document. But knowing the basics often keeps you from making braindead errors - see, for instance, Joel Spolskys amusing blog-entry "Back to the Basics".
If you have a larger system you will usually either compare algorithms by educated guessing, making time measurements, or find the troublesome spots in your system by using a profiling tool. In my experience this is rarely that important - fighting to reduce the complexity of the system helps more.

To answer your question: " Do you also present a mathematical function, or do you just do a speedtest of both algorithms."
Yes to both - let's summarize.
The "Big O" method discussed above refers to the worst case performance as Mark mentioned above. The "speedtest" you mention would be a way to estimate "average case performance". In practice, there may be a BIG difference between worst case performance and average case performance. This is why your question is interesting and useful.
Worst case performance has been the classical way of defining and classifying algorithm performance. More recently, research has been more concerned with average case performance or more precisely performance bounds like: 99% of the problems will require less than N operations. You can imagine why the second case is far more practical for most problems.
Depending on the application, you might have very different requirements. One application may require response time to be less than 3 seconds 95% of the time - this would lead to defining performance bounds. Another might require performance to NEVER exceed 5 seconds - this would lead to analyzing worst case performance.
In both cases this is taught at the university or grad school level. Anyone developing new algorithms used in real-time applications should learn about the difference between average and worst case performance and should also be prepared to develop simulations and analysis of algorithm performance as part of an implementation process.
Hope this helps.

Big O notation give you the complexity of an algoritm in the worst case, and is mainly usefull to know how the algoritm will grow in execution time when the ammount of data that have to proccess grow up. For example (C-style syntax, this is not important):
List<int> ls = new List<int>(); (1) O(1)
for (int i = 0; i < ls.count; i++) (2) O(1)
foo(i); (3) O(log n) (efficient function)
Cost analysis:
(1) cost: O(1), constant cost
(2) cost: O(1), (int i = 0;)
O(1), (i < ls.count)
O(1), (i++)
---- total: O(1) (constant cost), but it repeats n times (ls.count)
(3) cost: O(log n) (assume it, just an example),
but it repeats n times as it is inside the loop
So, in asymptotic notation, it will have a cost of: O(n log n) (not as efficient) wich in this example is a reasonable result, but take this example:
List<int> ls = new List<int>(); (1) O(1)
for (int i = 0; i < ls.count; i++) (2) O(1)
if ( (i mod 2) == 0) ) (*) O(1) (constant cost)
foo(i); (3) O(log n)
Same algorithm but with a little new line with a condition. In this case asymptotic notation will chose the worst case and will conclude same results as above O(n log n), when is easily detectable that the (3) step will execute only half the times.
Data an so are only examples and may not be exact, just trying to illustrate the behaviour of the Big O notation. It mainly gives you the behaviour of your algoritm when data grow up (you algoritm will be linear, exponencial, logarithmical, ...), but this is not what everybody knows as "efficiency", or almost, this is not the only "efficiency" meaning.
However, this methot can detect "impossible of process" (sorry, don't know the exact english word) algoritms, this is, algoritms that will need a gigantic amount of time to be processed in its early steps (think in factorials, for example, or very big matix).
If you want a real world efficiency study, may be you prefere catching up some real world data and doing a real world benchmark of the beaviour of you algoritm with this data. It is not a mathematical style, but it will be more precise in the majority of cases (but not in the worst case! ;) ).
Hope this helps.

Assuming speed (not memory) is your primary concern, and assuming you want an empirical (not theoretical) way to compare algorithms, I would suggest you prepare several datasets differing in size by a wide margin, like 3 orders of magnitude. Then run each algorithm against every dataset, clock them, and plot the results. The shape of each algorithm's time vs. dataset size curve will give a good idea of its big-O performance.
Now, if the size of your datasets in practice are pretty well known, an algorithm with better big-O performance is not necessarily faster. To determine which algorithm is faster for a given dataset size, you need to tune each one's performance until it is "as fast as possible" and then see which one wins. Performance tuning requires profiling, or single-stepping at the instruction level, or my favorite technique, stackshots.

As others have pointed out rightfully a common way is to use the Big O-notation.
But, the Big O is only good as long as you consider processing performance of algorithms that are clearly defined and scoped (such as a bubble sort).
It's when other hardware resources or other running software running in parallell comes into play that the part called engineering kicks in. The hardware has its constraints. Memory and disk are limited resources. Disk performance even depend on the mechanics involved.
An operating system scheduler will for instance differentiate on I/O bound and CPU bound resources to improve the total performance for a given application. A DBMS will take into account disk reads and writes, memory and CPU usage and even networking in the case of clusters.
These things are hard to prove mathematically but are often easily benchmarked against a set of usage patterns.
So I guess the answer is that developers both use theoretical methods such as Big O and benchmarking to determine the speed of algorithms and its implementations.

This is usually expressed with big O notation. Basically you pick a simple function (like n2 where n is the number of elements) that dominates the actual number of iterations.

When does Big-O notation fail?

What are some examples where Big-O notation[1] fails in practice?
That is to say: when will the Big-O running time of algorithms predict algorithm A to be faster than algorithm B, yet in practice algorithm B is faster when you run it?
Slightly broader: when do theoretical predictions about algorithm performance mismatch observed running times? A non-Big-O prediction might be based on the average/expected number of rotations in a search tree, or the number of comparisons in a sorting algorithm, expressed as a factor times the number of elements.
Clarification:
Despite what some of the answers say, the Big-O notation is meant to predict algorithm performance. That said, it's a flawed tool: it only speaks about asymptotic performance, and it blurs out the constant factors. It does this for a reason: it's meant to predict algorithmic performance independent of which computer you execute the algorithm on.
What I want to know is this: when do the flaws of this tool show themselves? I've found Big-O notation to be reasonably useful, but far from perfect. What are the pitfalls, the edge cases, the gotchas?
An example of what I'm looking for: running Dijkstra's shortest path algorithm with a Fibonacci heap instead of a binary heap, you get O(m + n log n) time versus O((m+n) log n), for n vertices and m edges. You'd expect a speed increase from the Fibonacci heap sooner or later, yet said speed increase never materialized in my experiments.
(Experimental evidence, without proof, suggests that binary heaps operating on uniformly random edge weights spend O(1) time rather than O(log n) time; that's one big gotcha for the experiments. Another one that's a bitch to count is the expected number of calls to DecreaseKey).
[1] Really it isn't the notation that fails, but the concepts the notation stands for, and the theoretical approach to predicting algorithm performance. </anti-pedantry>
On the accepted answer:
I've accepted an answer to highlight the kind of answers I was hoping for. Many different answers which are just as good exist :) What I like about the answer is that it suggests a general rule for when Big-O notation "fails" (when cache misses dominate execution time) which might also increase understanding (in some sense I'm not sure how to best express ATM).

It fails in exactly one case: When people try to use it for something it's not meant for.
It tells you how an algorithm scales. It does not tell you how fast it is.
Big-O notation doesn't tell you which algorithm will be faster in any specific case. It only tells you that for sufficiently large input, one will be faster than the other.

When N is small, the constant factor dominates. Looking up an item in an array of five items is probably faster than looking it up in a hash table.

Short answer: When n is small. The Traveling Salesman Problem is quickly solved when you only have three destinations (however, finding the smallest number in a list of a trillion elements can last a while, although this is O(n). )

the canonical example is Quicksort, which has a worst time of O(n^2), while Heapsort's is O(n logn). in practice however, Quicksort is usually faster then Heapsort. why? two reasons:
each iteration in Quicksort is a lot simpler than Heapsort. Even more, it's easily optimized by simple cache strategies.
the worst case is very hard to hit.
But IMHO, this doesn't mean 'big O fails' in any way. the first factor (iteration time) is easy to incorporate into your estimates. after all, big O numbers should be multiplied by this almost-constant facto.
the second factor melts away if you get the amortized figures instead of average. They can be harder to estimate, but tell a more complete story

One area where Big O fails is memory access patterns. Big O only counts operations that need to be performed - it can't keep track if an algorithm results in more cache misses or data that needs to be paged in from disk. For small N, these effects will typically dominate. For instance, a linear search through an array of 100 integers will probably beat out a search through a binary tree of 100 integers due to memory accesses, even though the binary tree will most likely require fewer operations. Each tree node would result in a cache miss, whereas the linear search would mostly hit the cache for each lookup.

Big-O describes the efficiency/complexity of the algorithm and not necessarily the running time of the implementation of a given block of code. This doesn't mean Big-O fails. It just means that it's not meant to predict running time.
Check out the answer to this question for a great definition of Big-O.

For most algorithms there is an "average case" and a "worst case". If your data routinely falls into the "worst case" scenario, it is possible that another algorithm, while theoretically less efficient in the average case, might prove more efficient for your data.
Some algorithms also have best cases that your data can take advantage of. For example, some sorting algorithms have a terrible theoretical efficiency, but are actually very fast if the data is already sorted (or nearly so). Another algorithm, while theoretically faster in the general case, may not take advantage of the fact that the data is already sorted and in practice perform worse.
For very small data sets sometimes an algorithm that has a better theoretical efficiency may actually be less efficient because of a large "k" value.

One example (that I'm not an expert on) is that simplex algorithms for linear programming have exponential worst-case complexity on arbitrary inputs, even though they perform well in practice. An interesting solution to this is considering "smoothed complexity", which blends worst-case and average-case performance by looking at small random perturbations of arbitrary inputs.
Spielman and Teng (2004) were able to show that the shadow-vertex simplex algorithm has polynomial smoothed complexity.

Big O does not say e.g. that algorithm A runs faster than algorithm B. It can say that the time or space used by algorithm A grows at a different rate than algorithm B, when the input grows. However, for any specific input size, big O notation does not say anything about the performance of one algorithm relative to another.
For example, A may be slower per operation, but have a better big-O than B. B is more performant for smaller input, but if the data size increases, there will be some cut-off point where A becomes faster. Big-O in itself does not say anything about where that cut-off point is.

The general answer is that Big-O allows you to be really sloppy by hiding the constant factors. As mentioned in the question, the use of Fibonacci Heaps is one example. Fibonacci Heaps do have great asymptotic runtimes, but in practice the constants factors are way too large to be useful for the sizes of data sets encountered in real life.
Fibonacci Heaps are often used in proving a good lower bound for asymptotic complexity of graph-related algorithms.
Another similar example is the Coppersmith-Winograd algorithm for matrix multiplication. It is currently the algorithm with the fastest known asymptotic running time for matrix multiplication, O(n2.376). However, its constant factor is far too large to be useful in practice. Like Fibonacci Heaps, it's frequently used as a building block in other algorithms to prove theoretical time bounds.

This somewhat depends on what the Big-O is measuring - when it's worst case scenarios, it will usually "fail" in that the runtime performance will be much better than the Big-O suggests. If it's average case, then it may be much worse.
Big-O notation typically "fails" if the input data to the algorithm has some prior information. Often, the Big-O notation refers to the worst case complexity - which will often happen if the data is either completely random or completely non-random.
As an example, if you feed data to an algorithm that's profiled and the big-o is based on randomized data, but your data has a very well-defined structure, your result times may be much faster than expected. On the same token, if you're measuring average complexity, and you feed data that is horribly randomized, the algorithm may perform much worse than expected.

Small N - And for todays computers, 100 is likely too small to worry.
Hidden Multipliers - IE merge vs quick sort.
Pathological Cases - Again, merge vs quick

One broad area where Big-Oh notation fails is when the amount of data exceeds the available amount of RAM.
Using sorting as an example, the amount of time it takes to sort is not dominated by the number of comparisons or swaps (of which there are O(n log n) and O(n), respectively, in the optimal case). The amount of time is dominated by the number of disk operations: block writes and block reads.
To better analyze algorithms which handle data in excess of available RAM, the I/O-model was born, where you count the number of disk reads. In that, you consider three parameters:
The number of elements, N;
The amount of memory (RAM), M (the number of elements that can be in memory); and
The size of a disk block, B (the number of elements per block).
Notably absent is the amount of disk space; this is treated as if it were infinite. A typical extra assumption is that M > B2.
Continuing the sorting example, you typically favor merge sort in the I/O case: divide the elements into chunks of size θ(M) and sort them in memory (with, say, quicksort). Then, merge θ(M/B) of them by reading the first block from each chunk into memory, stuff all the elements into a heap, and repeatedly pick the smallest element until you have picked B of them. Write this new merge block out and continue. If you ever deplete one of the blocks you read into memory, read a new block from the same chunk and put it into the heap.
(All expressions should be read as being big θ). You form N/M sorted chunks which you then merge. You merge log (base M/B) of N/M times; each time you read and write all the N/B blocks, so it takes you N/B * (log base M/B of N/M) time.
You can analyze in-memory sorting algorithms (suitably modified to include block reads and block writes) and see that they're much less efficient than the merge sort I've presented.
This knowledge is courtesy of my I/O-algorithms course, by Arge and Brodal (http://daimi.au.dk/~large/ioS08/); I also conducted experiments which validate the theory: heap sort takes "almost infinite" time once you exceed memory. Quick sort becomes unbearably slow, merge sort barely bearably slow, I/O-efficient merge sort performs well (the best of the bunch).

I've seen a few cases where, as the data set grew, the algorithmic complexity became less important than the memory access pattern. Navigating a large data structure with a smart algorithm can, in some cases, cause far more page faults or cache misses, than an algorithm with a worse big-O.
For small n, two algorithms may be comparable. As n increases, the smarter algorithm outperforms. But, at some point, n grows big enough that the system succumbs to memory pressure, in which case the "worse" algorithm may actually perform better because the constants are essentially reset.
This isn't particularly interesting, though. By the time you reach this inversion point, the performance of both algorithms is usually unacceptable, and you have to find a new algorithm that has a friendlier memory access pattern AND a better big-O complexity.

This question is like asking, "When does a person's IQ fail in practice?" It's clear that having a high IQ does not mean you'll be successful in life and having a low IQ does not mean you'll perish. Yet, we measure IQ as a means of assessing potential, even if its not an absolute.
In algorithms, the Big-Oh notation gives you the algorithm's IQ. It doesn't necessarily mean that the algorithm will perform best for your particular situation, but there's some mathematical basis that says this algorithm has some good potential. If Big-Oh notation were enough to measure performance you would see a lot more of it and less runtime testing.
Think of Big-Oh as a range instead of a specific measure of better-or-worse. There's best case scenarios and worst case scenarios and a huge set of scenarios in between. Choose your algorithms by how well they fit within the Big-Oh range, but don't rely on the notation as an absolute for measuring performance.

When your data doesn't fit the model, big-o notation will still work, but you're going to see an overlap from best and worst case scenarios.
Also, some operations are tuned for linear data access vs. random data access, so one algorithm while superior in terms of cycles, might be doggedly slow if the method of calling it changes from design. Similarly, if an algorithm causes page/cache misses due to the way it access memory, Big-O isn't going to going to give an accurate estimate of the cost of running a process.
Apparently, as I've forgotten, also when N is small :)

The short answer: always on modern hardware when you start using a lot of memory. The textbooks assume memory access is uniform, and it is no longer. You can of course do Big O analysis for a non-uniform access model, but that is somewhat more complex.
The small n cases are obvious but not interesting: fast enough is fast enough.
In practice I've had problems using the standard collections in Delphi, Java, C# and Smalltalk with a few million objects. And with smaller ones where the dominant factor proved to be the hash function or the compare

Robert Sedgewick talks about shortcomings of the big-O notation in his Coursera course on Analysis of Algorithms. He calls particularly egregious examples galactic algorithms because while they may have a better complexity class than their predecessors, it would take inputs of astronomical sizes for it to show in practice.
https://www.cs.princeton.edu/~rs/talks/AlgsMasses.pdf

Big O and its brothers are used to compare asymptotic mathematical function growth. I would like to emphasize on the mathematical part. Its entirely about being able reduce your problem to a function where the input grows a.k.a scales. It gives you a nice plot where your input (x axis) related to the number of operations performed(y-axis). This is purely based on the mathematical function and as such requires us to accurately model the algorithm used into a polynomial of sorts. Then the assumption of scaling.
Big O immediately loses its relevance when the data is finite, fixed and constant size. Which is why nearly all embedded programmers don't even bother with big O. Mathematically this will always come out to O(1) but we know that we need to optimize our code for space and Mhz timing budget at a level that big O simply doesn't work. This is optimization is on the same order where the individual components matter due to their direct performance dependence on the system.
Big O's other failure is in its assumption that hardware differences do not matter. A CPU that has a MAC, MMU and/or a bit shift low latency math operations will outperform some tasks which may be falsely identified as higher order in the asymptotic notation. This is simply because of the limitation of the model itself.
Another common case where big O becomes absolutely irrelevant is where we falsely identify the nature of the problem to be solved and end up with a binary tree when in reality the solution is actually a state machine. The entire algorithm regimen often overlooks finite state machine problems. This is because a state machine complexity grows based on the number of states and not the number of inputs or data which in most cases are constant.
The other aspect here is the memory access itself which is an extension of the problem of being disconnected from hardware and execution environment. Many times the memory optimization gives performance optimization and vice-versa. They are not necessarily mutually exclusive. These relations cannot be easily modeled into simple polynomials. A theoretically bad algorithm running on heap (region of memory not algorithm heap) data will usually outperform a theoretically good algorithm running on data in stack. This is because there is a time and space complexity to memory access and storage efficiency that is not part of the mathematical model in most cases and even if attempted to model often get ignored as lower order terms that can have high impact. This is because these will show up as a long series of lower order terms which can have a much larger impact when there are sufficiently large number of lower order terms which are ignored by the model.
Imagine n3+86n2+5*106n2+109n
It's clear that the lower order terms that have high multiples will likely together have larger significance than the highest order term which the big O model tends to ignore. It would have us ignore everything other than n3. The term "sufficiently large n' is completely abused to imagine unrealistic scenarios to justify the algorithm. For this case, n has to be so large that you will run out of physical memory long before you have to worry about the algorithm itself. The algorithm doesn't matter if you can't even store the data. When memory access is modeled in; the lower order terms may end up looking like the above polynomial with over a 100 highly scaled lower order terms. However for all practical purposes these terms are never even part of the equation that the algorithm is trying to define.
Most scientific notations are generally the description of mathematical functions and used to model something. They are tools. As such the utility of the tool is constrained and only as good as the model itself. If the model cannot describe or is an ill fit to the problem at hand, then the model simply doesn't serve the purpose. This is when a different model needs to be used and when that doesn't work, a direct approach may serve your purpose well.
In addition many of the original algorithms were models of Turing machine that has a completely different working mechanism and all computing today are RASP models. Before you go into big O or any other model, ask yourself this question first "Am I choosing the right model for the task at hand and do I have the most practically accurate mathematical function ?". If the answer is 'No', then go with your experience, intuition and ignore the fancy stuff.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio