Combinations of Asymptotic Time Complexities with Best, Average and Worst Case Inputs

I'm confused by numerous claims that asymptotic notation has nothing to do with best-case, average-case and worst-case time complexity. If this is the case, then presumably the following combinations are all valid:
Time Complexity O(n)
Best case - upper bound for the best case input
For the best possible input, the number of basic operations carried out by this algorithm will never exceed some constant multiple of n.
Average case - upper bound for average case input
For an average input, the number of basic operations carried out by this algorithm will never exceed some constant multiple of n.
Worst case - upper bound for worst case input
For the worst possible input, the number of basic operations carried out by this algorithm will never exceed some constant multiple of n.
Time Complexity: Θ(n)
Best case - tight bound for the best case input
For the best possible input, the number of basic operations carried out by this algorithm will never exceed one constant multiple of n, nor fall below another constant multiple of n.
Average case - tight bound for average case input
For an average input, the number of basic operations carried out by this algorithm will never exceed one constant multiple of n, nor fall below another constant multiple of n.
Worst case - tight bound for worst case input
For the worst possible input, the number of basic operations carried out by this algorithm will never exceed one constant multiple of n, nor fall below another constant multiple of n.
Time Complexity: Ω(n)
Best case - lower bound for the best case input
For the best possible input, the number of basic operations carried out by this algorithm will never be less than some constant multiple of n.
Average case - lower bound for average case input
For an average input, the number of basic operations carried out by this algorithm will never be less than some constant multiple of n.
Worst case - lower bound for worst case input
For the worst possible input, the number of basic operations carried out by this algorithm will never be less than some constant multiple of n.
Which of the above make sense? Which are generally used in practice when assessing the efficiency of an algorithm in terms of time taken to execute as input grows? As far as I can tell, several of them are redundant and/or contradictory.
I'm really not seeing how the concepts of upper, tight and lower bounds have nothing to do with best, average and worst case inputs. This is one of those topics that the further I look into it, the more confused I become. I would be very grateful if someone could provide some clarity on the matter for me.

All of the statements make sense.
In practice, unless otherwise stated, we should be talking about the worst case input in all cases.
We're often concerned about the average case input, although the definition of "average case" gets a little dodgy. It's usually better to talk about the "expected time", since that is a more precise mathematical definition that corresponds to what people usually mean by "average case". People are unfortunately often sloppy here, and you'll often see complexity statements that refer to expected time instead of worst case time, but don't mention it.
We're rarely concerned about the best case input.
Unfortunately, it's also common to see people confusing the notions of best-vs-worst-case input vs upper-vs-lower bounds, especially on SO and other informal sites.
You can always go back to the definitions, as you have already done, to figure out what statements really mean.
"This algorithm runs in X(n2) time" Means that given the function f(n) for the worst-case execution time vs problem size in any environment, that function will be in the set X(n2).
"This algorithm runs in X(n2) expected time" Means that given the function f(n) for the mathematical expectation of execution time vs problem size in any environment, that function will be in the set X(n2).
Finally, note that in any environment actually only applies to specific computing models. We usually assume a random access machine, so complexity statements are not valid for, e.g., Turing machine implementations.
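To make the nine combinations from the question concrete, here is a minimal sketch (in Python, with an illustrative contains_duplicate helper that is not from any of the posts above) of an algorithm whose best and worst cases genuinely differ, so statements of each shape can be checked against it:

    def contains_duplicate(items):
        """Naive pairwise check; returns as soon as a duplicate is found."""
        comparisons = 0
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                comparisons += 1
                if items[i] == items[j]:
                    return True, comparisons
        return False, comparisons

    # Best-case input: a duplicate right at the front -> 1 comparison, so Theta(1).
    print(contains_duplicate([7, 7, 1, 2, 3]))          # (True, 1)

    # Worst-case input: all elements distinct -> n*(n-1)/2 comparisons, so Theta(n^2).
    print(contains_duplicate(list(range(100))))          # (False, 4950)

Note that a loose statement such as "the best case is O(n^2)" is still technically true for this function (an upper bound does not become false by being loose), which is exactly why a Θ claim carries more information whenever a tight bound is known.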

Related

Why big-Oh is not always a worst case analysis of an algorithm?

I am trying to learn analysis of algorithms and I am stuck on the relation between asymptotic notation (Big O...) and the cases (best, worst and average).
I learned that Big O notation defines an upper bound of an algorithm, i.e. it says a function cannot grow faster than its upper bound.
At first this sounded to me like it calculates the worst case.
I googled "why is worst case not Big O?" and got plenty of answers which were not so simple for a beginner to understand.
I concluded as follows:
Big O is not always used to represent the worst-case analysis of an algorithm because, if an algorithm takes O(n) execution steps for the best, average and worst inputs, then its best, average and worst cases can all be expressed as O(n).
Please tell me if I am correct or if I am missing something, as I don't have anyone to validate my understanding.
Please suggest a better example to understand why Big O is not always the worst case.
Big-O?
First let us see what Big O formally means:
In computer science, big O notation is used to classify algorithms
according to how their running time or space requirements grow as the
input size grows.
This means that Big O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation. Here, O means the order of the function, and it only provides an upper bound on the growth rate of the function.
Now let us look at the rules of Big O:
If f(x) is a sum of several terms and one of them has the largest
growth rate, that term can be kept and all the others omitted
If f(x) is a product of several factors, any constants (terms in the
product that do not depend on x) can be omitted.
Example:
f(x) = 6x^4 − 2x^3 + 5
Using the 1st rule we can write it as f(x) = 6x^4
Using the 2nd rule, this gives us O(x^4)
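A quick numeric check (a throwaway Python sketch, not part of the quoted rules) makes both rules tangible: the ratio f(x)/x^4 approaches the constant 6, so the lower-order terms vanish, and that remaining constant is exactly what O(x^4) then ignores.

    def f(x):
        return 6 * x**4 - 2 * x**3 + 5

    for x in (10, 100, 1_000, 10_000):
        # The ratio approaches the leading coefficient 6 as x grows;
        # Big O then discards that constant as well, leaving O(x^4).
        print(x, f(x) / x**4)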
What is Worst Case?
Worst case analysis gives the maximum number of basic operations that
have to be performed during execution of the algorithm. It assumes
that the input is in the worst possible state and maximum work has to
be done to put things right.
For example, for a sorting algorithm which aims to sort an array in ascending order, the worst case occurs when the input array is in descending order. In this case the maximum number of basic operations (comparisons and assignments) has to be performed to put the array into ascending order.
It depends on a lot of things like:
CPU (time) usage
memory usage
disk usage
network usage
What's the difference?
Big-O is often used to make statements about functions that measure the worst case behavior of an algorithm, but big-O notation doesn’t imply anything of the sort.
The important point here is we're talking in terms of growth, not number of operations. However, with algorithms we do talk about the number of operations relative to the input size.
Big-O is used for making statements about functions. The functions can measure time or space or cache misses or rabbits on an island or anything or nothing. Big-O notation doesn’t care.
In fact, when used for algorithms, big-O is almost never about time. It is about primitive operations.
When someone says that the time complexity of MergeSort is O(n log n), they usually mean that the number of comparisons that MergeSort makes is O(n log n). That in itself doesn't tell us what the time complexity of any particular MergeSort might be, because that would depend on how much time it takes to make a comparison. In other words, the O(n log n) refers to comparisons as the primitive operation.
The important point here is that when big-O is applied to algorithms, there is always an underlying model of computation. The claim that the time complexity of MergeSort is O(n log n) is implicitly referencing a model of computation where a comparison takes constant time and everything else is free.
Example -
If we are sorting strings that are k bytes long, we might take "read a byte" as a primitive operation that takes constant time, with everything else being free.
In this model, MergeSort makes O(n log n) string comparisons, each of which makes O(k) byte comparisons, so the time complexity is O(k·n log n). One common implementation of RadixSort will make k passes over the n strings with each pass reading one byte, and so has time complexity O(nk).
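To make "comparisons as the primitive operation" concrete, here is a small instrumented merge sort (my own Python sketch, not code from the answer): what it counts is element comparisons, not time, and the count grows like n·log2(n).

    import math
    import random

    def merge_sort(items, counter):
        """Standard top-down merge sort; counter[0] counts element comparisons."""
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left = merge_sort(items[:mid], counter)
        right = merge_sort(items[mid:], counter)
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            counter[0] += 1                      # the primitive operation being counted
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    for n in (1_000, 2_000, 4_000):
        counter = [0]
        merge_sort([random.random() for _ in range(n)], counter)
        print(n, counter[0], round(n * math.log2(n)))   # the count tracks n*log2(n)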
The two are not the same thing.  Worst-case analysis, as others have said, is identifying instances for which the algorithm takes the longest to complete (i.e., takes the greatest number of steps), then formulating a growth function using this.  One can analyze the worst-case time complexity using Big-Oh, or even other variants such as Big-Omega and Big-Theta (in fact, Big-Theta is usually what you want, though Big-Oh is often used for ease of comprehension by those not as much into theory).  One important detail, and why worst-case analysis is useful, is that the algorithm will run no slower than it does in the worst case.  Worst-case analysis is a method of analysis we use in analyzing algorithms.
Big-Oh itself is an asymptotic measure of a growth function; this can be totally independent as people can use Big-Oh to not even measure an algorithm's time complexity; its origins stem from Number Theory.  You are correct to say it is the asymptotic upper bound of a growth function; but the manner you prescribe and construct the growth function comes from your analysis.  The Big-Oh of a growth function itself means little to nothing without context as it only says something about the function you are analyzing.  Keep in mind there can be infinitely many algorithms that could be constructed that share the same time complexity (by the definition of Big-Oh, Big-Oh is a set of growth functions).
In short, worst-case analysis is how you build your growth function; Big-Oh notation is one method of analyzing said growth function.  Then, we can compare that result against the worst-case time complexities of competing algorithms for a given problem.  Worst-case analysis, if done exactly, yields the worst-case running time (you can cut a lot of corners and still get the correct asymptotics if you use a barometer operation), and using this growth function yields the worst-case time complexity of the algorithm.  Big-Oh alone doesn't guarantee the worst-case time complexity, since you had to construct the growth function itself.  For instance, I could utilize Big-Oh notation for any other kind of analysis (e.g., best case, average case).  It really depends on what you're trying to capture.  For instance, Big-Omega is great for lower bounds.
Imagine a hypothetical algorithm that in the best case only needs to do 1 step, in the worst case needs to do n^2 steps, but in the average (expected) case only needs to do n steps, with n being the input size.
For each of these 3 cases you could calculate a function that describes the time complexity of this algorithm.
1 The best case has O(1), because the function f(x) = 1 is really the highest we can go, but also the lowest we can go in this case, Ω(1). Since the upper bound and the lower bound coincide, we state that this function, in the best case, behaves like Θ(1).
2 We could do the same analysis for the worst case and figure out that O(n^2) = Ω(n^2) = Θ(n^2).
3 The same goes for the average case, but with Θ(n).
So in theory you could determine 3 cases of an algorithm and for those 3 cases calculate the lower/upper/tight bounds. I hope this clears things up a bit.
https://www.google.co.in/amp/s/amp.reddit.com/r/learnprogramming/comments/3qtgsh/how_is_big_o_not_the_same_as_worst_case_or_big/
Big O notation shows how an algorithm grows with respect to input size. It says nothing of which algorithm is faster because it doesn't account for constant set up time (which can dominate if you have small input sizes). So when you say
which takes O(n) execution steps
this almost doesn't mean anything. Big O doesn't say how many execution steps there are. There are C + O(n) steps (where C is a constant) and this algorithm grows at rate n depending on input size.
Big O can be used for best, worst, or average cases. Let's take sorting as an example. Bubble sort is a naive O(n^2) sorting algorithm, but when the list is already sorted it takes O(n). Quicksort is often used for sorting (the GNU standard C library uses it with some modifications). It performs at O(n log n); however, this is only true if the pivot chosen splits the array into two equal-sized pieces (on average). In the worst case we get an empty array on one side of the pivot, and Quicksort performs at O(n^2).
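A sketch of the quicksort point (illustrative Python, not from the answer; it uses the first element as pivot purely to make the worst case easy to trigger): on a shuffled input the comparison count stays near n·log2(n), while an already-sorted input leaves one partition empty at every step and the count becomes n·(n-1)/2.

    import random

    def quicksort(items, counter):
        """First-element-pivot quicksort; counter[0] models one pivot comparison per partitioned element."""
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        counter[0] += len(rest)              # partitioning compares each remaining element to the pivot
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return quicksort(smaller, counter) + [pivot] + quicksort(larger, counter)

    n = 500
    shuffled = random.sample(range(n), n)
    already_sorted = list(range(n))

    for label, data in (("shuffled", shuffled), ("already sorted", already_sorted)):
        c = [0]
        quicksort(data, c)
        print(label, c[0])   # ~ n*log2(n) when shuffled, n*(n-1)/2 = 124750 when already sorted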
As Big O shows how an algorithm grows with respect to size, you can look at any aspect of an algorithm. Its best case, average case, worst case in both time and/or memory usage. And it tells you how these grow when the input size grows - but it doesn't say which is faster.
If you deal with small sizes then Big O won't matter - but an analysis can tell you how things will go when your input sizes increase.
One example of where the worst case might not be the asymptotic limit: suppose you have an algorithm that works on the set difference between some set and the input. It might run in O(N) time, but get faster as the input gets larger and knocks more values out of the working set.
Or, to get more abstract, f(x) = 1/x for x > 0 is a decreasing O(1) function.
I'll focus on time as a fairly common item of interest, but Big-O can also be used to evaluate resource requirements such as memory. It's essential for you to realize that Big-O tells how the runtime or resource requirements of a problem scale (asymptotically) as the problem size increases. It does not give you a prediction of the actual time required. Predicting the actual runtimes would require us to know the constants and lower order terms in the prediction formula, which are dependent on the hardware, operating system, language, compiler, etc. Using Big-O allows us to discuss algorithm behaviors while sidestepping all of those dependencies.
Let's talk about how to interpret Big-O scalability using a few examples. If a problem is O(1), it takes the same amount of time regardless of the problem size. That may be a nanosecond or a thousand seconds, but in the limit doubling or tripling the size of the problem does not change the time. If a problem is O(n), then doubling or tripling the problem size will (asymptotically) double or triple the amounts of time required, respectively. If a problem is O(n^2), then doubling or tripling the problem size will (asymptotically) take 4 or 9 times as long, respectively. And so on...
Lots of algorithms have different performance for their best, average, or worst cases. Sorting provides some fairly straightforward examples of how best, average, and worst case analyses may differ.
I'll assume that you know how insertion sort works. In the worst case, the list could be reverse ordered, in which case each pass has to move the value currently being considered as far to the left as possible, for all items. That yields O(n^2) behavior. Doubling the list size will take four times as long. More likely, the list of inputs is in randomized order. In that case, on average each item has to move half the distance towards the front of the list. That's less than in the worst case, but only by a constant. It's still O(n^2), so sorting a randomized list that's twice as large as our first randomized list will quadruple the amount of time required, on average. It will be faster than the worst case (due to the constants involved), but it scales in the same way. The best case, however, is when the list is already sorted. In that case, you check each item to see if it needs to be slid towards the front, and immediately find the answer is "no," so after checking each of the n values you're done in O(n) time. Consequently, using insertion sort for an already ordered list that is twice the size only takes twice as long rather than four times as long.
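The insertion-sort figures above are easy to reproduce with a short instrumented sketch (my own Python): doubling n roughly doubles the count for sorted input but quadruples it for random and reversed input.

    import random

    def insertion_sort_comparisons(items):
        a = list(items)
        comparisons = 0
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            while j >= 0:
                comparisons += 1              # compare key with a[j]
                if a[j] <= key:
                    break
                a[j + 1] = a[j]               # slide the larger element one slot right
                j -= 1
            a[j + 1] = key
        return comparisons

    for n in (1_000, 2_000):
        in_order = list(range(n))
        reversed_order = list(range(n, 0, -1))
        shuffled = random.sample(range(n), n)
        print(n,
              insertion_sort_comparisons(in_order),        # ~ n        (best case)
              insertion_sort_comparisons(shuffled),        # ~ n^2 / 4  (average case)
              insertion_sort_comparisons(reversed_order))  # ~ n^2 / 2  (worst case)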
You are right, in that you can certainly say that an algorithm runs in O(f(n)) time in the best or average case. We do that all the time for, say, quicksort, which is O(N log N) on average, but O(N^2) in the worst case.
Unless otherwise specified, however, when you say that an algorithm runs in O(f(n)) time, you are saying the algorithm runs in O(f(n)) time in the worst case. At least that's the way it should be. Sometimes people get sloppy, and you will often hear that a hash table is O(1) when in the worst case it is actually worse.
The other way in which a big O definition can fail to characterize the worst case is that it's an upper bound only. Any function in O(N) is also in O(N^2) and O(2^N), so we would be entirely correct to say that quicksort takes O(2^N) time. We just don't say that because it isn't useful to do so.
Big Omega and Big Theta are there to specify lower bounds and tight bounds, respectively.
There are two "different" and most important tools:
the best, worst, and average-case complexity are for generating numerical function over the size of possible problem instances (e.g. f(x) = 2x^2 + 8x - 4) but it is very difficult to work precisely with these functions
big O notation extract the main point; "how efficient the algorithm is", it ignore a lot of non important things like constants and ... and give you a big picture

Right way to discuss computational complexity for small n

When discussing computational complexity, it seems everyone generally goes straight to Big O.
Let's say for example I have a hybrid algorithm such as merge sort which uses insertion sort for smaller subarrays (I believe this is called tiled merge sort). It's still ultimately merge sort with O(n log n), but I want to discuss the behaviour/characteristics of the algorithm for small n, in cases where no merging actually takes place.
For all intents and purposes the tiled merge sort is insertion sort, executing exactly the same instructions for the domain of my small n. However, Big O deals with the large and asymptotic cases and discussing Big O for small n is pretty much an oxymoron. People have yelled at me for even thinking the words "behaves like an O(n^2) algorithm in such cases". What is the correct way to describe the algorithm's behaviour in cases of small n within the context of formal theoretical computational analysis? To clarify, not just in the case where n is small, but in the case where n is never big.
One might say that for such small n it doesn't matter, but I'm interested in the cases where it does, for example when there is a large constant factor, such as the algorithm being executed many times, and where in practice it would show a clear trend and be the dominant factor. For example, the initial quadratic growth seen in the graph below. I'm not dismissing Big O, more asking for a way to properly tell both sides of the story.
[EDIT]
If for "small n", constants can easily remove all trace of a growth rate then either
only the asymptotic case is discussed, in which case there is less relevance to any practical application, or
there must be a threshold at which we agree n is no longer "small".
What about the cases where n is not "small" (n is sufficiently big that the growth rate will not be affected significantly by any practical constant), but not yet big enough to show the final asymptotic growth rate, so only sub growth rates are visible (for example the shape in the image above)?
Are there no practical algorithms that exhibit this behaviour? Even if there aren't, theoretical discussion should still be possible. Do we measure instead of discussing the theory purely because that's "what one should do"? If some behaviour is observed in all practical cases, why can't there be theory that's meaningful?
Let me turn the question around the other way. I have a graph that shows segmented super-linear steps. It sounds like many people would say "this is a pure coincidence, it could be any shape imaginable" (at the extreme of course) and wouldn't bat an eyelid if it were a sine wave instead. I know in many cases the shape could be hidden by constants, but here it's quite obvious. How can I give a formal explanation of why the graph produces this shape?
I particularly like Sneftel's words "imprecise but useful guidance".
I know Big O and asymptotic analysis isn't applicable. What is? How far can I take it?
For small n, computational complexity - how things change as n increases towards infinity - isn't meaningful, as other effects dominate.
Papers I've seen which discuss behaviour for small values of n do so by measuring the algorithms on real systems, and discuss how the algorithms perform in practice rather than from a theoretical viewpoint. For example, for the graph you've added to your post I would say 'this graph demonstrates an O(N) asymptotic behaviour overall, but the growth within each tile is bounded quadratic'.
I don't know of a situation where a discussion of such behaviour from a theoretical viewpoint would be meaningful - it is well known that for small n the practical effects outweigh the effects of scaling.
It's important to remember that asymptotic analysis is an analytic simplification, not a mandate for analyzing algorithms.  Take selection sort, for instance.  Yes, it executes in O(n^2) time.  But it is also true that it performs precisely n*(n-1)/2 comparisons, and at most n-1 swaps.  Asymptotic analysis is a tool for simplifying the (otherwise generally impractical) task of performance analysis, and one we can put aside if we're not interested in the "really big n" segment.
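Those exact selection-sort figures can be checked with a short instrumented sketch (illustrative Python): every input of length n costs exactly n*(n-1)/2 comparisons, while the swap count depends on the input but never exceeds n-1.

    import random

    def selection_sort_counts(items):
        """Selection sort; returns (comparisons, swaps), skipping self-swaps."""
        a = list(items)
        comparisons = swaps = 0
        n = len(a)
        for i in range(n - 1):
            min_index = i
            for j in range(i + 1, n):
                comparisons += 1              # every pair (i, j) is examined exactly once
                if a[j] < a[min_index]:
                    min_index = j
            if min_index != i:
                a[i], a[min_index] = a[min_index], a[i]
                swaps += 1
        return comparisons, swaps

    n = 200
    for data in (list(range(n)), list(range(n, 0, -1)), random.sample(range(n), n)):
        print(selection_sort_counts(data), "vs n*(n-1)/2 =", n * (n - 1) // 2)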
And you can choose how you express your bounds. Say a function requires precisely n + floor(0.01*2^n) operations. That's exponential time, of course. But one can also say "for data sizes up to 10 elements, this algorithm requires between n and 2*n operations." The latter describes not the shape of the curve, but an envelope around that curve, giving imprecise but useful guidance about the practicalities of the algorithm within a particular range of use cases.
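And that envelope claim is a two-line check (a throwaway sketch): for every n from 1 to 10, n + floor(0.01*2^n) lands between n and 2n.

    import math

    for n in range(1, 11):
        ops = n + math.floor(0.01 * 2 ** n)
        assert n <= ops <= 2 * n              # the envelope holds for every size up to 10
        print(n, ops)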
You are right.
For small n, i.e. when only insertion sort is performed, the asymptotic behavior is quadratic O(n^2).
And for larger n, when tiled merge sort enters into play, the behavior switches to O(n log n).
There is no contradiction if you remember that every behavior has its domain of validity: before the switching threshold, call it N, and after it.
In practice there will be a smooth blend between the curves around N. But in practice, too, that value of N is so small that the quadratic behavior does not have enough "room" to manifest itself.
Another way to deal with this analysis is to say that, N being a constant, the insertion sorts take constant time. But I would not say that this is a must.
Let's unpack things a bit. Big-O is a tool for describing the growth rate of a function. One of the functions to which it is commonly applied is the worst-case running time of an algorithm on inputs of length n, executing on a particular abstract machine. We often forget about the last part because there is a large class of machines with random-access memory that can emulate one another with only constant-factor slowdown, and the class of problems solvable within a particular big-O running-time bound is equivalent across these machines.
If you want to talk about complexity on small inputs, then you need to measure constant factors. One way is to measure running times of actual implementations. People usually do this on physical machines, but if you're hardcore like Knuth, you invent your own assembly language complete with instruction timings. Another way is to measure something that's readily identifiable but also a useful proxy for the other work performed. For comparison sorts, this could be comparisons. For numerical algorithms, this is often floating-point operations.
Complexity is not about execution time for one n on one machine, so there is no need to consider it even if the constant is large. Complexity tells you how the size of the input affects execution time. For small n, you can treat execution time as constant. That is one side.
From the other side you are saying that:
You have a hybrid algorithm working in O(n log n) for n larger than some k and O(n^2) for n smaller than k.
The constant k is so large that the algorithm works slowly.
There is no sense in such an algorithm, because you could easily improve it.
Let's take a red-black tree. Operations on this tree are performed in O(log n) time complexity, but there is a large constant. So, on normal machines, it could work slowly (i.e. slower than simpler solutions) in some cases. There is no need to consider this when analyzing complexity. You do need to consider it when you are implementing it in your system: you need to check whether it's the best choice considering the actual machine(s) on which it will run and the problems it will be solving.
Read Knuth's The Art of Computer Programming series, starting with Volume 1: Fundamental Algorithms, section 1.2.10, "Analysis of an Algorithm". There he shows (and in all the rest of his seminal work) how exact analysis can be conducted for arbitrary problem sizes, using a suitable reference machine, by taking a detailed census of every processor instruction.
Such analyses have to take into account not only the problem size, but also any relevant aspect of the input data distribution which will influence the running time. For simplification, the analyses are often limited to the study of the worst case, the expected case or the output-sensitive case, rather than a general statistical characterization. And for further simplification, asymptotic analysis is used.
Not counting the fact that, except for trivial algorithms, the detailed approach is mathematically highly challenging, it has become unrealistic on modern machines. Indeed, it relies on a processor behavior similar to the so-called RAM model, which assumes constant time per instruction and per memory access (http://en.wikipedia.org/wiki/Random-access_machine). Except maybe for special hardware combined with a hard real-time OS, these assumptions are nowadays completely wrong.
When you have an algorithm with a time complexity of, say, O(n^2), and another algorithm with a time complexity of, say, O(n), you cannot conclude from these two complexities that the latter algorithm is faster than the former for all input values. You can only say the latter algorithm is asymptotically (meaning, for sufficiently large input values) faster than the former. Here you have to keep in mind that in asymptotic notation constant factors are generally ignored, to make the time complexity of the algorithm easier to understand. As an example: merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time, but as the hidden constant factors in insertion sort are smaller than those of merge sort, in practice insertion sort can be faster than merge sort for small problem sizes on many machines.
Asymptotic notation does not describe the actual running time of an algorithm. The actual running time is machine-dependent, since different machines have different architectures and different instruction cycle times. Asymptotic notation just describes, asymptotically, how fast an algorithm is with respect to other algorithms. It does not describe the behavior of the algorithm for small input values (n <= n0). The value of n0 (the threshold) depends on the hidden constant factors and lower-order terms, and the hidden constant factors depend on the machine on which the algorithm will be executed.

What is the purpose of Big-O notation in computer science if it doesn't give all the information needed?

What is the use of Big-O notation in computer science if it doesn't give all the information needed?
For example, if one algorithm runs at 1000n and one at n, it is true that they are both O(n). But I still may make a foolish choice based on this information, since one algorithm takes 1000 times as long as the other for any given input.
I still need to know all the parts of the equation, including the constant, to make an informed choice, so what is the importance of this "intermediate" comparison? I end up losing important information when it gets reduced to this form, and what do I gain?
What does that constant factor represent? You can't say with certainty, for example, that an algorithm that is O(1000n) will be slower than an algorithm that's O(5n). It might be that the 1000n algorithm loads all data into memory and makes 1,000 passes over that data, and the 5n algorithm makes five passes over a file that's stored on a slow I/O device. The 1000n algorithm will run faster even though its "constant" is much larger.
In addition, some computers perform some operations more quickly than other computers do. It's quite common, given two O(n) algorithms (call them A and B), for A to execute faster on one computer and B to execute faster on the other computer. Or two different implementations of the same algorithm can have widely varying runtimes on the same computer.
Asymptotic analysis, as others have said, gives you an indication of how an algorithm's running time varies with the size of the input. It's useful for giving you a good starting place in algorithm selection. Quick reference will tell you that a particular algorithm is O(n) or O(n log n) or whatever, but it's very easy to find more detailed information on most common algorithms. Still, that more detailed analysis will only give you a constant number without saying how that number relates to real running time.
In the end, the only way you can determine which algorithm is right for you is to study it yourself and then test it against your expected data.
In short, I think you're expecting too much from asymptotic analysis. It's a useful "first line" filter. But when you get beyond that you have to look for more information.
As you correctly noted, it does not give you information on the exact running time of an algorithm. It is mainly used to indicate the complexity of an algorithm, to indicate whether it is linear in the input size, quadratic, exponential, etc. This is important when choosing between algorithms if you know that your input size is large, since even a 1000n algorithm will beat a 1.23·exp(n) algorithm for large enough n.
In real-world algorithms, the hidden 'scaling factor' is of course important. It is therefore not uncommon to use an algorithm with a 'worse' complexity if it has a lower scaling factor. Many practical implementations of sorting algorithms are for example 'hybrid' and will resort to some 'bad' algorithm like insertion sort (which is O(n^2) but very simple to implement) for n < 10, while changing to quicksort (which is O(n log n) but more complex) for n >= 10, as sketched below.
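For illustration, such a hybrid could look like the sketch below (my own Python; the threshold of 10 mirrors the n < 10 / n >= 10 split described above, and the middle-element pivot is just one simple choice, not how any particular library does it):

    import random

    THRESHOLD = 10                      # assumed cut-over point, matching the split above

    def insertion_sort(a):
        """Simple in-place insertion sort for short lists."""
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            while j >= 0 and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return a

    def hybrid_sort(a):
        """Quicksort that hands short sublists over to insertion sort."""
        if len(a) < THRESHOLD:
            return insertion_sort(list(a))
        pivot = a[len(a) // 2]
        smaller = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        larger = [x for x in a if x > pivot]
        return hybrid_sort(smaller) + equal + hybrid_sort(larger)

    data = random.sample(range(1_000), 1_000)
    assert hybrid_sort(data) == sorted(data)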
Big-O tells you how the runtime or memory consumption of a process changes as the size of its input changes. O(n) and O(1000n) are both still O(n) -- if you double the size of the input, then for all practical purposes the runtime doubles too.
Now, we can have an O(n) algorithm and an O(n^2) algorithm where the coefficient of n is 1000000 and the coefficient of n^2 is 1, in which case the O(n^2) algorithm would outperform the O(n) algorithm for smaller n values. This doesn't change the fact, however, that the second algorithm's runtime grows more rapidly than the first's, and this is the information that big-O tells us. There will be some input size at which the O(n) algorithm begins to outperform the O(n^2) algorithm.
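With those concrete coefficients, the crossover point is easy to locate (a throwaway Python sketch): the quadratic algorithm is cheaper only while n stays below 1,000,000.

    def linear_cost(n):        # hypothetical O(n) algorithm with coefficient 1,000,000
        return 1_000_000 * n

    def quadratic_cost(n):     # hypothetical O(n^2) algorithm with coefficient 1
        return n * n

    for n in (1_000, 100_000, 999_999, 1_000_001, 2_000_000):
        cheaper = "quadratic" if quadratic_cost(n) < linear_cost(n) else "linear"
        print(n, cheaper)      # the quadratic algorithm is cheaper only below n = 1,000,000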
In addition to the hidden impact of the constant term, complexity notation also only considers the worst case instance of a problem.
Case in point, the simplex method (linear programming) has exponential complexity for all known implementations. However, the simplex method works much faster in practice than the provably polynomial-time interior point methods.
Complexity notation has much value for theoretical problem classification. If you want some more information on practical consequences check out "Smoothed Analysis" by Spielman: http://www.cs.yale.edu/homes/spielman
This is what you are looking for.
Its main purpose is for rough comparisons of logic. The difference between O(n) and O(1000n) is large for n ~ 1000 (n roughly equal to 1000) and n < 1000, but when you compare it to values where n >> 1000 (n much larger than 1000) the difference is minuscule.
You are right in saying they both scale linearly, and knowing the coefficient helps in a detailed analysis, but generally in computing the difference between linear (O(cn)) and polynomial (O(cn^x)) or exponential performance is more important to note than the difference between two linear times. There is larger value in comparing runtimes of higher orders, where the performance difference grows far more dramatically.
The overall purpose of Big O notation is to give a sense of relative performance time in order to compare and further optimize algorithms.
You're right that it doesn't give you all information, but there's no single metric in any field that does that.
Big-O notation tells you how quickly the performance gets worse, as your dataset gets larger. In other words, it describes the type of performance curve, but not the absolute performance.
Generally, Big-O notation is useful to express an algorithm's scaling performance as it falls into one of three basic categories:
Linear
Logarithmic (or "linearithmic")
Exponential
It is possible to do deep analysis of an algorithm for very accurate performance measurements, but it is time consuming and not really necessary to get a broad indication of performance.

Expected running time vs. worst-case running time

I am studying the randomized-quicksort algorithm. I realized that the running time of this algorithm is always represented as "expected running time".
What is the reason for specifying or using the "expected running time"? Why don't we calculate the worst- or average-case?
Sometimes the expected running time means the average running time for a randomly chosen input. But if it's a randomized algorithm, then what is often meant is the expected running time with respect to the random choices made by the algorithm, for every input. That is what is meant here. For every input of n items, randomized quicksort runs in time O(n log n) on average, averaged only over its coin flips.
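The "averaged only over its coin flips" idea can be made concrete with a sketch (illustrative Python, not from the answer): the input is held fixed, and the comparison count is averaged over runs that differ only in the algorithm's random pivot choices.

    import random

    def randomized_quicksort(items, counter):
        """Quicksort with a uniformly random pivot; counter[0] models pivot comparisons."""
        if len(items) <= 1:
            return items
        pivot = items[random.randrange(len(items))]   # the algorithm's coin flip
        counter[0] += len(items) - 1                  # partitioning compares every other element to the pivot
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        larger = [x for x in items if x > pivot]
        return (randomized_quicksort(smaller, counter) + equal +
                randomized_quicksort(larger, counter))

    n = 1_000
    fixed_input = list(range(n))                      # one fixed input; only the coin flips vary
    counts = []
    for _ in range(200):
        c = [0]
        randomized_quicksort(fixed_input, c)
        counts.append(c[0])
    print(min(counts), sum(counts) / len(counts), max(counts))
    # The average lands around 11,000 for n = 1,000, i.e. Theta(n log n),
    # even though any single run could in principle approach n*(n-1)/2.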
In this limited sense, expected running time is a very defensible measure of the running time of a randomized algorithm. If you're only worried about the worst that could happen when the algorithm flips coins internally, then why bother flipping coins at all? You might as well just make them all heads. (Which in the case of randomized quicksort, would reduce it to ordinary quicksort.)
Average case vs worst case is a much more serious matter when it's an average over inputs rather than an average over coin flips. In this case, the average running time is at best a figure that is not adaptable to changes in the type of input --- different uses of the algorithm could have different distributions of inputs. I say at best because you may not know that the hypothetical distribution of inputs is ever the true usage.
Looking at the worst case with respect to coin flips only makes sense when you would both like to run quickly when your coin flips aren't unlucky, and not run too slowly even when your coin flips are unlucky. For instance, imagine that you need a sorting algorithm for the regulator for an oxygen supply (for a medical patient or a scuba diver). Then randomized quicksort would only make sense if you both want the result to be very fast usually, for the user's convenience, AND if the worst case wouldn't suffocate the user no matter what. This is a contrived scenario for sorting algorithms, because there are non-random sorting algorithms (like merge sort) that are not much slower than quicksort is, even on average. It is less contrived for a problem like primality testing, where randomized algorithms are much faster than non-randomized algorithms. Then you might want to make a run for it with a randomized algorithm --- while at the same time running a non-randomized algorithm as a backup.
(Okay, you might wonder why an oxygen regulator would want to know whether particular numbers are prime. Maybe it needs to communicate with a medical computer network, and the communication needs to be secure for medical privacy reasons...)
When we say "expected running time", we are talking about the running time for the average case. We might be talking about an asymptotically upper or lower bound (or both). Similarly, we can talk about asymptotically upper and lower bounds on the running time of the best or worst cases. In other words, the bound is orthogonal to the case.
In the case of randomized quicksort, people talk about the expected running time (O(n log n)) since this makes the algorithm seem better than worst-case O(n^2) algorithms (which it is, though not asymptotically in the worst case). In other words, randomized quicksort is asymptotically much faster than e.g. Bubblesort for almost all inputs, and people want a way to make that fact clear; so people emphasize the average-case running time of randomized quicksort, rather than the fact that it is asymptotically as bad as Bubblesort in the worst case.
As pointed out in the comments and in Greg's excellent answer, it may also be common to speak of the expected runtime with respect to the set of sequences of random choices made during the algorithm's execution on a fixed, arbitrary input. This may be more natural since we think of the inputs as being passively acted upon by an active algorithm. In fact, it is equivalent to speak of an average over random inputs and an algorithm whose execution does not consider structural differences. Both of these formulations are easier to visualize than the true average over the set of pairs of inputs and random choices, but you do get the same answers no matter which way you approach it.
An algorithm is randomized if its behavior is determined not only by its input but also by values produced by a random-number generator.
That is why you analyze what is expected.
A worst case analysis is on the input only.
A bit late, and it's more of a long comment, but it's something I think is important to add. Any algorithm with expected time T can become a worst-case O(T) algorithm. Markov's inequality tells us that if the expected time is T, then with probability at least 1/2 the algorithm will take less than 2T operations, so we can just run the algorithm and, if it takes more than 2T operations, stop and re-run it. Doing this at most log(1/delta) times gives us a probability of failure of at most delta. So we get a time complexity of O(T*log(1/delta)) with a probability of failure delta. But since log is so small, this is for all practical purposes an O(T) algorithm with probability of failure 0. For example, if we choose delta to be the probability that 2 randomly selected atoms from the observable universe are the same atom, we get log(1/delta) ≈ 260, so we can just say we get O(T) with 0 probability of failure.
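Here is what that restart trick looks like as a sketch (illustrative Python; randomized_algorithm is a made-up stand-in with an expected cost of about 10 steps, not anything from the post): each attempt gets a budget of 2T steps, and we retry at most ceil(log2(1/delta)) times.

    import math
    import random

    def run_with_restarts(algorithm, expected_steps, delta):
        """Turn an expected-time bound into a high-probability worst-case bound by restarting."""
        budget = 2 * expected_steps                  # Markov: exceeded with probability <= 1/2
        attempts = math.ceil(math.log2(1 / delta))   # failure probability <= (1/2)**attempts <= delta
        for _ in range(attempts):
            result = algorithm(max_steps=budget)
            if result is not None:                   # finished within the step budget
                return result
        return None                                  # every attempt ran over budget (probability <= delta)

    def randomized_algorithm(max_steps):
        """Made-up stand-in: each step succeeds with probability 0.1, so ~10 steps expected."""
        steps = 0
        while True:
            steps += 1
            if steps > max_steps:
                return None                          # over budget; the caller will restart
            if random.random() < 0.1:
                return "done"

    print(run_with_restarts(randomized_algorithm, expected_steps=10, delta=1e-9))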

Difference between average case and amortized analysis

I am reading an article on amortized analysis of algorithms. The following is a text snippet.
Amortized analysis is similar to average-case analysis in that it is
concerned with the cost averaged over a sequence of operations.
However, average case analysis relies on probabilistic assumptions
about the data structures and operations in order to compute an
expected running time of an algorithm. Its applicability is therefore
dependent on certain assumptions about the probability distribution of
algorithm inputs.
An average case bound does not preclude the possibility that one will
get “unlucky” and encounter an input that requires more-than-expected
time even if the assumptions for probability distribution of inputs are
valid.
My questions about above text snippet are:
In the first paragraph, how does average-case analysis “rely on probabilistic assumptions about data structures and operations?” I know average-case analysis depends on probability of input, but what does the above statement mean?
What does the author mean in the second paragraph that average case is not valid even if the input distribution is valid?
Thanks!
Average case analysis makes assumptions about the input that may not be met in certain cases. Therefore, if your input is not random, in the worst case the actual performace of an algorithm may be much slower than the average case.
Amortized analysis makes no such assumptions, but it considers the total performance of a sequence of operations instead of just one operation.
Dynamic array insertion provides a simple example of amortized analysis. One algorithm is to allocate a fixed-size array, and as new elements are inserted, allocate a new array of double the old length when necessary. In the worst case an insertion can require time proportional to the length of the entire list, so in the worst case insertion is an O(n) operation. However, you can guarantee that such a worst case is infrequent, so insertion is an O(1) operation using amortized analysis. Amortized analysis holds no matter what the input is.
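A sketch of that doubling strategy (illustrative Python, not any particular library's list implementation), counting element copies: one unlucky append costs O(n) copying, yet the total copying stays under 2 copies per element, which is the amortized O(1).

    class DynamicArray:
        def __init__(self):
            self.capacity = 1
            self.size = 0
            self.slots = [None] * self.capacity
            self.copies = 0                          # element copies performed during resizes

        def append(self, value):
            if self.size == self.capacity:           # full: allocate double and copy everything over
                self.capacity *= 2
                new_slots = [None] * self.capacity
                for i in range(self.size):
                    new_slots[i] = self.slots[i]
                    self.copies += 1
                self.slots = new_slots
            self.slots[self.size] = value
            self.size += 1

    arr = DynamicArray()
    n = 100_000
    for i in range(n):
        arr.append(i)
    print(arr.copies, arr.copies / n)                # total copies < 2n, i.e. amortized O(1) per append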
To get the average-case time complexity, you need to make assumptions about what the "average case" is. If inputs are strings, what's the "average string"? Does only length matter? If so, what is the average length of strings I will get? If not, what is the average character(s) in these strings? It becomes difficult to answer these questions definitively if the strings are, for instance, last names. What is the average last name?
In most interesting statistical samples, the maximum value is greater than the mean. This means that your average case analysis will sometimes underestimate the time/resources needed for certain inputs (which are problematic). If you think about it, for a symmetrical PDF, average case analysis should underestimate as much as it overestimates. Worst case analysis, OTOH, considers only the most problematic case(s), and so is guaranteed to overestimate.
Consider searching for a given value in an unsorted array. Maybe you know that it has O(n) running time, but if we want to be more precise it does about n/2 comparisons in the average case. Why? Because we are making an assumption about the data: we are assuming that the sought value can be in any position with the same probability.
If we change this assumption, and say for example that the probability of the value being in position i increases with i, we could prove a different comparison count, even a different asymptotic bound.
In the second paragraph the author says that with average-case analysis we can be very unlucky and measure an average greater than the theoretical one. Recalling the previous example: if we are unlucky on m different arrays of size n, and the value is in the last position every time, then we'll measure an average of n comparisons and not n/2. This just can't happen when an amortized bound is proven.
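A quick simulation of that assumption (throwaway Python): with the sought value equally likely to be in any position, the measured average is about (n+1)/2 comparisons; force the value into the last position every time and the measured "average" becomes n.

    import random

    def linear_search_comparisons(items, target):
        """Scan left to right; return how many elements were examined."""
        for count, value in enumerate(items, start=1):
            if value == target:
                return count
        return len(items)

    n = 1_001
    data = list(range(n))
    trials = 20_000

    uniform = sum(linear_search_comparisons(data, random.randrange(n))
                  for _ in range(trials)) / trials
    unlucky = linear_search_comparisons(data, n - 1)

    print(uniform)   # about (n + 1) / 2 = 501 under the uniform-position assumption
    print(unlucky)   # n = 1001 when the value always sits in the last position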
