Selection Algorithm Runtime

Selection Algorithm Runtime - algorithm

I am trying to figure out the most optimal way to compute a top-k query on some aggregation of data, lets say an array. I used to think the best way was to run through the array and maintain a heap or balanced binary tree of size k, leveraging that to compute the top-k value. Now, I have run across the Selection Algorithm which supposedly runs even faster. I understand how the Selection Algorithm works and how to implement it, I am just a little confused as to how it runs in O(n). I feel like in order for it to run in O(n) you would have to be extremely lucky. If you keep picking a random pivot point and partitioning around it, it could very well be the case that you just end up basically sorting almost the entire array before stumbling upon your kth index. Are there any optimizations such as maybe not picking a random pivot? Or is my maintaining a heap/tree method good enough for most cases.

What you're talking about there is quickselect, also known as Hoare's selection algorithm.
It does have O(n) average case performance, but its worst-case performance is O(n2).
Like quicksort, the quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic: O(n2).
In terms of choosing pivots:
The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm.
One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect.
(quotes from Wikipedia)
So you're fairly likely to get O(n) performance with random pivots, but, if k is small and n is large, or if you're just unlikely, the O(n log k) solution using a size k heap or BST could outperform this.
We can't tell you with certainty which one will be faster when - it depends on (1) the exact implementations, (2) the machine it's run on, (3) the exact sizes of n and k and finally (4) the actual data. The O(n log k) solution should be sufficient for most purposes.

Related

Quick select with random pick index or with median of medians?

To avoid the O(n^2) worst case scenario for quick select, I am aware of 2 options:
Randomly choose a pivot index
Use median of medians (MoM) to select an approximate median and pivot around that
When using MoM with quick select, we can guarantee worst case O(n). When using (1), we can't guarantee worst case O(n), but the probability of the algorithm going to O(n^2) should be extremely small. The overhead cost of (2) is much more than (1), where the latter adds little to no additional complexity.
So when should we use one over the other?

As you've noted, the median-of-medians approach is slower than quickselect, but has a better worst-case runtime. Assuming quickselect is truly using a random choice of pivot at each step, you can prove that not only is the expected runtime O(n), but that the probability that its runtime exceeds Θ(n log n) is very, very small (at most 1 / nk for any choice of constant k). So in that sense, if you have the ability to select pivots at random, quickselect will likely be faster.
However, not all implementations of quickselect use true randomness for the pivots, and some use deterministic pivot selection algorithms. This, unfortunately, can lead to pathological inputs that trigger the Θ(n2) worst-case runtime, which is a problem if you have adversarially-chosen inputs.
Once nice compromise between the two is introselect. The basic idea behind introselect is to use quickselect with a deterministic pivot selection algorithm. As the algorithm is running, it keeps track of how many times it's picked a pivot without throwing away at least 30% the input array. If that number exceeds some threshold, it stops using a random pivot choice and switches to the median-of-medians approach to select a good pivot, forcing a 30% size reduction. This approach means that in the common case when quickselect rapidly reduces the input size, introselect is basically identical to quickselect with a tiny bookkeeping overhead. However, in cases where quickselect would degrade to quadratic, introselect stops and switches to the worst-case efficient median-of-medians approach, ensuring the worst-case runtime is O(n). This gives you, essentially, the best of both worlds - it's fast on average, and its worst-case is never worse than O(n).

Are there any cases where you would prefer a higher big-O time complexity algorithm over the lower one?

Are there are any cases where you would prefer O(log n) time complexity to O(1) time complexity? Or O(n) to O(log n)?
Do you have any examples?

There can be many reasons to prefer an algorithm with higher big O time complexity over the lower one:
most of the time, lower big-O complexity is harder to achieve and requires skilled implementation, a lot of knowledge and a lot of testing.
big-O hides the details about a constant: algorithm that performs in 10^5 is better from big-O point of view than 1/10^5 * log(n) (O(1) vs O(log(n)), but for most reasonable n the first one will perform better. For example the best complexity for matrix multiplication is O(n^2.373) but the constant is so high that no (to my knowledge) computational libraries use it.
big-O makes sense when you calculate over something big. If you need to sort array of three numbers, it matters really little whether you use O(n*log(n)) or O(n^2) algorithm.
sometimes the advantage of the lowercase time complexity can be really negligible. For example there is a data structure tango tree which gives a O(log log N) time complexity to find an item, but there is also a binary tree which finds the same in O(log n). Even for huge numbers of n = 10^20 the difference is negligible.
time complexity is not everything. Imagine an algorithm that runs in O(n^2) and requires O(n^2) memory. It might be preferable over O(n^3) time and O(1) space when the n is not really big. The problem is that you can wait for a long time, but highly doubt you can find a RAM big enough to use it with your algorithm
parallelization is a good feature in our distributed world. There are algorithms that are easily parallelizable, and there are some that do not parallelize at all. Sometimes it makes sense to run an algorithm on 1000 commodity machines with a higher complexity than using one machine with a slightly better complexity.
in some places (security) a complexity can be a requirement. No one wants to have a hash algorithm that can hash blazingly fast (because then other people can bruteforce you way faster)
although this is not related to switch of complexity, but some of the security functions should be written in a manner to prevent timing attack. They mostly stay in the same complexity class, but are modified in a way that it always takes worse case to do something. One example is comparing that strings are equal. In most applications it makes sense to break fast if the first bytes are different, but in security you will still wait for the very end to tell the bad news.
somebody patented the lower-complexity algorithm and it is more economical for a company to use higher complexity than to pay money.
some algorithms adapt well to particular situations. Insertion sort, for example, has an average time-complexity of O(n^2), worse than quicksort or mergesort, but as an online algorithm it can efficiently sort a list of values as they are received (as user input) where most other algorithms can only efficiently operate on a complete list of values.

There is always the hidden constant, which can be lower on the O(log n) algorithm. So it can work faster in practice for real-life data.
There are also space concerns (e.g. running on a toaster).
There's also developer time concern - O(log n) may be 1000× easier to implement and verify.

I'm surprised nobody has mentioned memory-bound applications yet.
There may be an algorithm that has less floating point operations either due to its complexity (i.e. O(1) < O(log n)) or because the constant in front of the complexity is smaller (i.e. 2n2 < 6n2). Regardless, you might still prefer the algorithm with more FLOP if the lower FLOP algorithm is more memory-bound.
What I mean by "memory-bound" is that you are often accessing data that is constantly out-of-cache. In order to fetch this data, you have to pull the memory from your actually memory space into your cache before you can perform your operation on it. This fetching step is often quite slow - much slower than your operation itself.
Therefore, if your algorithm requires more operations (yet these operations are performed on data that is already in cache [and therefore no fetching required]), it will still out-perform your algorithm with fewer operations (which must be performed on out-of-cache data [and therefore require a fetch]) in terms of actual wall-time.

In contexts where data security is a concern, a more complex algorithm may be preferable to a less complex algorithm if the more complex algorithm has better resistance to timing attacks.

Alistra nailed it but failed to provide any examples so I will.
You have a list of 10,000 UPC codes for what your store sells. 10 digit UPC, integer for price (price in pennies) and 30 characters of description for the receipt.
O(log N) approach: You have a sorted list. 44 bytes if ASCII, 84 if Unicode. Alternately, treat the UPC as an int64 and you get 42 & 72 bytes. 10,000 records--in the highest case you're looking at a bit under a megabyte of storage.
O(1) approach: Don't store the UPC, instead you use it as an entry into the array. In the lowest case you're looking at almost a third of a terabyte of storage.
Which approach you use depends on your hardware. On most any reasonable modern configuration you're going to use the log N approach. I can picture the second approach being the right answer if for some reason you're running in an environment where RAM is critically short but you have plenty of mass storage. A third of a terabyte on a disk is no big deal, getting your data in one probe of the disk is worth something. The simple binary approach takes 13 on average. (Note, however, that by clustering your keys you can get this down to a guaranteed 3 reads and in practice you would cache the first one.)

Consider a red-black tree. It has access, search, insert, and delete of O(log n). Compare to an array, which has access of O(1) and the rest of the operations are O(n).
So given an application where we insert, delete, or search more often than we access and a choice between only these two structures, we would prefer the red-black tree. In this case, you might say we prefer the red-black tree's more cumbersome O(log n) access time.
Why? Because the access is not our overriding concern. We are making a trade off: the performance of our application is more heavily influenced by factors other than this one. We allow this particular algorithm to suffer performance because we make large gains by optimizing other algorithms.
So the answer to your question is simply this: when the algorithm's growth rate isn't what we want to optimize, when we want to optimize something else. All of the other answers are special cases of this. Sometimes we optimize the run time of other operations. Sometimes we optimize for memory. Sometimes we optimize for security. Sometimes we optimize maintainability. Sometimes we optimize for development time. Even the overriding constant being low enough to matter is optimizing for run time when you know the growth rate of the algorithm isn't the greatest impact on run time. (If your data set was outside this range, you would optimize for the growth rate of the algorithm because it would eventually dominate the constant.) Everything has a cost, and in many cases, we trade the cost of a higher growth rate for the algorithm to optimize something else.

Yes.
In a real case, we ran some tests on doing table lookups with both short and long string keys.
We used a std::map, a std::unordered_map with a hash that samples at most 10 times over the length of the string (our keys tend to be guid-like, so this is decent), and a hash that samples every character (in theory reduced collisions), an unsorted vector where we do a == compare, and (if I remember correctly) an unsorted vector where we also store a hash, first compare the hash, then compare the characters.
These algorithms range from O(1) (unordered_map) to O(n) (linear search).
For modest sized N, quite often the O(n) beat the O(1). We suspect this is because the node-based containers required our computer to jump around in memory more, while the linear-based containers did not.
O(lg n) exists between the two. I don't remember how it did.
The performance difference wasn't that large, and on larger data sets the hash-based one performed much better. So we stuck with the hash-based unordered map.
In practice, for reasonable sized n, O(lg n) is O(1). If your computer only has room for 4 billion entries in your table, then O(lg n) is bounded above by 32. (lg(2^32)=32) (in computer science, lg is short hand for log based 2).
In practice, lg(n) algorithms are slower than O(1) algorithms not because of the logarithmic growth factor, but because the lg(n) portion usually means there is a certain level of complexity to the algorithm, and that complexity adds a larger constant factor than any of the "growth" from the lg(n) term.
However, complex O(1) algorithms (like hash mapping) can easily have a similar or larger constant factor.

The possibility to execute an algorithm in parallel.
I don't know if there is an example for the classes O(log n) and O(1), but for some problems, you choose an algorithm with a higher complexity class when the algorithm is easier to execute in parallel.
Some algorithms cannot be parallelized but have so low complexity class. Consider another algorithm which achieves the same result and can be parallelized easily, but has a higher complexity class. When executed on one machine, the second algorithm is slower, but when executed on multiple machines, the real execution time gets lower and lower while the first algorithm cannot speed up.

Let's say you're implementing a blacklist on an embedded system, where numbers between 0 and 1,000,000 might be blacklisted. That leaves you two possible options:
Use a bitset of 1,000,000 bits
Use a sorted array of the blacklisted integers and use a binary search to access them
Access to the bitset will have guaranteed constant access. In terms of time complexity, it is optimal. Both from a theoretical and from a practical point view (it is O(1) with an extremely low constant overhead).
Still, you might want to prefer the second solution. Especially if you expect the number of blacklisted integers to be very small, as it will be more memory efficient.
And even if you do not develop for an embedded system where memory is scarce, I just can increase the arbitrary limit of 1,000,000 to 1,000,000,000,000 and make the same argument. Then the bitset would require about 125G of memory. Having a guaranteed worst-case complexitity of O(1) might not convince your boss to provide you such a powerful server.
Here, I would strongly prefer a binary search (O(log n)) or binary tree (O(log n)) over the O(1) bitset. And probably, a hash table with its worst-case complexity of O(n) will beat all of them in practice.

My answer here Fast random weighted selection across all rows of a stochastic matrix is an example where an algorithm with complexity O(m) is faster than one with complexity O(log(m)), when m is not too big.

A more general question is if there are situations where one would prefer an O(f(n)) algorithm to an O(g(n)) algorithm even though g(n) << f(n) as n tends to infinity. As others have already mentioned, the answer is clearly "yes" in the case where f(n) = log(n) and g(n) = 1. It is sometimes yes even in the case that f(n) is polynomial but g(n) is exponential. A famous and important example is that of the Simplex Algorithm for solving linear programming problems. In the 1970s it was shown to be O(2^n). Thus, its worse-case behavior is infeasible. But -- its average case behavior is extremely good, even for practical problems with tens of thousands of variables and constraints. In the 1980s, polynomial time algorithms (such a Karmarkar's interior-point algorithm) for linear programming were discovered, but 30 years later the simplex algorithm still seems to be the algorithm of choice (except for certain very large problems). This is for the obvious reason that average-case behavior is often more important than worse-case behavior, but also for a more subtle reason that the simplex algorithm is in some sense more informative (e.g. sensitivity information is easier to extract).

People have already answered your exact question, so I'll tackle a slightly different question that people may actually be thinking of when coming here.
A lot of the "O(1) time" algorithms and data structures actually only take expected O(1) time, meaning that their average running time is O(1), possibly only under certain assumptions.
Common examples: hashtables, expansion of "array lists" (a.k.a. dynamically sized arrays/vectors).
In such scenarios, you may prefer to use data structures or algorithms whose time is guaranteed to be absolutely bounded logarithmically, even though they may perform worse on average.
An example might therefore be a balanced binary search tree, whose running time is worse on average but better in the worst case.

To put my 2 cents in:
Sometimes a worse complexity algorithm is selected in place of a better one, when the algorithm runs on a certain hardware environment. Suppose our O(1) algorithm non-sequentially accesses every element of a very big, fixed-size array to solve our problem. Then put that array on a mechanical hard drive, or a magnetic tape.
In that case, the O(logn) algorithm (suppose it accesses disk sequentially), becomes more favourable.

There is a good use case for using a O(log(n)) algorithm instead of an O(1) algorithm that the numerous other answers have ignored: immutability. Hash maps have O(1) puts and gets, assuming good distribution of hash values, but they require mutable state. Immutable tree maps have O(log(n)) puts and gets, which is asymptotically slower. However, immutability can be valuable enough to make up for worse performance and in the case where multiple versions of the map need to be retained, immutability allows you to avoid having to copy the map, which is O(n), and therefore can improve performance.

Simply: Because the coefficient - the costs associated with setup, storage, and the execution time of that step - can be much, much larger with a smaller big-O problem than with a larger one. Big-O is only a measure of the algorithms scalability.
Consider the following example from the Hacker's Dictionary, proposing a sorting algorithm relying on the Multiple Worlds Interpretation of Quantum Mechanics:
Permute the array randomly using a quantum process,
If the array is not sorted, destroy the universe.
All remaining universes are now sorted [including the one you are in].
(Source: http://catb.org/~esr/jargon/html/B/bogo-sort.html)
Notice that the big-O of this algorithm is O(n), which beats any known sorting algorithm to date on generic items. The coefficient of the linear step is also very low (since it's only a comparison, not a swap, that is done linearly). A similar algorithm could, in fact, be used to solve any problem in both NP and co-NP in polynomial time, since each possible solution (or possible proof that there is no solution) can be generated using the quantum process, then verified in polynomial time.
However, in most cases, we probably don't want to take the risk that Multiple Worlds might not be correct, not to mention that the act of implementing step 2 is still "left as an exercise for the reader".

At any point when n is bounded and the constant multiplier of O(1) algorithm is higher than the bound on log(n). For example, storing values in a hashset is O(1), but may require an expensive computation of a hash function. If the data items can be trivially compared (with respect to some order) and the bound on n is such that log n is significantly less than the hash computation on any one item, then storing in a balanced binary tree may be faster than storing in a hashset.

In a realtime situation where you need a firm upper bound you would select e.g. a heapsort as opposed to a Quicksort, because heapsort's average behaviour is also its worst-case behaviour.

Adding to the already good answers.A practical example would be Hash indexes vs B-tree indexes in postgres database.
Hash indexes form a hash table index to access the data on the disk while btree as the name suggests uses a Btree data structure.
In Big-O time these are O(1) vs O(logN).
Hash indexes are presently discouraged in postgres since in a real life situation particularly in database systems, achieving hashing without collision is very hard(can lead to a O(N) worst case complexity) and because of this, it is even more harder to make them crash safe (called write ahead logging - WAL in postgres).
This tradeoff is made in this situation since O(logN) is good enough for indexes and implementing O(1) is pretty hard and the time difference would not really matter.

When n is small, and O(1) is constantly slow.

When the "1" work unit in O(1) is very high relative to the work unit in O(log n) and the expected set size is small-ish. For example, it's probably slower to compute Dictionary hash codes than iterate an array if there are only two or three items.
or
When the memory or other non-time resource requirements in the O(1) algorithm are exceptionally large relative to the O(log n) algorithm.

when redesigning a program, a procedure is found to be optimized with O(1) instead of O(lgN), but if it's not the bottleneck of this program, and it's hard to understand the O(1) alg. Then you would not have to use O(1) algorithm
when O(1) needs much memory that you cannot supply, while the time of O(lgN) can be accepted.

This is often the case for security applications that we want to design problems whose algorithms are slow on purpose in order to stop someone from obtaining an answer to a problem too quickly.
Here are a couple of examples off the top of my head.
Password hashing is sometimes made arbitrarily slow in order to make it harder to guess passwords by brute-force. This Information Security post has a bullet point about it (and much more).
Bit Coin uses a controllably slow problem for a network of computers to solve in order to "mine" coins. This allows the currency to be mined at a controlled rate by the collective system.
Asymmetric ciphers (like RSA) are designed to make decryption without the keys intentionally slow in order to prevent someone else without the private key to crack the encryption. The algorithms are designed to be cracked in hopefully O(2^n) time where n is the bit-length of the key (this is brute force).
Elsewhere in CS, Quick Sort is O(n^2) in the worst case but in the general case is O(n*log(n)). For this reason, "Big O" analysis sometimes isn't the only thing you care about when analyzing algorithm efficiency.

There are plenty of good answers, a few of which mention the constant factor, the input size and memory constraints, among many other reasons complexity is only a theoretical guideline rather than the end-all determination of real-world fitness for a given purpose or speed.
Here's a simple, concrete example to illustrate these ideas. Let's say we want to figure out whether an array has a duplicate element. The naive quadratic approach is to write a nested loop:
const hasDuplicate = arr => {
for (let i = 0; i < arr.length; i++) {
for (let j = i + 1; j < arr.length; j++) {
if (arr[i] === arr[j]) {
return true;
}
}
}
return false;
};
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
But this can be done in linear time by creating a set data structure (i.e. removing duplicates), then comparing its size to the length of the array:
const hasDuplicate = arr => new Set(arr).size !== arr.length;
console.log(hasDuplicate([1, 2, 3, 4]));
console.log(hasDuplicate([1, 2, 4, 4]));
Big O tells us is that the new Set approach will scale a great deal better from a time complexity standpoint.
However, it turns out that the "naive" quadratic approach has a lot going for it that Big O can't account for:
No additional memory usage
No heap memory allocation (no new)
No garbage collection for the temporary Set
Early bailout; in a case when the duplicate is known to be likely in the front of the array, there's no need to check more than a few elements.
If our use case is on bounded small arrays, we have a resource-constrained environment and/or other known common-case properties allow us to establish through benchmarks that the nested loop is faster on our particular workload, it might be a good idea.
On the other hand, maybe the set can be created one time up-front and used repeatedly, amortizing its overhead cost across all of the lookups.
This leads inevitably to maintainability/readability/elegance and other "soft" costs. In this case, the new Set() approach is probably more readable, but it's just as often (if not more often) that achieving the better complexity comes at great engineering cost.
Creating and maintaining a persistent, stateful Set structure can introduce bugs, memory/cache pressure, code complexity, and all other manner of design tradeoffs. Negotiating these tradeoffs optimally is a big part of software engineering, and time complexity is just one factor to help guide that process.
A few other examples that I don't see mentioned yet:
In real-time environments, for example resource-constrained embedded systems, sometimes complexity sacrifices are made (typically related to caches and memory or scheduling) to avoid incurring occasional worst-case penalties that can't be tolerated because they might cause jitter.
Also in embedded programming, the size of the code itself can cause cache pressure, impacting memory performance. If an algorithm has worse complexity but will result in massive code size savings, that might be a reason to choose it over an algorithm that's theoretically better.
In most implementations of recursive linearithmic algorithms like quicksort, when the array is small enough, a quadratic sorting algorithm like insertion sort is often called because the overhead of recursive function calls on increasingly tiny arrays tends to outweigh the cost of nested loops. Insertion sort is also fast on mostly-sorted arrays as the inner loop won't run much. This answer discusses this in an older version of Chrome's V8 engine before they moved to Timsort.

O(n log n) vs O(n) -- practical differences in time complexity

n log n > n -- but this is like a pseudo-linear relationship. If n=1 billion, log n ~ 30;
So n log n will be 30 billion, which is 30 X n, order of n.
I am wondering if this time complexity difference between n log n and n are significant in real life.
Eg: A quick select on finding kth element in an unsorted array is O(n) using quickselect algorithm.
If I sort the array and find the kth element, it is O(n log n). To sort an array with 1 trillion elements, I will be 60 times slower if I do quicksort and index it.

The main purpose of the Big-O notation is to let you do the estimates like the ones you did in your post, and decide for yourself if spending your effort coding a typically more advanced algorithm is worth the additional CPU cycles that you are going to buy with that code improvement. Depending on the circumstances, you may get a different answer, even when your data set is relatively small:
If you are running on a mobile device, and the algorithm represents a significant portion of the execution time, cutting down the use of CPU translates into extending the battery life
If you are running in an all-or-nothing competitive environment, such as a high-frequency trading system, a micro-optimization may differentiate between making money and losing money
When your profiling shows that the algorithm in question dominates the execution time in a server environment, switching to a faster algorithm may improve performance for all your clients.
Another thing the Big-O notation hides is the constant multiplication factor. For example, Quick Select has very reasonable multiplier, making the time savings from employing it on extremely large data sets well worth the trouble of implementing it.
Another thing that you need to keep in mind is the space complexity. Very often, algorithms with O(N*Log N) time complexity would have an O(Log N) space complexity. This may present a problem for extremely large data sets, for example when a recursive function runs on a system with a limited stack capacity.

It depends.
I was working at amazon, there was a method, which was doing linear search on a list. We could use a Hashtable and do the look up in O(1) compared to O(n).
I suggested the change, and it wasn't approved. because the input was small, it wouldn't really make a huge difference.
However, if the input is large, then it would make a difference.
In another company, where the data/input was huge, using a Tree, Compared to List made a huge difference. So it depends on the data and architecture of the application.
It is always good to know your options and how you can optimize.

There are times when you will work with billions of elements (and more), where that difference will certainly be significant.
There are other times when you will be working with less than a thousand elements, in which case the difference probably won't be all that significant.
If you have a decent idea what your data will look like, you should have a decent idea which one to pick from the start, but the difference between O(n) and O(n log n) is small enough that it's probably best to start off with whichever one is simplest, benchmark it and only try to improve it if you see it's too slow.
However, note that O(n) may actually be slower than O(n log n) for any given value of n (especially, but not necessarily, for small values of n) because of the constant factors involved, since big-O ignores those (it only considers what happens when n tends to infinity), so, if you're looking purely at the time complexity, what you think may be an improvement may actually slow things down.

Darth Vader is correct. It always depends. Its also important to rememeber that complexities are asymptotic, worst-case (usually) and that constants are dropped. Each of these is important to consider.
So you could have two algorithms, one of which is O(n) and one of which is O(nlogn), and for every value up to the number of atoms in the universe and beyond (to some finite value of n), the O(nlogn) algorithm outperforms the O(n) algorithm. It could be because lower order terms are dominating, or it could be because in the average case, the O(nlogn) algorithm is actually O(n), or because the actual number of steps is something like 5,000,000n vs 3nlogn.

PriorityQueue Sorts each element that you add each time while using Collections.sort() will sort all the elements in a single go. But if you have a problem where you want to get the biggest element as soon as possible use PriorityQueue on the other hand if you need to perform some computations but requires the element to be sorted then using ArrayList with Collections.Sort is best

O(log N) == O(1) - Why not?

Whenever I consider algorithms/data structures I tend to replace the log(N) parts by constants. Oh, I know log(N) diverges - but does it matter in real world applications?
log(infinity) < 100 for all practical purposes.
I am really curious for real world examples where this doesn't hold.
To clarify:
I understand O(f(N))
I am curious about real world examples where the asymptotic behaviour matters more than the constants of the actual performance.
If log(N) can be replaced by a constant it still can be replaced by a constant in O( N log N).
This question is for the sake of (a) entertainment and (b) to gather arguments to use if I run (again) into a controversy about the performance of a design.

Big O notation tells you about how your algorithm changes with growing input. O(1) tells you it doesn't matter how much your input grows, the algorithm will always be just as fast. O(logn) says that the algorithm will be fast, but as your input grows it will take a little longer.
O(1) and O(logn) makes a big diference when you start to combine algorithms.
Take doing joins with indexes for example. If you could do a join in O(1) instead of O(logn) you would have huge performance gains. For example with O(1) you can join any amount of times and you still have O(1). But with O(logn) you need to multiply the operation count by logn each time.
For large inputs, if you had an algorithm that was O(n^2) already, you would much rather do an operation that was O(1) inside, and not O(logn) inside.
Also remember that Big-O of anything can have a constant overhead. Let's say that constant overhead is 1 million. With O(1) that constant overhead does not amplify the number of operations as much as O(logn) does.
Another point is that everyone thinks of O(logn) representing n elements of a tree data structure for example. But it could be anything including bytes in a file.

I think this is a pragmatic approach; O(logN) will never be more than 64. In practice, whenever terms get as 'small' as O(logN), you have to measure to see if the constant factors win out. See also
Uses of Ackermann function?
To quote myself from comments on another answer:
[Big-Oh] 'Analysis' only matters for factors
that are at least O(N). For any
smaller factor, big-oh analysis is
useless and you must measure.
and
"With O(logN) your input size does
matter." This is the whole point of
the question. Of course it matters...
in theory. The question the OP asks
is, does it matter in practice? I
contend that the answer is no, there
is not, and never will be, a data set
for which logN will grow so fast as to
always be beaten a constant-time
algorithm. Even for the largest
practical dataset imaginable in the
lifetimes of our grandchildren, a logN
algorithm has a fair chance of beating
a constant time algorithm - you must
always measure.
EDIT
A good talk:
http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
about halfway through, Rich discusses Clojure's hash tries, which are clearly O(logN), but the base of the logarithm is large and so the depth of the trie is at most 6 even if it contains 4 billion values. Here "6" is still an O(logN) value, but it is an incredibly small value, and so choosing to discard this awesome data structure because "I really need O(1)" is a foolish thing to do. This emphasizes how most of the other answers to this question are simply wrong from the perspective of the pragmatist who wants their algorithm to "run fast" and "scale well", regardless of what the "theory" says.
EDIT
See also
http://queue.acm.org/detail.cfm?id=1814327
which says
What good is an O(log2(n)) algorithm
if those operations cause page faults
and slow disk operations? For most
relevant datasets an O(n) or even an
O(n^2) algorithm, which avoids page
faults, will run circles around it.
(but go read the article for context).

This is a common mistake - remember Big O notation is NOT telling you about the absolute performance of an algorithm at a given value, it's simply telling you the behavior of an algorithm as you increase the size of the input.
When you take it in that context it becomes clear why an algorithm A ~ O(logN) and an algorithm B ~ O(1) algorithm are different:
if I run A on an input of size a, then on an input of size 1000000*a, I can expect the second input to take log(1,000,000) times as long as the first input
if I run B on an input of size a, then on an input of size 1000000*a, I can expect the second input to take about the same amount of time as the first input
EDIT: Thinking over your question some more, I do think there's some wisdom to be had in it. While I would never say it's correct to say O(lgN) == O(1), It IS possible that an O(lgN) algorithm might be used over an O(1) algorithm. This draws back to the point about absolute performance above: Just knowing one algorithm is O(1) and another algorithm is O(lgN) is NOT enough to declare you should use the O(1) over the O(lgN), it's certainly possible given your range of possible inputs an O(lgN) might serve you best.

You asked for a real-world example. I'll give you one. Computational biology. One strand of DNA encoded in ASCII is somewhere on the level of gigabytes in space. A typical database will obviously have many thousands of such strands.
Now, in the case of an indexing/searching algorithm, that log(n) multiple makes a large difference when coupled with constants. The reason why? This is one of the applications where the size of your input is astronomical. Additionally, the input size will always continue to grow.
Admittedly, these type of problems are rare. There are only so many applications this large. In those circumstances, though... it makes a world of difference.

Equality, the way you're describing it, is a common abuse of notation.
To clarify: we usually write f(x) = O(logN) to imply "f(x) is O(logN)".
At any rate, O(1) means a constant number of steps/time (as an upper bound) to perform an action regardless of how large the input set is. But for O(logN), number of steps/time still grows as a function of the input size (the logarithm of it), it just grows very slowly. For most real world applications you may be safe in assuming that this number of steps will not exceed 100, however I'd bet there are multiple examples of datasets large enough to mark your statement both dangerous and void (packet traces, environmental measurements, and many more).

For small enough N, O(N^N) can in practice be replaced with 1. Not O(1) (by definition), but for N=2 you can see it as one operation with 4 parts, or a constant-time operation.
What if all operations take 1hour? The difference between O(log N) and O(1) is then large, even with small N.
Or if you need to run the algorithm ten million times? Ok, that took 30minutes, so when I run it on a dataset a hundred times as large it should still take 30minutes because O(logN) is "the same" as O(1).... eh...what?
Your statement that "I understand O(f(N))" is clearly false.
Real world applications, oh... I don't know.... EVERY USE OF O()-notation EVER?
Binary search in sorted list of 10 million items for example. It's the very REASON we use hash tables when the data gets big enough. If you think O(logN) is the same as O(1), then why would you EVER use a hash instead of a binary tree?

As many have already said, for the real world, you need to look at the constant factors first, before even worrying about factors of O(log N).
Then, consider what you will expect N to be. If you have good reason to think that N<10, you can use a linear search instead of a binary one. That's O(N) instead of O(log N), which according to your lights would be significant -- but a linear search that moves found elements to the front may well outperform a more complicated balanced tree, depending on the application.
On the other hand, note that, even if log N is not likely to exceed 50, a performance factor of 10 is really huge -- if you're compute-bound, a factor like that can easily make or break your application. If that's not enough for you, you'll frequently see factors of (log N)^2 or (logN)^3 in algorithms, so even if you think you can ignore one factor of (log N), that doesn't mean you can ignore more of them.
Finally, note that the simplex algorithm for linear programming has a worst case performance of O(2^n). However, for practical problems, the worst case never comes up; in practice, the simplex algorithm is fast, relatively simple, and consequently very popular.
About 30 years ago, someone developed a polynomial-time algorithm for linear programming, but it was not initially practical because the result was too slow.
Nowadays, there are practical alternative algorithms for linear programming (with polynomial-time wost-case, for what that's worth), which can outperform the simplex method in practice. But, depending on the problem, the simplex method is still competitive.

The observation that O(log n) is oftentimes indistinguishable from O(1) is a good one.
As a familiar example, suppose we wanted to find a single element in a sorted array of one 1,000,000,000,000 elements:
with linear search, the search takes on average 500,000,000,000 steps
with binary search, the search takes on average 40 steps
Suppose we added a single element to the array we are searching, and now we must search for another element:
with linear search, the search takes on average 500,000,000,001 steps (indistinguishable change)
with binary search, the search takes on average 40 steps (indistinguishable change)
Suppose we doubled the number of elements in the array we are searching, and now we must search for another element:
with linear search, the search takes on average 1,000,000,000,000 steps (extraordinarily noticeable change)
with binary search, the search takes on average 41 steps (indistinguishable change)
As we can see from this example, for all intents and purposes, an O(log n) algorithm like binary search is oftentimes indistinguishable from an O(1) algorithm like omniscience.
The takeaway point is this: *we use O(log n) algorithms because they are often indistinguishable from constant time, and because they often perform phenomenally better than linear time algorithms.
Obviously, these examples assume reasonable constants. Obviously, these are generic observations and do not apply to all cases. Obviously, these points apply at the asymptotic end of the curve, not the n=3 end.
But this observation explains why, for example, we use such techniques as tuning a query to do an index seek rather than a table scan - because an index seek operates in nearly constant time no matter the size of the dataset, while a table scan is crushingly slow on sufficiently large datasets. Index seek is O(log n).

You might be interested in Soft-O, which ignores logarithmic cost. Check this paragraph in Wikipedia.

What do you mean by whether or not it "matters"?
If you're faced with the choice of an O(1) algorithm and a O(lg n) one, then you should not assume they're equal. You should choose the constant-time one. Why wouldn't you?
And if no constant-time algorithm exists, then the logarithmic-time one is usually the best you can get. Again, does it then matter? You just have to take the fastest you can find.
Can you give me a situation where you'd gain anything by defining the two as equal? At best, it'd make no difference, and at worst, you'd hide some real scalability characteristics. Because usually, a constant-time algorithm will be faster than a logarithmic one.
Even if, as you say, lg(n) < 100 for all practical purposes, that's still a factor 100 on top of your other overhead. If I call your function, N times, then it starts to matter whether your function runs logarithmic time or constant, because the total complexity is then O(n lg n) or O(n).
So rather than asking if "it matters" that you assume logarithmic complexity to be constant in "the real world", I'd ask if there's any point in doing that.
Often you can assume that logarithmic algorithms are fast enough, but what do you gain by considering them constant?

O(logN)*O(logN)*O(logN) is very different. O(1) * O(1) * O(1) is still constant.
Also a simple quicksort-style O(nlogn) is different than O(n O(1))=O(n). Try sorting 1000 and 1000000 elements. The latter isn't 1000 times slower, it's 2000 times, because log(n^2)=2log(n)

The title of the question is misleading (well chosen to drum up debate, mind you).
O(log N) == O(1) is obviously wrong (and the poster is aware of this). Big O notation, by definition, regards asymptotic analysis. When you see O(N), N is taken to approach infinity. If N is assigned a constant, it's not Big O.
Note, this isn't just a nitpicky detail that only theoretical computer scientists need to care about. All of the arithmetic used to determine the O function for an algorithm relies on it. When you publish the O function for your algorithm, you might be omitting a lot of information about it's performance.
Big O analysis is cool, because it lets you compare algorithms without getting bogged down in platform specific issues (word sizes, instructions per operation, memory speed versus disk speed). When N goes to infinity, those issues disappear. But when N is 10000, 1000, 100, those issues, along with all of the other constants that we left out of the O function, start to matter.
To answer the question of the poster: O(log N) != O(1), and you're right, algorithms with O(1) are sometimes not much better than algorithms with O(log N), depending on the size of the input, and all of those internal constants that got omitted during Big O analysis.
If you know you're going to be cranking up N, then use Big O analysis. If you're not, then you'll need some empirical tests.

In theory
Yes, in practical situations log(n) is bounded by a constant, we'll say 100. However, replacing log(n) by 100 in situations where it's correct is still throwing away information, making the upper bound on operations that you have calculated looser and less useful. Replacing an O(log(n)) by an O(1) in your analysis could result in your large n case performing 100 times worse than you expected based on your small n case. Your theoretical analysis could have been more accurate and could have predicted an issue before you'd built the system.
I would argue that the practical purpose of big-O analysis is to try and predict the execution time of your algorithm as early as possible. You can make your analysis easier by crossing out the log(n) terms, but then you've reduced the predictive power of the estimate.
In practice
If you read the original papers by Larry Page and Sergey Brin on the Google architecture, they talk about using hash tables for everything to ensure that e.g. the lookup of a cached web page only takes one hard-disk seek. If you used B-tree indices to lookup you might need four or five hard-disk seeks to do an uncached lookup [*]. Quadrupling your disk requirements on your cached web page storage is worth caring about from a business perspective, and predictable if you don't cast out all the O(log(n)) terms.
P.S. Sorry for using Google as an example, they're like Hitler in the computer science version of Godwin's law.
[*] Assuming 4KB reads from disk, 100bn web pages in the index, ~ 16 bytes per key in a B-tree node.

As others have pointed out, Big-O tells you about how the performance of your problem scales. Trust me - it matters. I have encountered several times algorithms that were just terrible and failed to meet the customers demands because they were too slow. Understanding the difference and finding an O(1) solution is a lot of times a huge improvement.
However, of course, that is not the whole story - for instance, you may notice that quicksort algorithms will always switch to insertion sort for small elements (Wikipedia says 8 - 20) because of the behaviour of both algorithms on small datasets.
So it's a matter of understanding what tradeoffs you will be doing which involves a thorough understanding of the problem, the architecture, & experience to understand which to use, and how to adjust the constants involved.
No one is saying that O(1) is always better than O(log N). However, I can guarantee you that an O(1) algorithm will also scale way better, so even if you make incorrect assumptions about how many users will be on the system, or the size of the data to process, it won't matter to the algorithm.

Yes, log(N) < 100 for most practical purposes, and No, you can not always replace it by constant.
For example, this may lead to serious errors in estimating performance of your program. If O(N) program processed array of 1000 elements in 1 ms, then you are sure it will process 106 elements in 1 second (or so). If, though, the program is O(N*logN), then it will take it ~2 secs to process 106 elements. This difference may be crucial - for example, you may think you've got enough server power because you get 3000 requests per hour and you think your server can handle up to 3600.
Another example. Imagine you have function f() working in O(logN), and on each iteration calling function g(), which works in O(logN) as well. Then, if you replace both logs by constants, you think that your program works in constant time. Reality will be cruel though - two logs may give you up to 100*100 multiplicator.

The rules of determining the Big-O notation are simpler when you don't decide that O(log n) = O(1).
As krzysio said, you may accumulate O(log n)s and then they would make a very noticeable difference. Imagine you do a binary search: O(log n) comparisons, and then imagine that each comparison's complexity O(log n). If you neglect both you get O(1) instead of O(log2n). Similarly you may somehow arrive at O(log10n) and then you'll notice a big difference for not too large "n"s.

Assume that in your entire application, one algorithm accounts for 90% of the time the user waits for the most common operation.
Suppose in real time the O(1) operation takes a second on your architecture, and the O(logN) operation is basically .5 seconds * log(N). Well, at this point I'd really like to draw you a graph with an arrow at the intersection of the curve and the line, saying, "It matters right here." You want to use the log(N) op for small datasets and the O(1) op for large datasets, in such a scenario.
Big-O notation and performance optimization is an academic exercise rather than delivering real value to the user for operations that are already cheap, but if it's an expensive operation on a critical path, then you bet it matters!

For any algorithm that can take inputs of different sizes N, the number of operations it takes is upper-bounded by some function f(N).
All big-O tells you is the shape of that function.
O(1) means there is some number A such that f(N) < A for large N.
O(N) means there is some A such that f(N) < AN for large N.
O(N^2) means there is some A such that f(N) < AN^2 for large N.
O(log(N)) means there is some A such that f(N) < AlogN for large N.
Big-O says nothing about how big A is (i.e. how fast the algorithm is), or where these functions cross each other. It only says that when you are comparing two algorithms, if their big-Os differ, then there is a value of N (which may be small or it may be very large) where one algorithm will start to outperform the other.

you are right, in many cases it does not matter for pracitcal purposes. but the key question is "how fast GROWS N". most algorithms we know of take the size of the input, so it grows linearily.
but some algorithms have the value of N derived in a complex way. if N is "the number of possible lottery combinations for a lottery with X distinct numbers" it suddenly matters if your algorithm is O(1) or O(logN)

Big-OH tells you that one algorithm is faster than another given some constant factor. If your input implies a sufficiently small constant factor, you could see great performance gains by going with a linear search rather than a log(n) search of some base.

O(log N) can be misleading. Take for example the operations on Red-Black trees.
The operations are O(logN) but rather complex, which means many low level operations.

Whenever N is the amount of objects that is stored in some kind of memory, you're correct. After all, a binary search through EVERY byte representable by a 64-bit pointer can be achieved in just 64 steps. Actually, it's possible to do a binary search of all Planck volumes in the observable universe in just 618 steps.
So in almost all cases, it's safe to approximate O(log N) with O(N) as long as N is (or could be) a physical quantity, and we know for certain that as long as N is (or could be) a physical quantity, then log N < 618
But that is assuming N is that. It may represent something else. Note that it's not always clear what it is. Just as an example, take matrix multiplication, and assume square matrices for simplicity. The time complexity for matrix multiplication is O(N^3) for a trivial algorithm. But what is N here? It is the side length. It is a reasonable way of measuring the input size, but it would also be quite reasonable to use the number of elements in the matrix, which is N^2. Let M=N^2, and now we can say that the time complexity for trivial matrix multiplication is O(M^(3/2)) where M is the number of elements in a matrix.
Unfortunately, I don't have any real world problem per se, which was what you asked. But at least I can make up something that makes some sort of sense:
Let f(S) be a function that returns the sum of the hashes of all the elements in the power set of S. Here is some pesudo:
f(S):
ret = 0
for s = powerset(S))
ret += hash(s)
Here, hash is simply the hash function, and powerset is a generator function. Each time it's called, it will generate the next (according to some order) subset of S. A generator is necessary, because we would not be able to store the lists for huge data otherwise. Btw, here is a python example of such a power set generator:
def powerset(seq):
"""
Returns all the subsets of this set. This is a generator.
"""
if len(seq) <= 1:
yield seq
yield []
else:
for item in powerset(seq[1:]):
yield [seq[0]]+item
yield item
https://www.technomancy.org/python/powerset-generator-python/
So what is the time complexity for f? As with the matrix multiplication, we can choose N to represent many things, but at least two makes a lot of sense. One is number of elements in S, in which case the time complexity is O(2^N), but another sensible way of measuring it is that N is the number of element in the power set of S. In this case the time complexity is O(N)
So what will log N be for sensible sizes of S? Well, list with a million elements are not unusual. If n is the size of S and N is the size of P(S), then N=2^n. So O(log N) = O(log 2^n) = O(n * log 2) = O(n)
In this case it would matter, because it's rare that O(n) == O(log n) in the real world.

I do not believe algorithms where you can freely choose between O(1) with a large constant and O(logN) really exists. If there is N elements to work with at the beginning, it is just plain impossible to make it O(1), the only thing that is possible is move your N to some other part of your code.
What I try to say is that in all real cases I know off you have some space/time tradeoff, or some pre-treatment such as compiling data to a more efficient form.
That is, you do not really go O(1), you just move the N part elsewhere. Either you exchange performance of some part of your code with some memory amount either you exchange performance of one part of your algorithm with another one. To stay sane you should always look at the larger picture.
My point is that if you have N items they can't disappear. In other words you can choose between inefficient O(n^2) algorithms or worse and O(n.logN) : it's a real choice. But you never really go O(1).
What I try to point out is that for every problem and initial data state there is a 'best' algorithm. You can do worse but never better. With some experience you can have a good guessing of what is this intrisic complexity. Then if your overall treatment match that complexity you know you have something. You won't be able to reduce that complexity, but only to move it around.
If problem is O(n) it won't become O(logN) or O(1), you'll merely add some pre-treatment such that the overall complexity is unchanged or worse, and potentially a later step will be improved. Say you want the smaller element of an array, you can search in O(N) or sort the array using any common O(NLogN) sort treatment then have the first using O(1).
Is it a good idea to do that casually ? Only if your problem asked also for second, third, etc. elements. Then your initial problem was truly O(NLogN), not O(N).
And it's not the same if you wait ten times or twenty times longer for your result because you simplified saying O(1) = O(LogN).
I'm waiting for a counter-example ;-) that is any real case where you have choice between O(1) and O(LogN) and where every O(LogN) step won't compare to the O(1). All you can do is take a worse algorithm instead of the natural one or move some heavy treatment to some other part of the larger pictures (pre-computing results, using storage space, etc.)

Let's say you use an image-processing algorithm that runs in O(log N), where N is the number of images. Now... stating that it runs in constant time would make one believe that no matter how many images there are, it would still complete its task it about the same amount of time. If running the algorithm on a single image would hypothetically take a whole day, and assuming that O(logN) will never be more than 100... imagine the surprise of that person that would try to run the algorithm on a very large image database - he would expect it to be done in a day or so... yet it'll take months for it to finish.

Why is quicksort better than mergesort?

I was asked this question during an interview. They're both O(nlogn) and yet most people use Quicksort instead of Mergesort. Why is that?

Quicksort has O(n2) worst-case runtime and O(nlogn) average case runtime. However, it’s superior to merge sort in many scenarios because many factors influence an algorithm’s runtime, and, when taking them all together, quicksort wins out.
In particular, the often-quoted runtime of sorting algorithms refers to the number of comparisons or the number of swaps necessary to perform to sort the data. This is indeed a good measure of performance, especially since it’s independent of the underlying hardware design. However, other things – such as locality of reference (i.e. do we read lots of elements which are probably in cache?) – also play an important role on current hardware. Quicksort in particular requires little additional space and exhibits good cache locality, and this makes it faster than merge sort in many cases.
In addition, it’s very easy to avoid quicksort’s worst-case run time of O(n2) almost entirely by using an appropriate choice of the pivot – such as picking it at random (this is an excellent strategy).
In practice, many modern implementations of quicksort (in particular libstdc++’s std::sort) are actually introsort, whose theoretical worst-case is O(nlogn), same as merge sort. It achieves this by limiting the recursion depth, and switching to a different algorithm (heapsort) once it exceeds logn.

As many people have noted, the average case performance for quicksort is faster than mergesort. But this is only true if you are assuming constant time to access any piece of memory on demand.
In RAM this assumption is generally not too bad (it is not always true because of caches, but it is not too bad). However if your data structure is big enough to live on disk, then quicksort gets killed by the fact that your average disk does something like 200 random seeks per second. But that same disk has no trouble reading or writing megabytes per second of data sequentially. Which is exactly what mergesort does.
Therefore if data has to be sorted on disk, you really, really want to use some variation on mergesort. (Generally you quicksort sublists, then start merging them together above some size threshold.)
Furthermore if you have to do anything with datasets of that size, think hard about how to avoid seeks to disk. For instance this is why it is standard advice that you drop indexes before doing large data loads in databases, and then rebuild the index later. Maintaining the index during the load means constantly seeking to disk. By contrast if you drop the indexes, then the database can rebuild the index by first sorting the information to be dealt with (using a mergesort of course!) and then loading it into a BTREE datastructure for the index. (BTREEs are naturally kept in order, so you can load one from a sorted dataset with few seeks to disk.)
There have been a number of occasions where understanding how to avoid disk seeks has let me make data processing jobs take hours rather than days or weeks.

Actually, QuickSort is O(n2). Its average case running time is O(nlog(n)), but its worst-case is O(n2), which occurs when you run it on a list that contains few unique items. Randomization takes O(n). Of course, this doesn't change its worst case, it just prevents a malicious user from making your sort take a long time.
QuickSort is more popular because it:
Is in-place (MergeSort requires extra memory linear to number of elements to be sorted).
Has a small hidden constant.

"and yet most people use Quicksort instead of Mergesort. Why is that?"
One psychological reason that has not been given is simply that Quicksort is more cleverly named. ie good marketing.
Yes, Quicksort with triple partioning is probably one of the best general purpose sort algorithms, but theres no getting over the fact that "Quick" sort sounds much more powerful than "Merge" sort.

As others have noted, worst case of Quicksort is O(n^2), while mergesort and heapsort stay at O(nlogn). On the average case, however, all three are O(nlogn); so they're for the vast majority of cases comparable.
What makes Quicksort better on average is that the inner loop implies comparing several values with a single one, while on the other two both terms are different for each comparison. In other words, Quicksort does half as many reads as the other two algorithms. On modern CPUs performance is heavily dominated by access times, so in the end Quicksort ends up being a great first choice.

I'd like to add that of the three algoritms mentioned so far (mergesort, quicksort and heap sort) only mergesort is stable. That is, the order does not change for those values which have the same key. In some cases this is desirable.
But, truth be told, in practical situations most people need only good average performance and quicksort is... quick =)
All sort algorithms have their ups and downs. See Wikipedia article for sorting algorithms for a good overview.

From the Wikipedia entry on Quicksort:
Quicksort also competes with
mergesort, another recursive sort
algorithm but with the benefit of
worst-case Θ(nlogn) running time.
Mergesort is a stable sort, unlike
quicksort and heapsort, and can be
easily adapted to operate on linked
lists and very large lists stored on
slow-to-access media such as disk
storage or network attached storage.
Although quicksort can be written to
operate on linked lists, it will often
suffer from poor pivot choices without
random access. The main disadvantage
of mergesort is that, when operating
on arrays, it requires Θ(n) auxiliary
space in the best case, whereas the
variant of quicksort with in-place
partitioning and tail recursion uses
only Θ(logn) space. (Note that when
operating on linked lists, mergesort
only requires a small, constant amount
of auxiliary storage.)

Mu!
Quicksort is not better, it is well suited for a different kind of application, than mergesort.
Mergesort is worth considering if speed is of the essence, bad worst-case performance cannot be tolerated, and extra space is available.1
You stated that they «They're both O(nlogn) […]». This is wrong. «Quicksort uses about n^2/2 comparisons in the worst case.»1.
However the most important property according to my experience is the easy implementation of sequential access you can use while sorting when using programming languages with the imperative paradigm.
1 Sedgewick, Algorithms

I would like to add to the existing great answers some math about how QuickSort performs when diverging from best case and how likely that is, which I hope will help people understand a little better why the O(n^2) case is not of real concern in the more sophisticated implementations of QuickSort.
Outside of random access issues, there are two main factors that can impact the performance of QuickSort and they are both related to how the pivot compares to the data being sorted.
1) A small number of keys in the data. A dataset of all the same value will sort in n^2 time on a vanilla 2-partition QuickSort because all of the values except the pivot location are placed on one side each time. Modern implementations address this by methods such as using a 3-partition sort. These methods execute on a dataset of all the same value in O(n) time. So using such an implementation means that an input with a small number of keys actually improves performance time and is no longer a concern.
2) Extremely bad pivot selection can cause worst case performance. In an ideal case, the pivot will always be such that 50% the data is smaller and 50% the data is larger, so that the input will be broken in half during each iteration. This gives us n comparisons and swaps times log-2(n) recursions for O(n*logn) time.
How much does non-ideal pivot selection affect execution time?
Let's consider a case where the pivot is consistently chosen such that 75% of the data is on one side of the pivot. It's still O(n*logn) but now the base of the log has changed to 1/0.75 or 1.33. The relationship in performance when changing base is always a constant represented by log(2)/log(newBase). In this case, that constant is 2.4. So this quality of pivot choice takes 2.4 times longer than the ideal.
How fast does this get worse?
Not very fast until the pivot choice gets (consistently) very bad:
50% on one side: (ideal case)
75% on one side: 2.4 times as long
90% on one side: 6.6 times as long
95% on one side: 13.5 times as long
99% on one side: 69 times as long
As we approach 100% on one side the log portion of the execution approaches n and the whole execution asymptotically approaches O(n^2).
In a naive implementation of QuickSort, cases such as a sorted array (for 1st element pivot) or a reverse-sorted array (for last element pivot) will reliably produce a worst-case O(n^2) execution time. Additionally, implementations with a predictable pivot selection can be subjected to DoS attack by data that is designed to produce worst case execution. Modern implementations avoid this by a variety of methods, such as randomizing the data before sort, choosing the median of 3 randomly chosen indexes, etc. With this randomization in the mix, we have 2 cases:
Small data set. Worst case is reasonably possible but O(n^2) is not catastrophic because n is small enough that n^2 is also small.
Large data set. Worst case is possible in theory but not in practice.
How likely are we to see terrible performance?
The chances are vanishingly small. Let's consider a sort of 5,000 values:
Our hypothetical implementation will choose a pivot using a median of 3 randomly chosen indexes. We will consider pivots that are in the 25%-75% range to be "good" and pivots that are in the 0%-25% or 75%-100% range to be "bad". If you look at the probability distribution using the median of 3 random indexes, each recursion has an 11/16 chance of ending up with a good pivot. Let us make 2 conservative (and false) assumptions to simplify the math:
Good pivots are always exactly at a 25%/75% split and operate at 2.4*ideal case. We never get an ideal split or any split better than 25/75.
Bad pivots are always worst case and essentially contribute nothing to the solution.
Our QuickSort implementation will stop at n=10 and switch to an insertion sort, so we require 22 25%/75% pivot partitions to break the 5,000 value input down that far. (10*1.333333^22 > 5000) Or, we require 4990 worst case pivots. Keep in mind that if we accumulate 22 good pivots at any point then the sort will complete, so worst case or anything near it requires extremely bad luck. If it took us 88 recursions to actually achieve the 22 good pivots required to sort down to n=10, that would be 4*2.4*ideal case or about 10 times the execution time of the ideal case. How likely is it that we would not achieve the required 22 good pivots after 88 recursions?
Binomial probability distributions can answer that, and the answer is about 10^-18. (n is 88, k is 21, p is 0.6875) Your user is about a thousand times more likely to be struck by lightning in the 1 second it takes to click [SORT] than they are to see that 5,000 item sort run any worse than 10*ideal case. This chance gets smaller as the dataset gets larger. Here are some array sizes and their corresponding chances to run longer than 10*ideal:
Array of 640 items: 10^-13 (requires 15 good pivot points out of 60 tries)
Array of 5,000 items: 10^-18 (requires 22 good pivots out of 88 tries)
Array of 40,000 items:10^-23 (requires 29 good pivots out of 116)
Remember that this is with 2 conservative assumptions that are worse than reality. So actual performance is better yet, and the balance of the remaining probability is closer to ideal than not.
Finally, as others have mentioned, even these absurdly unlikely cases can be eliminated by switching to a heap sort if the recursion stack goes too deep. So the TLDR is that, for good implementations of QuickSort, the worst case does not really exist because it has been engineered out and execution completes in O(n*logn) time.

This is a common question asked in the interviews that despite of better worst case performance of merge sort, quicksort is considered better than merge sort, especially for a large input. There are certain reasons due to which quicksort is better:
1- Auxiliary Space: Quick sort is an in-place sorting algorithm. In-place sorting means no additional storage space is needed to perform sorting. Merge sort on the other hand requires a temporary array to merge the sorted arrays and hence it is not in-place.
2- Worst case: The worst case of quicksort O(n^2) can be avoided by using randomized quicksort. It can be easily avoided with high probability by choosing the right pivot. Obtaining an average case behavior by choosing right pivot element makes it improvise the performance and becoming as efficient as Merge sort.
3- Locality of reference: Quicksort in particular exhibits good cache locality and this makes it faster than merge sort in many cases like in virtual memory environment.
4- Tail recursion: QuickSort is tail recursive while Merge sort is not. A tail recursive function is a function where recursive call is the last thing executed by the function. The tail recursive functions are considered better than non tail recursive functions as tail-recursion can be optimized by compiler.

Quicksort is the fastest sorting algorithm in practice but has a number of pathological cases that can make it perform as badly as O(n2).
Heapsort is guaranteed to run in O(n*ln(n)) and requires only finite additional storage. But there are many citations of real world tests which show that heapsort is significantly slower than quicksort on average.

Quicksort is NOT better than mergesort. With O(n^2) (worst case that rarely happens), quicksort is potentially far slower than the O(nlogn) of the merge sort. Quicksort has less overhead, so with small n and slow computers, it is better. But computers are so fast today that the additional overhead of a mergesort is negligible, and the risk of a very slow quicksort far outweighs the insignificant overhead of a mergesort in most cases.
In addition, a mergesort leaves items with identical keys in their original order, a useful attribute.

Wikipedia's explanation is:
Typically, quicksort is significantly faster in practice than other Θ(nlogn) algorithms, because its inner loop can be efficiently implemented on most architectures, and in most real-world data it is possible to make design choices which minimize the probability of requiring quadratic time.
Quicksort
Mergesort
I think there are also issues with the amount of storage needed for Mergesort (which is Ω(n)) that quicksort implementations don't have. In the worst case, they are the same amount of algorithmic time, but mergesort requires more storage.

Why Quicksort is good?
QuickSort takes N^2 in worst case and NlogN average case. The worst case occurs when data is sorted.
This can be mitigated by random shuffle before sorting is started.
QuickSort doesn't takes extra memory that is taken by merge sort.
If the dataset is large and there are identical items, complexity of Quicksort reduces by using 3 way partition. More the no of identical items better the sort. If all items are identical, it sorts in linear time. [This is default implementation in most libraries]
Is Quicksort always better than Mergesort?
Not really.
Mergesort is stable but Quicksort is not. So if you need stability in output, you would use Mergesort. Stability is required in many practical applications.
Memory is cheap nowadays. So if extra memory used by Mergesort is not critical to your application, there is no harm in using Mergesort.
Note: In java, Arrays.sort() function uses Quicksort for primitive data types and Mergesort for object data types. Because objects consume memory overhead, so added a little overhead for Mergesort may not be any issue for performance point of view.
Reference: Watch the QuickSort videos of Week 3, Princeton Algorithms Course at Coursera

Unlike Merge Sort Quick Sort doesn't uses an auxilary space. Whereas Merge Sort uses an auxilary space O(n).
But Merge Sort has the worst case time complexity of O(nlogn) whereas the worst case complexity of Quick Sort is O(n^2) which happens when the array is already is sorted.

The answer would slightly tilt towards quicksort w.r.t to changes brought with DualPivotQuickSort for primitive values . It is used in JAVA 7 to sort in java.util.Arrays
It is proved that for the Dual-Pivot Quicksort the average number of
comparisons is 2*n*ln(n), the average number of swaps is 0.8*n*ln(n),
whereas classical Quicksort algorithm has 2*n*ln(n) and 1*n*ln(n)
respectively. Full mathematical proof see in attached proof.txt
and proof_add.txt files. Theoretical results are also confirmed
by experimental counting of the operations.
You can find the JAVA7 implmentation here - http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/Arrays.java
Further Awesome Reading on DualPivotQuickSort - http://permalink.gmane.org/gmane.comp.java.openjdk.core-libs.devel/2628

In merge-sort, the general algorithm is:
Sort the left sub-array
Sort the right sub-array
Merge the 2 sorted sub-arrays
At the top level, merging the 2 sorted sub-arrays involves dealing with N elements.
One level below that, each iteration of step 3 involves dealing with N/2 elements, but you have to repeat this process twice. So you're still dealing with 2 * N/2 == N elements.
One level below that, you're merging 4 * N/4 == N elements, and so on. Every depth in the recursive stack involves merging the same number of elements, across all calls for that depth.
Consider the quick-sort algorithm instead:
Pick a pivot point
Place the pivot point at the correct place in the array, with all smaller elements to the left, and larger elements to the right
Sort the left-subarray
Sort the right-subarray
At the top level, you're dealing with an array of size N. You then pick one pivot point, put it in its correct position, and can then ignore it completely for the rest of the algorithm.
One level below that, you're dealing with 2 sub-arrays that have a combined size of N-1 (ie, subtract the earlier pivot point). You pick a pivot point for each sub-array, which comes up to 2 additional pivot points.
One level below that, you're dealing with 4 sub-arrays with combined size N-3, for the same reasons as above.
Then N-7... Then N-15... Then N-32...
The depth of your recursive stack remains approximately the same (logN). With merge-sort, you're always dealing with a N-element merge, across each level of the recursive stack. With quick-sort though, the number of elements that you're dealing with diminishes as you go down the stack. For example, if you look at the depth midway through the recursive stack, the number of elements you're dealing with is N - 2^((logN)/2)) == N - sqrt(N).
Disclaimer: On merge-sort, because you divide the array into 2 exactly equal chunks each time, the recursive depth is exactly logN. On quick-sort, because your pivot point is unlikely to be exactly in the middle of the array, the depth of your recursive stack may be slightly greater than logN. I haven't done the math to see how big a role this factor and the factor described above, actually play in the algorithm's complexity.

This is a pretty old question, but since I've dealt with both recently here are my 2c:
Merge sort needs on average ~ N log N comparisons. For already (almost) sorted sorted arrays this gets down to 1/2 N log N, since while merging we (almost) always select "left" part 1/2 N of times and then just copy right 1/2 N elements. Additionally I can speculate that already sorted input makes processor's branch predictor shine but guessing almost all branches correctly, thus preventing pipeline stalls.
Quick sort on average requires ~ 1.38 N log N comparisons. It does not benefit greatly from already sorted array in terms of comparisons (however it does in terms of swaps and probably in terms of branch predictions inside CPU).
My benchmarks on fairly modern processor shows the following:
When comparison function is a callback function (like in qsort() libc implementation) quicksort is slower than mergesort by 15% on random input and 30% for already sorted array for 64 bit integers.
On the other hand if comparison is not a callback, my experience is that quicksort outperforms mergesort by up to 25%.
However if your (large) array has a very few unique values, merge sort starts gaining over quicksort in any case.
So maybe the bottom line is: if comparison is expensive (e.g. callback function, comparing strings, comparing many parts of a structure mostly getting to a second-third-forth "if" to make difference) - the chances are that you will be better with merge sort. For simpler tasks quicksort will be faster.
That said all previously said is true:
- Quicksort can be N^2, but Sedgewick claims that a good randomized implementation has more chances of a computer performing sort to be struck by a lightning than to go N^2
- Mergesort requires extra space

Quicksort has a better average case complexity but in some applications it is the wrong choice. Quicksort is vulnerable to denial of service attacks. If an attacker can choose the input to be sorted, he can easily construct a set that takes the worst case time complexity of o(n^2).
Mergesort's average case complexity and worst case complexity are the same, and as such doesn't suffer the same problem. This property of merge-sort also makes it the superior choice for real-time systems - precisely because there aren't pathological cases that cause it to run much, much slower.
I'm a bigger fan of Mergesort than I am of Quicksort, for these reasons.

That's hard to say.The worst of MergeSort is n(log2n)-n+1,which is accurate if n equals 2^k(I have already proved this).And for any n,it's between (n lg n - n + 1) and (n lg n + n + O(lg n)).But for quickSort,its best is nlog2n(also n equals 2^k).If you divide Mergesort by quickSort,it equals one when n is infinite.So it's as if the worst case of MergeSort is better than the best case of QuickSort,why do we use quicksort?But remember,MergeSort is not in place,it require 2n memeroy space.And MergeSort also need to do many array copies,which we don't include in the analysis of algorithm.In a word,MergeSort is really faseter than quicksort in theroy,but in reality you need to consider memeory space,the cost of array copy,merger is slower than quick sort.I once made an experiment where I was given 1000000 digits in java by Random class,and it took 2610ms by mergesort,1370ms by quicksort.

Quick sort is worst case O(n^2), however, the average case consistently out performs merge sort. Each algorithm is O(nlogn), but you need to remember that when talking about Big O we leave off the lower complexity factors. Quick sort has significant improvements over merge sort when it comes to constant factors.
Merge sort also requires O(2n) memory, while quick sort can be done in place (requiring only O(n)). This is another reason that quick sort is generally preferred over merge sort.
Extra info:
The worst case of quick sort occurs when the pivot is poorly chosen. Consider the following example:
[5, 4, 3, 2, 1]
If the pivot is chosen as the smallest or largest number in the group then quick sort will run in O(n^2). The probability of choosing the element that is in the largest or smallest 25% of the list is 0.5. That gives the algorithm a 0.5 chance of being a good pivot. If we employ a typical pivot choosing algorithm (say choosing a random element), we have 0.5 chance of choosing a good pivot for every choice of a pivot. For collections of a large size the probability of always choosing a poor pivot is 0.5 * n. Based on this probability quick sort is efficient for the average (and typical) case.

When I experimented with both sorting algorithms, by counting the number of recursive calls,
quicksort consistently has less recursive calls than mergesort.
It is because quicksort has pivots, and pivots are not included in the next recursive calls. That way quicksort can reach recursive base case more quicker than mergesort.

While they're both in the same complexity class, that doesn't mean they both have the same runtime. Quicksort is usually faster than mergesort, just because it's easier to code a tight implementation and the operations it does can go faster. It's because that quicksort is generally faster that people use it instead of mergesort.
However! I personally often will use mergesort or a quicksort variant that degrades to mergesort when quicksort does poorly. Remember. Quicksort is only O(n log n) on average. It's worst case is O(n^2)! Mergesort is always O(n log n). In cases where realtime performance or responsiveness is a must and your input data could be coming from a malicious source, you should not use plain quicksort.

All things being equal, I'd expect most people to use whatever is most conveniently available, and that tends to be qsort(3). Other than that quicksort is known to be very fast on arrays, just like mergesort is the common choice for lists.
What I'm wondering is why it's so rare to see radix or bucket sort. They're O(n), at least on linked lists and all it takes is some method of converting the key to an ordinal number. (strings and floats work just fine.)
I'm thinking the reason has to do with how computer science is taught. I even had to demonstrate to my lecturer in Algorithm analysis that it was indeed possible to sort faster than O(n log(n)). (He had the proof that you can't comparison sort faster than O(n log(n)), which is true.)
In other news, floats can be sorted as integers, but you have to turn the negative numbers around afterwards.
Edit:
Actually, here's an even more vicious way to sort floats-as-integers: http://www.stereopsis.com/radix.html. Note that the bit-flipping trick can be used regardless of what sorting algorithm you actually use...

Small additions to quick vs merge sorts.
Also it can depend on kind of sorting items. If access to items, swap and comparisons is not simple operations, like comparing integers in plane memory, then merge sort can be preferable algorithm.
For example , we sort items using network protocol on remote server.
Also, in custom containers like "linked list", the are no benefit of quick sort.
1. Merge sort on linked list, don't need additional memory.
2. Access to elements in quick sort is not sequential (in memory)

Quick sort is an in-place sorting algorithm, so its better suited for arrays. Merge sort on the other hand requires extra storage of O(N), and is more suitable for linked lists.
Unlike arrays, in liked list we can insert items in the middle with O(1) space and O(1) time, therefore the merge operation in merge sort can be implemented without any extra space. However, allocating and de-allocating extra space for arrays have an adverse effect on the run time of merge sort. Merge sort also favors linked list as data is accessed sequentially, without much random memory access.
Quick sort on the other hand requires a lot of random memory access and with an array we can directly access the memory without any traversing as required by linked lists. Also quick sort when used for arrays have a good locality of reference as arrays are stored contiguously in memory.
Even though both sorting algorithms average complexity is O(NlogN), usually people for ordinary tasks uses an array for storage, and for that reason quick sort should be the algorithm of choice.
EDIT: I just found out that merge sort worst/best/avg case is always nlogn, but quick sort can vary from n2(worst case when elements are already sorted) to nlogn(avg/best case when pivot always divides the array in two halves).

Consider time and space complexity both.
For Merge sort :
Time complexity : O(nlogn) ,
Space complexity : O(nlogn)
For Quick sort :
Time complexity : O(n^2) ,
Space complexity : O(n)
Now, they both win in one scenerio each.
But, using a random pivot you can almost always reduce Time complexity of Quick sort to O(nlogn).
Thus, Quick sort is preferred in many applications instead of Merge sort.

In c/c++ land, when not using stl containers, I tend to use quicksort, because it is built
into the run time, while mergesort is not.
So I believe that in many cases, it is simply the path of least resistance.
In addition performance can be much higher with quick sort, for cases where the entire dataset does not fit into the working set.

One of the reason is more philosophical. Quicksort is Top->Down philosophy. With n elements to sort, there are n! possibilities. With 2 partitions of m & n-m which are mutually exclusive, the number of possibilities go down in several orders of magnitude. m! * (n-m)! is smaller by several orders than n! alone. imagine 5! vs 3! *2!. 5! has 10 times more possibilities than 2 partitions of 2 & 3 each . and extrapolate to 1 million factorial vs 900K!*100K! vs. So instead of worrying about establishing any order within a range or a partition,just establish order at a broader level in partitions and reduce the possibilities within a partition. Any order established earlier within a range will be disturbed later if the partitions themselves are not mutually exclusive.
Any bottom up order approach like merge sort or heap sort is like a workers or employee's approach where one starts comparing at a microscopic level early. But this order is bound to be lost as soon as an element in between them is found later on. These approaches are very stable & extremely predictable but do a certain amount of extra work.
Quick Sort is like Managerial approach where one is not initially concerned about any order , only about meeting a broad criterion with No regard for order. Then the partitions are narrowed until you get a sorted set. The real challenge in Quicksort is in finding a partition or criterion in the dark when you know nothing about the elements to sort. That is why we either need to spend some effort to find a median value or pick 1 at random or some arbitrary "Managerial" approach . To find a perfect median can take significant amount of effort and leads to a stupid bottom up approach again. So Quicksort says just a pick a random pivot and hope that it will be somewhere in the middle or do some work to find median of 3 , 5 or something more to find a better median but do not plan to be perfect & don't waste any time in initially ordering. That seems to do well if you are lucky or sometimes degrades to n^2 when you don't get a median but just take a chance. Any way data is random. right.
So I agree more with the top ->down logical approach of quicksort & it turns out that the chance it takes about pivot selection & comparisons that it saves earlier seems to work better more times than any meticulous & thorough stable bottom ->up approach like merge sort. But

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio