Efficiency of Sort Algorithms

I am studying up for a pretty important interview tomorrow and there is one thing that I have a great deal of trouble with: sorting algorithms and their Big-O efficiencies.
What number is important to know? The best, worst, or average efficiency?

Worst, followed by average. Be aware of the real-world impact of the so-called "hidden constants" too - for instance, the classic quicksort algorithm is O(n^2) in the worst case and O(n log n) on average, whereas mergesort is O(n log n) even in the worst case, yet quicksort will typically outperform mergesort in practice.

All of them are important to know, of course. You have to understand that the benefits of one sorting algorithm in the average case can become a terrible deficit in the worst case, or that the worst case isn't that bad but the best case isn't that good and it only works well on unsorted data, and so on.

In short:
Sorting algorithm efficiency varies with the input data and the task.
The maximum speed achievable by a general comparison sort is n*log(n).
If the data contains already-sorted runs, sorting can go faster than n*log(n).
If the data consists largely of duplicates, sorting can be done in near-linear time.
Most sorting algorithms have their uses.
Most quicksort variants also have an average case of n*log(n), but they are usually faster than other, not heavily optimized algorithms. Quicksort is fastest when it is not stable, but stable variants are only a fraction slower. Its main problem is the worst case; the usual fix is introsort (see the sketch after this list).
Most merge sort variants have their best, average and worst cases fixed at n*log(n). Merge sort is stable and relatively easy to scale up, BUT it needs auxiliary storage (or a binary-tree emulation) proportional to the total number of items. Its main problem is memory; the usual fix is timsort.
Sorting algorithms also vary by input size. I can make a newbie claim that, for inputs over about 10 TB, there is no match for merge sort variants.
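To make the introsort remark concrete, here is a rough Java sketch of the idea (class and method names are my own, not from any library): run an ordinary quicksort, but track the recursion depth and fall back to an in-place heapsort when it gets too deep, which caps the worst case at O(n log n).

    import java.util.Arrays;

    public class IntroSortSketch {
        public static void sort(int[] a) {
            // Depth limit ~ 2*floor(log2(n)); deeper than this and we assume quicksort is degenerating.
            int depthLimit = 2 * (31 - Integer.numberOfLeadingZeros(Math.max(a.length, 1)));
            introSort(a, 0, a.length - 1, depthLimit);
        }

        private static void introSort(int[] a, int lo, int hi, int depthLimit) {
            if (lo >= hi) return;
            if (depthLimit == 0) {          // quicksort is misbehaving: switch to heapsort
                heapSort(a, lo, hi);
                return;
            }
            int p = partition(a, lo, hi);   // plain Lomuto partition, last element as pivot
            introSort(a, lo, p - 1, depthLimit - 1);
            introSort(a, p + 1, hi, depthLimit - 1);
        }

        private static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi], i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) swap(a, i++, j);
            }
            swap(a, i, hi);
            return i;
        }

        // In-place heapsort restricted to a[lo..hi]: O(n log n) worst case, no extra memory.
        private static void heapSort(int[] a, int lo, int hi) {
            int n = hi - lo + 1;
            for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, lo, i, n);
            for (int end = n - 1; end > 0; end--) {
                swap(a, lo, lo + end);
                siftDown(a, lo, 0, end);
            }
        }

        private static void siftDown(int[] a, int lo, int root, int size) {
            while (true) {
                int child = 2 * root + 1;
                if (child >= size) return;
                if (child + 1 < size && a[lo + child + 1] > a[lo + child]) child++;
                if (a[lo + root] >= a[lo + child]) return;
                swap(a, lo + root, lo + child);
                root = child;
            }
        }

        private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

        public static void main(String[] args) {
            int[] data = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
            sort(data);
            System.out.println(Arrays.toString(data)); // [0, 1, 2, ..., 9]
        }
    }

A production introsort (for example, the one behind many C++ standard library sorts) also switches to insertion sort for tiny ranges, which this sketch omits for brevity.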

I recommend that you don't merely memorize these factoids. Learn why they are what they are. If I were interviewing you, I would make sure to ask questions that show me that you understand how to analyze an algorithm, not just spit back something you saw on a web page or in a book. Additionally, the day before an interview is not the time to be doing this studying.
I wish you the best of luck!! Please report back in a comment how it went!

I just finished one set of interviews at my college...
Every algorithm has its benefits, otherwise it wouldn't exist.
So it's better to understand what is so good about the algorithm that you are studying. Where does it do well? How can it be improved?
I guess you'll automatically need to read up on the various efficiency notations when you do this. Mind the worst case and pay attention to the average case; best cases are rare in practice.
All the best for your interview.

You may also want to look into other types of sorting that can be used when certain conditions exist. For example, consider Radix sort. http://en.wikipedia.org/wiki/Radix_sort
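As a taste of how a non-comparison sort works, here is a small LSD radix sort sketch in Java for non-negative ints (my own code and naming, not taken from the linked article): it makes one counting pass per byte, so the cost is about four passes over the data no matter how large n grows.

    import java.util.Arrays;

    public class RadixSortSketch {
        // LSD radix sort over 8-bit "digits"; assumes all values are non-negative.
        public static void sort(int[] a) {
            int[] buffer = new int[a.length];
            for (int shift = 0; shift < 32; shift += 8) {
                int[] count = new int[257];
                for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;       // histogram of this byte
                for (int i = 0; i < 256; i++) count[i + 1] += count[i];    // prefix sums = start offsets
                for (int v : a) buffer[count[(v >>> shift) & 0xFF]++] = v; // stable scatter by digit
                System.arraycopy(buffer, 0, a, 0, a.length);
            }
        }

        public static void main(String[] args) {
            int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
            sort(data);
            System.out.println(Arrays.toString(data)); // [2, 24, 45, 66, 75, 90, 170, 802]
        }
    }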

Related

Algorithmic complexity vs real life situations?

My question is about theory vs practice thing.
Let’s say for example that I want to sort a list of numbers. Mergesort has a complexity of O(n*logn) while bubblesort has a complexity of O(n^2).
This means that mergesort is quicker. But complexity doesn't take into account everything that happens on a computer. What I mean is that mergesort, for example, is a divide-and-conquer algorithm and needs more space than bubblesort.
So isn't it possible that allocating this additional space and the extra use of resources (time to transfer the data, load the code instructions, etc.) takes more time than bubblesort, which doesn't use any additional space?
Wouldn't it be possible for an algorithm with worse ("bigger") complexity to be more efficient than another for certain input lengths (maybe small ones)?
The answer is a clear yes.
A classic example is that insertion sort is O(n^2). However, efficient sorting implementations often switch to insertion sort when something like 100 elements are left, because insertion sort makes really good use of the cache and avoids pipeline stalls in the CPU. No, insertion sort won't scale, but on small inputs it outperforms the asymptotically faster algorithms.
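For reference, a plain insertion sort looks something like this in Java (my own sketch); the inner loop touches adjacent elements in sequential memory order and has a highly predictable branch, which is exactly the cache and pipeline behaviour described above.

    import java.util.Arrays;

    public class InsertionSortSketch {
        public static void sort(int[] a) {
            for (int i = 1; i < a.length; i++) {
                int key = a[i];
                int j = i - 1;
                // Shift larger elements one slot to the right; purely sequential memory access.
                while (j >= 0 && a[j] > key) {
                    a[j + 1] = a[j];
                    j--;
                }
                a[j + 1] = key;
            }
        }

        public static void main(String[] args) {
            int[] data = {9, 1, 8, 2, 7, 3};
            sort(data);
            System.out.println(Arrays.toString(data)); // [1, 2, 3, 7, 8, 9]
        }
    }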
The way that I put it is that scalability is like a Mack Truck. You want it for a big load, but it might not be the best thing to take for a shopping trip at the local grocery store.
Algorithmic complexity only tells you how two algorithms will compare as their input grows larger, i.e. approaches infinity. It tells you nothing about how they will compare on smaller inputs. The only way to know that for sure is to benchmark on data and equipment that represents a typical situation.

When should one implement a simple or advanced sorting algorithm?

Apart from the obvious "it's faster when there are many elements", when is it more appropriate to use a simple sorting algorithm (O(N^2)) compared to an advanced one (O(N log N))?
I've read quite a bit about, for example, insertion sort being preferred when you've got a small array that's nearly sorted, because you get the best case, O(N). Why is it not good to use quicksort, for example, when you've got say 20 elements? Not just insertion or quicksort specifically, but when and why is a simpler algorithm useful compared to an advanced one?
EDIT: If we're working with, for example, an array, does it matter what kind of data we have, such as objects or primitive types (Integer)?
The big-oh notation captures the runtime cost of the algorithm for large values of N. It is less effective at measuring the runtime of the algorithm for small values.
The actual transition from one algorithm to another is not a trivial thing. For large N, the effects of N really dominate. For small numbers, more complex effects become very important. For example, some algorithms have better cache coherency. Others are best when you know something about the data (like your example of insertion sort when the data is nearly sorted).
The balance also changes over time. In the past, CPU speeds and memory speeds were closer together. Cache coherency issues were less of an issue. In modern times, CPU speeds have generally left memory busses behind, so cache coherency is more important.
So there's no clear-cut answer to when you should use one algorithm over another. The only reliable answer is to profile your code and see.
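As a very rough illustration of "profile and see" (and of the objects-vs-primitives edit in the question above), here is a crude timing sketch using only the Java standard library. Note that naive JVM micro-benchmarks like this are easily distorted by JIT warm-up and garbage collection, so treat the numbers as indicative only.

    import java.util.Arrays;
    import java.util.Random;

    public class SortTimingSketch {
        public static void main(String[] args) {
            int n = 1_000_000;
            Random rnd = new Random(42);

            int[] primitives = rnd.ints(n).toArray();
            Integer[] boxed = Arrays.stream(primitives).boxed().toArray(Integer[]::new);

            long t0 = System.nanoTime();
            Arrays.sort(primitives);   // tuned dual-pivot quicksort on primitives
            long t1 = System.nanoTime();
            Arrays.sort(boxed);        // Timsort on objects: extra indirection, but stable
            long t2 = System.nanoTime();

            System.out.printf("int[]     : %d ms%n", (t1 - t0) / 1_000_000);
            System.out.printf("Integer[] : %d ms%n", (t2 - t1) / 1_000_000);
        }
    }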
For amusement: I was looking at the dynamic disjoint forest problem a few years back. I came across a state-of-the-art paper that permitted some operations to be done in something silly like O(log log N / log^4N). They did some truly brilliant math to get there, but there was a catch. The operations were so expensive that, for my graphs of 50-100 nodes, it was far slower than the O(n log n) solution that I eventually used. The paper's solution was far more important for people operating on graphs of 500,000+ nodes.
When programming sorting algorithms, you have to take into account how much work goes into implementing the algorithm versus its actual speed. For large inputs, the time spent implementing an advanced algorithm is outweighed by the decreased time taken to sort. For small inputs, such as 20-100 items, the difference is minimal, so taking the simpler route is much better.
First of all, the O(N^2) bound quoted for insertion sort describes its worst-case scenario. If the array is nearly sorted, its execution time can be close to linear, so it can beat quicksort in that case.
When n is small enough, we take other aspects into consideration. Algorithms such as quicksort can be slower because of all the recursive calls; the function-call overhead can end up costing more than the simple arithmetic operations required by insertion sort. And that's not to mention the additional stack space that recursive algorithms require.
More than 99% of the time, you should not be implementing a sorting algorithm at all.
Instead use a standard sorting algorithm from your language's standard library. In one line of code you get to use a tested and optimized implementation which is O(n log(n)). It likely implements tricks you wouldn't have thought of.
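In Java, for example, that one line looks something like this (the data here is purely illustrative):

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    public class LibrarySortExample {
        public static void main(String[] args) {
            int[] numbers = {5, 2, 9, 1};
            Arrays.sort(numbers);                                   // tuned dual-pivot quicksort for primitives

            List<String> names = Arrays.asList("Carol", "alice", "Bob");
            names.sort(Comparator.comparing(String::toLowerCase));  // stable Timsort under the hood

            System.out.println(Arrays.toString(numbers));           // [1, 2, 5, 9]
            System.out.println(names);                              // [alice, Bob, Carol]
        }
    }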
For external sorts, I've used the Unix sort utility from time to time. Aside from the non-intuitive LC_ALL=C environment variable that I need to get it to behave, it is very useful.
In the rare cases where you actually need to implement your own sorting algorithm, what you implement will be driven by your precise needs. I've had to deal with this exactly once for production code in two decades of programming. (That was because, for a complex series of reasons, I needed to sort compressed data on a machine which literally did not have enough disk space to store that data uncompressed. I used a merge sort.)

Optimizing Mergesort

Merge sort is a fairly common sorting algorithm, and I have written a working implementation. Now I want to optimize it. The first step was to convert it from a recursive algorithm to an iterative one, which I did. Then I couldn't discern what else could be optimized. After poring through lots of articles on the internet, I found two mechanisms: multi-merge sort and tiled merge sort. However, none of the documents provided any pseudo-code, or even cared to explain how to implement them and how they offer the advantages their authors claim, like being cache-friendly and having improved locality.
Can anyone elaborate on this matter, and if possible, provide some pseudo-code? Specifically, I want to know how to make it cache-friendly. I have absolutely no idea about what these things are, otherwise I would have tried it myself.
One common and relatively straightforward optimization you can make is to switch from mergesort to another algorithm like insertion sort when the subarray sizes get below a certain threshold. Although mergesort runs in time O(n log n), that talks about its long-term growth rate and doesn't say anything about how well the algorithm will perform on small inputs. Insertion sort, for example, runs pretty fast on small input sizes even though it's worse in the long run. Consequently, consider changing the base case of your mergesort so that if the array to sort is below a certain size threshold (say, 50-100), you use insertion sort rather than continuing onward in the recursion. From experience, this can markedly improve the performance of the algorithm.
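Here is a minimal Java sketch of that hybrid (my own code and naming; the cutoff of 64 is just a placeholder you would tune by benchmarking on your own hardware). It also skips the merge when the two halves are already in order, another cheap win on partially sorted data.

    import java.util.Arrays;

    public class HybridMergeSortSketch {
        private static final int CUTOFF = 64;   // below this size, fall back to insertion sort

        public static void sort(int[] a) {
            mergeSort(a, new int[a.length], 0, a.length - 1);
        }

        private static void mergeSort(int[] a, int[] aux, int lo, int hi) {
            if (hi - lo + 1 <= CUTOFF) {        // small subarray: insertion sort is faster here
                insertionSort(a, lo, hi);
                return;
            }
            int mid = lo + (hi - lo) / 2;
            mergeSort(a, aux, lo, mid);
            mergeSort(a, aux, mid + 1, hi);
            if (a[mid] <= a[mid + 1]) return;   // halves already in order: skip the merge entirely
            merge(a, aux, lo, mid, hi);
        }

        private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
            System.arraycopy(a, lo, aux, lo, hi - lo + 1);
            int i = lo, j = mid + 1;
            for (int k = lo; k <= hi; k++) {
                if (i > mid)              a[k] = aux[j++];
                else if (j > hi)          a[k] = aux[i++];
                else if (aux[j] < aux[i]) a[k] = aux[j++];
                else                      a[k] = aux[i++];   // ties taken from the left keeps it stable
            }
        }

        private static void insertionSort(int[] a, int lo, int hi) {
            for (int i = lo + 1; i <= hi; i++) {
                int key = a[i], j = i - 1;
                while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
                a[j + 1] = key;
            }
        }

        public static void main(String[] args) {
            int[] data = new java.util.Random(1).ints(200, 0, 1000).toArray();
            sort(data);
            System.out.println(Arrays.toString(Arrays.copyOf(data, 10))); // first ten values after sorting
        }
    }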

Do we need to know/find/analyze every case {Best, Average and Worst...all} scenarios of an algorithm?

In books about data structures and algorithms, we often see that they do not analyze every case for every algorithm.
Some algorithms are discussed along with the average case, some with the average and worst cases, and others with all of best, average and worst.
Why do they tend to do that?
Why don't we need to know all cases for all algorithms?
Best case is generally useless unless you control the inputs. (i.e. best case is usually an anomalous case). Unless it's easy to compute, it's not worth wasting your time.
Average case: it's what can you expect in general. Assuming you work with a large range of inputs, this is usually the most useful thing to consider.
Worst case: a decent tie breaker for two algorithms with the same average case, if you deal with arbitrary inputs (especially if they're untrusted - i.e. you're accepting inputs from people on the web). Also something to consider in design in general - this will come up occasionally. In general, if you have two algorithms that are O(n) average case, but one is O(n lg n) worst case, and one is O(n^2) - it may influence your decision on which to go with. Or it may influence your algorithm design.
Example: quicksort vs. merge sort. Both are O(n lg n) on average. Quicksort's worst case is O(n^2), while merge sort's is (IIRC) still O(n lg n) - but in general, quicksort tends to be faster when the data fits in memory. Even though it has a costlier worst case, since we KNOW it has a costlier worst case, we can try to mitigate it (median-of-three pivot selection instead of just a random partition, etc.) and take advantage of the fact that it is usually faster than mergesort.
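A small sketch of that median-of-three mitigation (names are my own): sort the first, middle and last elements among themselves and use the middle one as the pivot, which defuses the classic already-sorted worst case.

    import java.util.Arrays;

    public class MedianOfThreeSketch {
        // Orders a[lo], a[mid], a[hi] among themselves and returns mid, whose value is then
        // the median of the three; a quicksort would partition the range around a[mid].
        static int medianOfThreePivot(int[] a, int lo, int hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] < a[lo]) swap(a, lo, mid);
            if (a[hi]  < a[lo]) swap(a, lo, hi);
            if (a[hi]  < a[mid]) swap(a, mid, hi);
            return mid;
        }

        static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

        public static void main(String[] args) {
            // On already-sorted input, a naive "first element" pivot gives the O(n^2) worst case;
            // median-of-three picks the middle value and keeps the split balanced.
            int[] sorted = {1, 2, 3, 4, 5, 6, 7, 8, 9};
            int p = medianOfThreePivot(sorted, 0, sorted.length - 1);
            System.out.println("pivot index " + p + ", pivot value " + sorted[p]); // pivot value 5
            System.out.println(Arrays.toString(sorted));
        }
    }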
Worst case analysis is usually the most useful way to spend a moderate amount of effort analysing an algorithm. Average case is more complicated, because the average case usually depends on how likely different inputs are, so you have to say what the average case is in terms of what the probability is in terms of different inputs. Best case is not very useful because a hope that my program might complete in 1 second does not allow me to plan some activity that will take too long if my program actually takes 3 hours. Knowing that, no matter what, it will complete in five minutes, is much more useful.
Best case also has a problem with programs that store a small number of pre-prepared inputs and outputs, then check the input they get against the pre-prepared inputs. If they get a match, then they respond with the pre-prepared output without doing anything else and so get great - but meaningless - best case behaviour.
There are some cases where worst case analysis is not what you want. Somebody designing an encryption algorithm might want a guarantee that nobody can break it in less than 10 years (unfortunately, typically such guarantees don't exist).

Is selection sort an efficient algorithm?

I know it's a quadratic time algorithm, but how does it compare to other sorting algorithms, such as quicksort or bubble sort?
Sorting algorithms generally vary based on the nature of the data you have.
However, while bubble sort and selection sort are easy to understand (and implement), their running time is O(n^2), which is about as slow as a reasonable sort gets.
As far as quicksort is concerned, on average it takes O(n log n) time, hence it is an excellent sort. However, it too can take O(n^2) in certain cases.
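For concreteness, a textbook selection sort looks like this (my own sketch); the inner scan always walks the entire unsorted suffix, so it performs about n^2/2 comparisons even when the input is already sorted.

    import java.util.Arrays;

    public class SelectionSortSketch {
        public static void sort(int[] a) {
            for (int i = 0; i < a.length - 1; i++) {
                int min = i;
                // This scan runs in full no matter how the data is arranged,
                // which is why selection sort is quadratic even on sorted input.
                for (int j = i + 1; j < a.length; j++) {
                    if (a[j] < a[min]) min = j;
                }
                int t = a[i]; a[i] = a[min]; a[min] = t;
            }
        }

        public static void main(String[] args) {
            int[] data = {4, 2, 7, 1, 3};
            sort(data);
            System.out.println(Arrays.toString(data)); // [1, 2, 3, 4, 7]
        }
    }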
Quadratic-time algorithms, depending on the size of your data set, can be unbelievably slow.
Take n = 10^79 (about the number of atoms in the universe).
For a quadratic algorithm, that's n * 10^79 operations. For an n log(n) algorithm, like quicksort or mergesort, it's roughly n * 262 (using log base 2). That's a huge difference.
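Worked out explicitly (taking log base 2 and rounding):

    n^2         = (10^79)^2    = 10^158
    n * log2(n) ≈ 10^79 * 262  ≈ 2.6 * 10^81

so the quadratic algorithm does roughly 4 * 10^76 times as much work.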
But if your dataset is relatively small (< 1000 items, say), then the performance difference probably isn't going to be noticeable (unless, perhaps, the sort is being done repeatedly). In these cases it's usually best to use the simplest algorithm, and optimize later if it turns out to be too slow.
"Premature optimization is the root of all evil."
-Sir Tony Hoare, popularized by Donald Knuth
Wikipedia knows all.
Selection sort pretty much sucks.
If the data you have consists of only positive integers, you may want to look at bucket sort. The algorithm can have a linear running time, O(n), under the right conditions.
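A minimal counting sort sketch in Java (my own code) - counting sort is the simplest bucket-style sort, with one bucket per value, and runs in O(n + k) time when the values are non-negative integers bounded by some known maximum k:

    import java.util.Arrays;

    public class CountingSortSketch {
        // Sorts non-negative ints in O(n + k), where k is the maximum value.
        // Only worthwhile when k is not much larger than n.
        public static void sort(int[] a) {
            if (a.length == 0) return;
            int max = Arrays.stream(a).max().getAsInt();
            int[] count = new int[max + 1];
            for (int v : a) count[v]++;              // one bucket per value
            int idx = 0;
            for (int v = 0; v <= max; v++) {         // write each value back count[v] times
                for (int c = 0; c < count[v]; c++) a[idx++] = v;
            }
        }

        public static void main(String[] args) {
            int[] data = {5, 3, 5, 1, 0, 9, 3};
            sort(data);
            System.out.println(Arrays.toString(data)); // [0, 1, 3, 3, 5, 5, 9]
        }
    }

If the keys span a huge range, you would fall back to bucket sort proper (one bucket per range of values, each bucket sorted separately) or to radix sort.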
