No key-comparison sorting algorithm - sorting

On this webpage I can read:
A few special case algorithms (one example is mentioned in
Programming Pearls) can sort certain data sets faster than
O(n*log(n)). These algorithms are not based on comparing the items
being sorted and rely on tricks. It has been shown that no
key-comparison algorithm can perform better than O(n*log(n)).
It's the first time I've heard about non-comparison algorithms. Could anybody give me an example of one of those algorithms and explain how they solve the sorting problem faster than O(n log(n))? What kind of tricks is the author of that webpage talking about?
Any links to papers or other good sources are welcome. Thank you.

First, let's get the terminology straight:
Key comparison algorithms can't do better than O(n logn).
There exist other -- non-comparison -- algorithms that, given certain assumptions about the data, can do better than O(n logn). Bucket sort is one such example.
To give an intuitive example of the second class, let's say you know that your input array consists entirely of zeroes and ones. You could iterate over the array, counting the number of zeroes and ones. Let's call the final counts n0 and n1. You then iterate over the output array, writing out n0 zeroes followed by n1 ones. This is an O(n) sorting algorithm.
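The counting idea described above can be sketched in a few lines of Python. This is only an illustration; the function name `sort_zeros_and_ones` is made up for this example, and the input is assumed to contain nothing but 0 and 1.

```python
def sort_zeros_and_ones(bits):
    """Sort a list that is assumed to contain only 0s and 1s, in O(n) time."""
    n0 = 0
    n1 = 0
    for b in bits:          # one pass to count; no two elements are compared
        if b == 0:
            n0 += 1
        else:
            n1 += 1
    # Write out n0 zeroes followed by n1 ones.
    return [0] * n0 + [1] * n1


result = sort_zeros_and_ones([1, 0, 1, 1, 0, 0, 1])
```

Note that the code never asks which of two elements comes first; it only inspects each value on its own.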
It has been possible to come up with a linear-time algorithm for this problem only because we exploit the special structure of the data. This is in contrast to key comparison algorithms, which are general-purpose. Such algorithms don't need to know anything about the data, except for one thing: they need to know how to compare the sorting keys of any two elements. In other words, given any two elements, they need to know which should come first in the sorted array.
The price of being able to sort anything in any way imaginable using just one algorithm is that no such algorithm can hope to do better than O(n logn) on average.

Yes, non-comparison sorting usually takes O(n) time, given suitable assumptions about the keys. Examples of such sorting algorithms are bucket sort and radix sort.
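As a rough illustration of radix sort, here is a minimal least-significant-digit sketch in Python. It assumes the keys are non-negative integers; the name `radix_sort` and the `base` parameter are choices made for this example.

```python
def radix_sort(numbers, base=10):
    """LSD radix sort for non-negative integers (an assumption of this sketch)."""
    result = list(numbers)
    if not result:
        return result
    largest = max(result)
    place = 1
    while place <= largest:
        # Distribute by the current digit; appending keeps each pass stable.
        buckets = [[] for _ in range(base)]
        for x in result:
            buckets[(x // place) % base].append(x)
        result = [x for bucket in buckets for x in bucket]
        place *= base
    return result


sorted_numbers = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```

The running time is O(d * (n + base)) for keys with d digits, so it is linear in n only when the number of digits is bounded.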

Related

What's 1.5n comparison?

I'm beginning to learn Data Structure and Algorithms with UCSD's MOOC.
For the second problem, they ask us to implement an algorithm to find the two highest values in an array.
As an additional problem, they add the following exercise:
Exercise Break. Find two largest elements in an array in 1.5n comparisons.
I don't know exactly what 1.5n comparisons means. I've searched on Google but couldn't find an explanation of comparisons in algorithms.
Is there a site with some examples of comparisons?
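It is talking about the number of comparisons between elements that the algorithm performs, which is one way to measure its complexity.
You have to give an algorithm that makes at most about 1.5n (that is, 3n/2) comparisons in the worst case.
Just as an example, the bubble sort algorithm makes O(n*n) comparisons in the worst case.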
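One possible way to stay within 1.5n comparisons is to process the array in pairs and spend three comparisons per pair. The sketch below is only an illustration and not necessarily the solution the course has in mind; the name `two_largest` and the explicit comparison counter are assumptions made for this example.

```python
def two_largest(values):
    """Return (largest, second_largest, comparisons) for a list of >= 2 items."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    comparisons = 1
    if values[0] >= values[1]:
        first, second = values[0], values[1]
    else:
        first, second = values[1], values[0]
    i = 2
    while i + 1 < len(values):
        a, b = values[i], values[i + 1]
        comparisons += 1                      # order the pair
        hi, lo = (a, b) if a >= b else (b, a)
        comparisons += 1                      # pair winner vs. current maximum
        if hi > first:
            comparisons += 1                  # old maximum vs. pair loser
            second = first if first >= lo else lo
            first = hi
        else:
            comparisons += 1                  # pair winner vs. current runner-up
            if hi > second:
                second = hi
        i += 2
    if i < len(values):                       # one leftover element when n is odd
        x = values[i]
        comparisons += 1
        if x > first:
            first, second = x, first
        else:
            comparisons += 1
            if x > second:
                second = x
    return first, second, comparisons


largest, runner_up, used = two_largest([3, 9, 4, 7, 1, 8, 2, 6])
```

For n = 8 this uses 1 + 3 * 3 = 10 comparisons, which is below 1.5 * 8 = 12.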

Why is insertion sort "decrease and conquer" whereas selection sort is "brute force"?

Why are insertion sort algorithms not also considered brute-force algorithms? Don't they systematically look through every value of the array as well? I understand that selection sort has a worse best-case time complexity, but I'm still not fully understanding the difference between a brute-force algorithm and a decrease-and-conquer algorithm. Sorry if this is a stupid question. Thanks!
Both insertion and selection sort might be called “decrease and conquer” because at every step of the outer loop they deal with a smaller and smaller unsorted part of the array.
Formally speaking, the loop invariant in these sorts is that the subarray A[0 to i-1] is always sorted.
I have not met the term “brute force” applied to sorting algorithms; it looks like nonsense. Brute-force algorithms usually check all possible variants and choose the best one. It is possible to generate all permutations of an array or list and choose the sorted one; this kind of sorting might be called "stupid sort" or something like that.
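To make the loop invariant concrete, here is a minimal insertion sort sketch in Python. It is a textbook-style illustration; the function name `insertion_sort` is chosen for this example.

```python
def insertion_sort(a):
    """Sort the list a in place and return it."""
    for i in range(1, len(a)):
        # Invariant: a[0..i-1] is already sorted at this point.
        x = a[i]
        j = i
        # Shift larger elements one slot to the right to make room for x.
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
        # Now a[0..i] is sorted, so the unsorted part has shrunk by one.
    return a


data = insertion_sort([5, 2, 4, 6, 1, 3])
```

Each pass of the outer loop reduces the problem to a smaller unsorted suffix, which is the "decrease" in decrease and conquer.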

Is my analysis for identifying the best sorting algorithm to solve this task correct?

This was an interview question and I am wondering if my analysis was correct:
A 'magic select' function returns the 'mth' smallest value in an array of size n. The task was to sort the 'm' smallest elements in ascending order using an efficient algorithm. My analysis was to first use the 'magic select' function to get the 'mth' smallest value. I then used a partition function with that value as the pivot to get all smaller elements on the left. After that point, I felt that a bucket sort should accomplish the task of sorting the left part efficiently.
I was just wondering if this was the best way to sort the 'm' smallest elements. I see the possibility of a quick sort being used here too. However, I thought that avoiding a comparison based sorting could lead to an O(n). Could radix sort or heap sort (O(nlogn)) be used for this too? If I didn't do it in the best possible way, which could be the best possible way to accomplish this? An array was the data structure I was allowed to use.
Many thanks!
I'm pretty sure you can't do any better than any standard algorithm for selecting the k lowest elements out of an array in sorted order. The time complexity of your "magic machine" is O(n), which is the same time complexity you'd get from a standard selection algorithm like the median-of-medians algorithm or quickselect.
Consequently, your approaches seem very reasonable. I doubt you can do any better asymptotically.
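The select, partition, sort pipeline from the question could be sketched as follows. Everything here is hypothetical: `magic_select` is a placeholder that simply sorts (so it is not O(n)), standing in for the interview's function, and `smallest_m_sorted` is a name invented for this example.

```python
def magic_select(values, m):
    """Placeholder for the interview's O(n) selection; this stand-in just sorts."""
    return sorted(values)[m - 1]


def smallest_m_sorted(values, m, select=magic_select):
    """Return the m smallest values in ascending order."""
    if not 1 <= m <= len(values):
        raise ValueError("m must be between 1 and len(values)")
    pivot = select(values, m)                      # m-th smallest value
    smaller = [v for v in values if v < pivot]     # partition step, O(n)
    ties = [pivot] * (m - len(smaller))            # copies of the pivot needed to reach m items
    return sorted(smaller) + ties                  # O(m log m) for the final sort


top3 = smallest_m_sorted([7, 2, 9, 2, 5, 1, 8], 3)
```

With a true linear-time selection, the total cost would be O(n + m log m) using a comparison sort on the m survivors.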
Hope this helps!

What sorting techniques can I use when comparing elements is expensive?

Problem
I have an application where I want to sort an array a of elements a[0], a[1], ..., a[n-1]. I have a comparison function cmp(i,j) that compares elements a[i] and a[j], and a swap function swap(i,j) that swaps elements a[i] and a[j] of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than all other steps in the sort combined (except for other cmp(i,j) calls, of course). You may think of cmp(i,j) as a rather lengthy IO operation.
Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.
Questions
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I'd like pointers to further material on this topic.
Example
This is an example that is not entirely unlike the application I have.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.
Of course, reading in large amounts of data is expensive; therefore one can, for instance, read only the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold: two large files might differ in only one byte near the end.
The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is necessary.
I'll try to answer each question as best as I can.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Traditional sorting methods may have some variation, but in general, there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are often not inexpensive. You could try sorting by something else, or try using a faster shortcut that approximates the real solution.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
I'd like pointers to further material on this topic.
Check these out:
Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
Wikipedia of course has some insight into the matter.
Sorting an array with minimal number of comparisons
How do I figure out the minimum number of swaps to sort a list in-place?
Limitations of comparison based sorting techniques
The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg (n!), which is about n lg n - 1.44 n. There's no way to do better than this on average if you're using comparisons to order the elements.
Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.39 n lg n on average for quicksort and about n lg n + 2n for heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.
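To get a feel for these numbers, the following sketch computes the information-theoretic bound lg(n!) and counts the comparisons a plain top-down mergesort actually makes. The function names `comparison_lower_bound` and `merge_sort_counting` are made up for this illustration.

```python
import math


def comparison_lower_bound(n):
    """ceil(lg(n!)): comparisons any comparison sort needs in the worst case."""
    # Uses floating point, so treat the result as approximate for very large n.
    return math.ceil(math.log2(math.factorial(n)))


def merge_sort_counting(items):
    """Top-down mergesort; returns (sorted_list, number_of_comparisons)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, c_left = merge_sort_counting(items[:mid])
    right, c_right = merge_sort_counting(items[mid:])
    merged = []
    comparisons = c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1                 # the only place elements are compared
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


ordered, used = merge_sort_counting([5, 2, 9, 1, 7, 3, 8, 6])
bound = comparison_lower_bound(8)
```

On a particular input mergesort may use fewer comparisons than `bound`, since the bound only limits the worst case (and, in a slightly weaker form, the average).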
If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.
Hope this helps!
A technique called the Schwartzian transform can be used to reduce any sorting problem to that of sorting integers. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.
(Python-oriented answer, when I thought the question was tagged [python])
If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using
sorted(L, key=f)
Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.
Python 3 does not support specifying a cmp function, only the key parameter. This page provides a way of easily converting any cmp function to a key function.
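As a small illustration of both points, the sketch below counts how often a key function is called by `sorted`, and uses the standard library's `functools.cmp_to_key` to adapt a cmp-style function. The names `slow_key` and `compare` and the sample data are assumptions made for this example.

```python
import functools

key_calls = 0


def slow_key(word):
    """Stand-in for an expensive key computation; counts its own calls."""
    global key_calls
    key_calls += 1
    return (len(word), word)


words = ["pear", "fig", "banana", "kiwi", "apple"]
by_key = sorted(words, key=slow_key)      # key computed once per element


def compare(a, b):
    """Old-style cmp function: negative, zero, or positive."""
    by_length = (len(a) > len(b)) - (len(a) < len(b))
    return by_length or (a > b) - (a < b)


by_cmp = sorted(words, key=functools.cmp_to_key(compare))
```

Note that `cmp_to_key` only adapts the interface: the underlying cmp function is still invoked for every comparison the sort performs, so it does not reduce the number of expensive comparisons.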
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
Not that I know of, but perhaps you'll find it in these papers below.
I'd like pointers to further material on this topic.
On Optimal and Efficient in Place Merging
Stable Minimum Storage Merging by Symmetric Comparisons
Optimal Stable Merging (this one seems to be O(n log^2 n) though)
Practical In-Place Mergesort
If you implement any of them, posting them here might be useful for others too! :)
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
The merge insertion algorithm, described in D. Knuth's "The Art of Computer Programming", Vol. 3, chapter 5.3.1, uses fewer comparisons than other comparison-based algorithms. But it still needs O(N log N) comparisons.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I think some of existing sorting algorithms may be modified to take into account expensive(i,j) predicate. Let's take the simplest of them - insertion sort. One of its variants, named in Wikipedia as binary insertion sort, uses only O(N log N) comparisons.
It employs a binary search to determine the correct location to insert new elements. We could apply expensive(i,j) predicate after each binary search step to determine if it is cheap to compare the inserted element with "middle" element found in binary search step. If it is expensive we could try the "middle" element's neighbors, then their neighbors, etc. If no cheap comparisons could be found we just return to the "middle" element and perform expensive comparison.
There are several possible optimizations. If the predicate and/or cheap comparisons are not so cheap, we could roll back to the "middle" element before all other possibilities have been tried. Also, if move operations cannot be considered very cheap, we could use some order statistics data structure (like an indexable skiplist) to reduce insertion cost to O(N log N).
This modified insertion sort needs O(N log N) time for data movement, O(N^2) predicate computations and cheap comparisons, and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
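A minimal sketch of the modified binary insertion sort described above might look like this. It uses a plain Python list (so data movement is O(N^2), not the skiplist variant), and the names `insertion_sort_avoiding_expensive`, `less`, and `expensive` are hypothetical stand-ins for the question's cmp and expensive functions, taking elements rather than indices.

```python
def insertion_sort_avoiding_expensive(items, less, expensive):
    """Binary insertion sort that prefers probes for which expensive() is False."""
    result = []
    for x in items:
        lo, hi = 0, len(result)
        while lo < hi:
            mid = (lo + hi) // 2
            probe = mid
            if expensive(x, result[mid]):
                # Look at neighbours of mid, nearest first, for a cheap probe.
                for offset in range(1, hi - lo):
                    for cand in (mid - offset, mid + offset):
                        if lo <= cand < hi and not expensive(x, result[cand]):
                            probe = cand
                            break
                    else:
                        continue      # no cheap candidate at this distance
                    break             # found a cheap probe
                # If none was found, probe is still mid (expensive comparison).
            if less(x, result[probe]):
                hi = probe            # x belongs before result[probe]
            else:
                lo = probe + 1        # x belongs after result[probe]
        result.insert(lo, x)
    return result


data = [8, 3, 5, 1, 9, 2, 7, 4, 6]
# Toy predicate for the demo: values of equal parity are "expensive" to compare.
ordered = insertion_sort_avoiding_expensive(
    data, lambda a, b: a < b, lambda a, b: a % 2 == b % 2
)
```

Any probe inside the current search window keeps the binary search correct; choosing one away from the middle only makes the window shrink less evenly.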
Consider a set of possibly large files. In this application the goal is to find duplicate files among them.
If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on hash value computed for first megabyte of data from each file. If there are more than one file in some bucket, take other 10, 100, 1000, ... megabytes. If still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.
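The bucketing procedure can be sketched with in-memory byte strings standing in for files. This is only an illustration: the names `find_duplicates` and `split`, the tiny prefix sizes, and the use of SHA-256 are assumptions for the example; a real tool would read file prefixes from disk instead.

```python
import hashlib
from collections import defaultdict


def split(groups, key):
    """Split each group by key(name); keep only sub-groups with 2+ members."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for name in group:
            buckets[key(name)].append(name)
        out.extend(sorted(g) for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(blobs, prefix_sizes=(4, 40, 400)):
    """blobs maps a name to its bytes; returns groups of names with equal content."""
    groups = [list(blobs)]
    for size in prefix_sizes:
        # Hash a growing prefix; singletons drop out and are never read further.
        groups = split(groups, lambda n: hashlib.sha256(blobs[n][:size]).digest())
    # Final exact comparison of the full contents for the survivors.
    groups = split(groups, lambda n: blobs[n])
    return sorted(groups)


files = {
    "a": b"hello world",
    "b": b"hello world",
    "c": b"hello there",
    "d": b"xyz",
}
duplicates = find_duplicates(files)
```

Only files that still share a bucket after each round are examined again, which is what keeps the expensive full comparisons rare.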
Most sorting algorithms out there try to minimize the number of comparisons during sorting.
My advice:
Pick quick-sort as a base algorithm and memoize the results of comparisons in case you happen to compare the same elements again. This should help you in the O(N^2) worst case of quick-sort. Bear in mind that this will make you use O(N^2) memory.
Now if you are really adventurous you could try the Dual-Pivot quick-sort.
Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.
We can also look at your problem from another direction. It seems your problem is IO-related, so you can take advantage of parallel sorting algorithms: you can run many threads that perform comparisons on files, and then sort them with one of the best-known parallel algorithms, such as sample sort.
Quicksort and mergesort are among the fastest possible general-purpose sorting algorithms, unless you have some additional information about the elements you want to sort. They need O(n log(n)) comparisons (on average, in the case of quicksort), where n is the size of your array.
It is mathematically proven that no generic comparison-based sorting algorithm can be more efficient than that.
If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).
If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.
You can look up all the mentioned algorithms on Wikipedia.
As far as I know, you can't benefit from the expensive relationship. Even if you know that, you still need to perform such comparisons. As I said, you'd better try and cache some results.
EDIT I took some time to think about it, and I came up with a slightly customized solution, that I think will make the minimum possible amount of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
n is the size of the input vector
m is the number of distinct components which are easy to compare with each other
k is the maximum number of elements which are hard to compare and have consecutive ranks.
Here is the description of the algorithm. It's worth noting that it will perform much worse than a simple merge sort unless m is big and k is small. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out of the asymptotic notation). That n^4 can probably be further reduced, at least in the average case.
EDIT The file I posted contained some errors, which I fixed while trying it out (I had overlooked the pseudocode for the insert_sorted function, but the idea was correct). I made a Java program that sorts a vector of integers, with delays added as you described. Although I was skeptical, it actually does better than mergesort if the delay is significant (I used a 1s delay against integer comparison, which usually takes nanoseconds to execute).

Sorting algorithm that runs in time O(n) and also sorts in place

Is there any sorting algorithm which has running time of O(n) and also sorts in place?
There are a few where the best case scenario is O(n), but it's probably because the collection of items is already sorted. You're looking at O(n log n) on average for some of the better ones.
With that said, the Wiki on sorting algorithms is quite good. There's a table that compares popular algorithms, stating their complexity, memory requirements (indicating whether the algorithm might be "in place"), and whether they leave equal value elements in their original order ("stability").
http://en.wikipedia.org/wiki/Sorting_algorithm
Here's a little more interesting look at performance, provided by this table (from the above Wiki):
http://en.wikipedia.org/wiki/File:SortingAlgoComp.png
Some will obviously be easier to implement than others, but I'm guessing that the ones worth implementing have already been implemented in a library of your choosing.
No.
There's a proven lower bound of Ω(n log n) comparisons for general comparison-based sorting.
Radix sort is based on knowing the numeric range of the data, but the in-place radix sorts mentioned here in practice require multiple passes for real-world data.
Radix Sort can do that:
http://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations
Depends on the input and the problem. For example, 1...n numbers can be sorted in O(n) in place.
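For instance, if the array is known to be a permutation of 1..n, each value tells you its own final position, so it can be sorted in place with swaps. A minimal sketch, with the function name `sort_permutation_in_place` chosen for this example:

```python
def sort_permutation_in_place(a):
    """Sort a list assumed to be a permutation of 1..n, in place, in O(n)."""
    i = 0
    while i < len(a):
        target = a[i] - 1            # value v belongs at index v - 1
        if target != i:
            # Each swap puts a[i] into its final slot, so at most n - 1 swaps occur.
            a[i], a[target] = a[target], a[i]
        else:
            i += 1
    return a


numbers = sort_permutation_in_place([3, 1, 5, 2, 4])
```

This uses O(1) extra memory and never compares two elements with each other; it relies entirely on the assumption about the input.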
Spaghetti sort is O(n), though arguably not in-place. Also, it's analog only.

Resources