Selection sort order - algorithm

My book says:
Suppose you have a group of N numbers and would like to determine the kth largest. This is known as the selection problem. Most students who have had a programming course or two would have no difficulty writing a program to solve this problem. There are quite a few “obvious” solutions. One way to solve this problem would be to read the N numbers into an array, sort the array in decreasing order.
It says that it would make sense to sort the array in decreasing order. How does that make sense? If I have an array of {1,9,3,7,4,6} and I want the greatest element, I would sort it in an increasing order so {1,3,4,6,7,9} and then return the last element. Why would the book say in decreasing order?

Because you may not want the largest element, the book says
would like to determine the kth largest
If you sort it in ascending order, how do you know what the, say, 3rd largest number is without first finding out how big the array is?
This would be easier if the array was descending, as the 3rd largest will simply be the 3rd element.

The order itself is not that important, but if you want the k-th largest element, then after sorting in descending order it sits at the k-th position (index k-1 if indices start at 0), whereas after sorting in ascending order it sits at position n-k+1 (index n-k if indices start at 0).
For lazy sorting algorithms (like the ones in Haskell and C#'s LINQ .OrderBy), this can in fact have implications for time complexity. If we implement a lazy selection sort (so a generator), then producing the first k elements runs in O(k×n) instead of O(n²). If we use, for example, a lazy variant of QuickSort, it takes O(n + k log n) to obtain the first k elements.
In a language like Haskell, where laziness is really a key feature, one typically aims not only to minimize the time complexity of producing the entire result, but also that of producing prefixes of the result.
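For concreteness, here is a minimal Java sketch of the descending-sort approach; the class and method names are just for illustration:

```java
import java.util.Arrays;
import java.util.Collections;

public class KthLargest {
    // Returns the k-th largest value (k is 1-based), assuming 1 <= k <= a.length.
    static int kthLargest(Integer[] a, int k) {
        Integer[] copy = Arrays.copyOf(a, a.length);
        Arrays.sort(copy, Collections.reverseOrder()); // sort in decreasing order
        return copy[k - 1];                            // the k-th largest sits at index k-1
    }

    public static void main(String[] args) {
        Integer[] a = {1, 9, 3, 7, 4, 6};
        System.out.println(kthLargest(a, 1)); // 9 (the largest)
        System.out.println(kthLargest(a, 3)); // 6 (the 3rd largest)
    }
}
```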

Quicksort to already sorted array

In this question: https://www.quora.com/What-is-randomized-quicksort
Alejo Hausner said, about the cost of quicksort in the worst case:
Ironically, if you apply quicksort to an array that is already sorted, you will probably get this costly behavior
I don't get it. Can someone explain it to me?
https://www.quora.com/What-will-be-the-complexity-of-quick-sort-if-array-is-already-sorted may be an answer to this, but it did not give me a complete answer.
The Quicksort algorithm is this:
select a pivot
move elements smaller than the pivot to the beginning, and elements larger than pivot to the end
now the array looks like [<=p, <=p, <=p, p, >p, >p, >p]
recursively sort the first and second "halves" of the array
Quicksort will be efficient, with a running time close to n log n, if the pivot always ends up close to the middle of the array. This works perfectly if the pivot is the median value. But selecting the actual median would be costly in itself. If the pivot happens, out of bad luck, to be the smallest or largest element in the array, you'll get an array like this: [p, >p, >p, >p, >p, >p, >p]. If this happens too often, your "quicksort" effectively behaves like selection sort. In that case, since the size of the subarray to be recursively sorted only shrinks by 1 at every step, there will be n levels of recursion, each one costing n operations, so the overall complexity will be n².
Now, since we're not willing to use costly operations to find a good pivot, we might as well pick an element at random. And since we also don't really care about any kind of true randomness, we can just pick an arbitrary element from the array, for instance the first one.
If the array was shuffled uniformly at random, then picking the first element is great. You can reasonably hope it will regularly give you an "average" element. But if the array was already sorted... Then by definition the first element is the smallest. So we're in the bad case where the complexity is n^2.
A simple way to avoid "bad lists" is to pick a true random element instead of an arbitrary element. Or if you have reasons to believe that quicksort will often be called on lists that are almost sorted, you could pick the element in position n/2 instead of the one in position 1.
There are also several research papers about different ways to select the pivot, with precise calculations on the impact on complexity. For instance, you could pick three random elements, rank them from smallest to largest and keep the middle one. But the conclusion usually is: if you try to write a better pivot-selection, then it will also be more costly, and the overall complexity of the algorithm won't be improved that much.
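As a sketch of the random-pivot idea described above (not code from the linked answers; the names are illustrative):

```java
import java.util.Random;

public class RandomPivotQuicksort {
    private static final Random RNG = new Random();

    static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        // Pick a random pivot, move it to the end, then partition as usual.
        int pivotIndex = lo + RNG.nextInt(hi - lo + 1);
        swap(a, pivotIndex, hi);
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);            // pivot lands at its final position i
        quicksort(a, lo, i - 1);   // elements smaller than the pivot
        quicksort(a, i + 1, hi);   // elements greater than or equal to the pivot
    }

    private static void swap(int[] a, int x, int y) {
        int t = a[x]; a[x] = a[y]; a[y] = t;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8}; // already sorted: still fine with a random pivot
        quicksort(a, 0, a.length - 1);
        System.out.println(java.util.Arrays.toString(a));
    }
}
```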
Depending on the implementation, there are several common ways to choose the pivot.
In general, for unsorted input there is no good or bad way to choose it.
So some implementations just take the first element as the pivot.
For an already sorted input this is the worst pivot possible, because the left partition will always be empty.
-> recursion depth = O(n) instead of the desired O(log n).
This leads to O(n²) complexity, which is very bad for sorting.
Choosing the pivot at random avoids this behavior. It is extremely unlikely that a randomly chosen pivot will have the same bad characteristics in every recursion step as described above.
Also, a deliberately bad input cannot be constructed, because you cannot predict the choices of the random generator (if it is a good one).

In quicksort, if an array is randomized, does using the median of 3 for pivot selection matter?

I've been comparing the run times of various pivot selection algorithms. Surprisingly, the simplest one, where the first element is always chosen, is the fastest. This may be because I'm filling the array with random data.
If the array has been randomized (shuffled), does it matter? For example, picking the median of 3 as the pivot is always(?) better than picking the first element as the pivot. But this isn't what I've noticed. Is it because if the array is already randomized there would be no reason to assume sortedness, and using the median assumes there is some degree of sortedness?
The worst-case runtime of quicksort is O(n²). Quicksort is a fast sorting algorithm only in the average case.
To reach an average runtime of O(n log n) you have to choose the pivot element at random.
But instead of choosing a random pivot element, you can shuffle the list and choose the first element.
To see why this holds, look at it this way: say all elements start in a specific order. Shuffling applies a random permutation to the list, so a random element ends up in the first position (and in every other position). You can also see it by building the shuffle step by step: randomly choose one of all the elements for the first position, then randomly choose one of the remaining (not yet chosen) elements for the second position, and so on.
If your list is already a randomly generated list, you can directly choose the first element as the pivot, without shuffling again.
So choosing the first element is the fastest because of the randomly generated input, but choosing the third or the last element will be just as fast as choosing the first.
All other ways to choose a pivot element have to compute something (a median or a random number or something like that), but they have no advantage over a random choice.
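A minimal sketch of that shuffle (the standard Fisher-Yates algorithm; java.util.Collections.shuffle does the same thing); the class name is made up for the example:

```java
import java.util.Random;

public class ShuffleDemo {
    // Fisher-Yates shuffle: afterwards every permutation is equally likely,
    // so the element at index 0 is a uniformly random element of the array.
    static void shuffle(int[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        shuffle(a, new Random());
        System.out.println("First element (random pivot candidate): " + a[0]);
    }
}
```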
A substantially late response, but I believe it will add some additional info.
Surprisingly the simplest one where the first element is always chosen is the fastest.
This is actually not surprising at all, since you mentioned that you test the algorithm with random data. In reality, the percentage of almost-sorted and sorted data is much greater than would statistically be expected. Take chronological data, for example: when you collect it into a log file, some elements can be out of order, but most of them are already sorted. Unfortunately, a quicksort implementation that takes the first (or last) element as the pivot is vulnerable to such input, and it degenerates into O(n²) complexity, because in the partition step you divide your array into two parts of size 1 and n-1 and therefore get n partitions instead of about log n on average.
That's why people decided to add some sort of randomization that would make the probability of getting the problematic input as small as possible. There are three well-known approaches:
shuffle the input - to quote Robert Sedgewick, "the probability of getting O(n^2) performance with such approach is lower than the probability that you will be hit by a thunderstrike" :)
choose the pivot element randomly - Wikipedia says that on average, the expected number of comparisons in this case is 1.386 n log n
choose the pivot element as a median of three - Wikipedia says that on average, the expected number of comparisons in this case is 1.188 n log n
However, randomization costs. If you shuffle the input array, that is O(n), which is dominated by O(n log n), but you need to take into account the cost of invoking the random(..) method n times. With your simple approach, that is avoided and it is thus faster.
See also:
Worst case for Quicksort - when can it occur?

Scenarios for selection sort, insertion sort, and quick sort

If anyone can give some input on my logic, I would very much appreciate it.
Which method runs faster for an array with all keys identical, selection sort or insertion sort?
I think that this would be similar to when the array is already sorted, so that insertion sort will be linear, and the selection sort quadratic.
Which method runs faster for an array in reverse order, selection sort or insertion sort?
I think that they would run similarly, since the values at every position will have to be changed. The worst case scenario for insertion sort is reverse order, so that would mean it is quadratic, and then the selection sort would already be quadratic as well.
Suppose that we use insertion sort on a randomly ordered array where elements have only one of three values. Is the running time linear, quadratic, or something in between?
Since it is randomly ordered, I think that would mean that insertion sort would have to perform many times more operations than there are values. If that's the case, then it's not linear. So it would likely be quadratic, or perhaps a little below quadratic.
What is the maximum number of times during the execution of Quick.sort() that the largest item can be exchanged, for an array of length N?
The largest item cannot be passed over more times than there are positions available, since it should always be approaching its correct position. So, going from the first position to the last, it would be exchanged N times.
About how many compares will quick.sort() make when sorting an array of N items that are all equal?
When drawing out the quicksort, a triangle can be drawn around the compared objects at every phase that is N tall and N wide; the area of this triangle would equal the number of compares performed, which would be N²/2.
Here are my comments on your comments:
Which method runs faster for an array with all keys identical, selection sort or insertion sort?
I think that this would be similar to when the array is already sorted, so that insertion sort will be linear, and the selection sort quadratic.
Yes, that's correct. Insertion sort will do O(1) work per element and visit O(n) elements for a total runtime of O(n). Selection sort always runs in time Θ(n²) regardless of the input structure, so its runtime will be quadratic.
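As a minimal sketch (not code from the question), a textbook insertion sort makes this easy to see: when all keys are equal, or the array is already sorted, the inner loop condition fails immediately, so each element costs O(1):

```java
public class InsertionSortDemo {
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            // With all keys equal (or already sorted), a[j] > key is false
            // right away, so this loop body never runs: O(1) per element.
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int[] allEqual = {5, 5, 5, 5, 5};
        insertionSort(allEqual); // linear number of comparisons
        System.out.println(java.util.Arrays.toString(allEqual));
    }
}
```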
Which method runs faster for an array in reverse order, selection sort or insertion sort?
I think that they would run similarly, since the values at every position will have to be changed. The worst case scenario for insertion sort is reverse order, so that would mean it is quadratic, and then the selection sort would already be quadratic as well.
You're right that both algorithms have quadratic runtime. The algorithms should actually have relatively comparable performance, since they'll make the same total number of comparisons.
Suppose that we use insertion sort on a randomly ordered array where elements have only one of three values. Is the running time linear, quadratic, or something in between?
Since it is randomly ordered, I think that would mean that insertion sort would have to perform many times more operations than there are values. If that's the case, then it's not linear. So it would likely be quadratic, or perhaps a little below quadratic.
This should take quadratic time (time Θ(n²)). Take just the elements in the back third of the array. About a third of these elements will be 1's, and in order to insert them into the sorted sequence they'd need to be moved above 2/3's of the way down the array. Therefore, the work done would be at least (n / 3)(2n / 3) = 2n² / 9, which is quadratic.
What is the maximum number of times during the execution of Quick.sort() that the largest item can be exchanged, for an array of length N?
The largest item cannot be passed over more times than there are positions available, since it should always be approaching its correct position. So, going from the first position to the last, it would be exchanged N times.
There's an off-by-one error here. When the array has size 1, the largest element can't be moved any more, so the maximum number of moves would be N - 1.
About how many compares will quick.sort() make when sorting an array of N items that are all equal?
When drawing out the quicksort, a triangle can be drawn around the compared objects at every phase that is N tall and N wide; the area of this triangle would equal the number of compares performed, which would be N²/2.
This really depends on the implementation of Quick.sort(). Quicksort with ternary partitioning would only do O(n) total work because all values equal to the pivot are excluded in the recursive calls. If this isn't done, then your analysis would be correct.
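For reference, here is a minimal sketch of quicksort with ternary (three-way) partitioning; it is a generic illustration, not necessarily what Quick.sort() does. With all keys equal, both recursive calls receive empty ranges, so the total work is linear:

```java
public class Quick3Way {
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int lt = lo, gt = hi, i = lo + 1;
        int pivot = a[lo];
        // Invariant: a[lo..lt-1] < pivot, a[lt..i-1] == pivot, a[gt+1..hi] > pivot.
        while (i <= gt) {
            if      (a[i] < pivot) swap(a, lt++, i++);
            else if (a[i] > pivot) swap(a, i, gt--);
            else                   i++;
        }
        sort(a, lo, lt - 1);  // empty when all keys are equal
        sort(a, gt + 1, hi);  // empty when all keys are equal
    }

    private static void swap(int[] a, int x, int y) {
        int t = a[x]; a[x] = a[y]; a[y] = t;
    }

    public static void main(String[] args) {
        int[] a = {7, 7, 7, 7, 7, 7};
        sort(a, 0, a.length - 1); // one linear pass, no further recursion
        System.out.println(java.util.Arrays.toString(a));
    }
}
```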
Hope this helps!

Efficiently generating all possible permutations of a linked list?

There are many algorithms for generating all possible permutations of a given set of values. Typically, those values are represented as an array, which has O(1) random access.
Suppose, however, that the elements to permute are represented as a doubly-linked list. In this case, you cannot randomly access elements in the list in O(1) time, so many permutation algorithms will experience an unnecessary slowdown.
Is there an algorithm for generating all possible permutations of a linked list with as little time and space overhead as possible?
Try to think of how you generate all permutations on a piece of paper.
You start from the rightmost number and go one position to the left until you see a number that is smaller than its right neighbour. Then you put in its place the smallest number to its right that is still larger than it, and order all the remaining numbers in increasing order after it. Do this until there is nothing more to do. Put a little thought into it and each step can be done in time linear in the number of elements.
This is in fact the typical algorithm used for "next permutation" as far as I know. I see no reason why it would be faster on an array than on a list.
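A minimal Java sketch of that next-permutation step, shown on an array for brevity; the same right-to-left scan, swap, and suffix reversal can be done on a doubly linked list in linear time:

```java
import java.util.Arrays;

public class NextPermutation {
    // Transforms a into its lexicographic successor; returns false if a was the last permutation.
    static boolean nextPermutation(int[] a) {
        int i = a.length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;   // rightmost position smaller than its neighbour
        if (i < 0) return false;
        int j = a.length - 1;
        while (a[j] <= a[i]) j--;                 // smallest value to the right that is larger than a[i]
        swap(a, i, j);
        reverse(a, i + 1, a.length - 1);          // suffix was descending; reversing sorts it ascending
        return true;
    }

    private static void swap(int[] a, int x, int y) { int t = a[x]; a[x] = a[y]; a[y] = t; }

    private static void reverse(int[] a, int lo, int hi) {
        while (lo < hi) swap(a, lo++, hi--);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        do {
            System.out.println(Arrays.toString(a));
        } while (nextPermutation(a));             // prints all 6 permutations in order
    }
}
```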
You might want to look into the Steinhaus–Johnson–Trotter algorithm. It generates all permutations of a sequence only by swapping adjacent elements; something which you can do in O(1) in a doubly linked list.
You should read the linked-list's data into an array, which takes O(n) and then use Heap's permutation ( http://www.geekviewpoint.com/java/numbers/permutation) to find all the permutations.
You can use the Factoradic Permutation Algorithm, and rearrange the node pointers accordingly to generate the resulting permutation in place without recursion.
A pseudo description:
element_to_permute = list
temp_list = new empty list
for i = 1 to n!
    indexes[] = factoradic(i)
    for j in indexes[]
        rearrange pointers of node `indexes[j]` of `element_to_permute` in `temp_list`
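To illustrate the factoradic idea, here is a hedged Java sketch that selects values from a working list rather than re-linking actual nodes; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FactoradicPermutation {
    // Returns the i-th permutation (0-based) of the given elements, using the
    // factoradic (factorial number system) digits of i as selection indexes.
    static <T> List<T> ithPermutation(List<T> elements, long i) {
        List<T> pool = new ArrayList<>(elements);
        List<T> result = new ArrayList<>();
        int n = pool.size();
        long[] factorial = new long[n + 1];
        factorial[0] = 1;
        for (int k = 1; k <= n; k++) factorial[k] = factorial[k - 1] * k;
        for (int k = n - 1; k >= 0; k--) {
            int index = (int) (i / factorial[k]);   // next factoradic digit
            i %= factorial[k];
            result.add(pool.remove(index));         // pick and remove the chosen element
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> elements = Arrays.asList(1, 2, 3);
        for (long i = 0; i < 6; i++) {              // 3! = 6 permutations
            System.out.println(ithPermutation(elements, i));
        }
    }
}
```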

Fast algorithm for finding next smallest and largest number in a set

I have a set of positive numbers. Given a number not in the set, I want to find the next smallest and next largest numbers that are in the set. The only way I can think to do it now is to find the next smallest by decreasing by 1 until I find a number in the set, and then do the same for finding the next largest.
Motivation: I have a bunch of data in a hashmap, keyed by dates. I don't have a datapoint for every single date. If I have data for, say, 10/01/2000 as 60 and 10/05/2000 as 68, and I ask for 10/02/2000, I want to linearly interpolate. I should get 62.
It depends on if your set is sorted.
If your set is unsorted then finding the closest (higher and lower) is an O(n) operation and a fairly simple algorithm.
If your set is sorted then you can use a modified bisection search to find the answer in O(log n), which is obviously a lot better particularly on larger sets.
If you're doing this repeatedly it might be worth sorting the set, which incurs an O(n log n) cost that might be once off or not depending on how often the set changes. Some kind of tree sort may help improve future sorts as new items are added.
All this boils down to is binary search, provided you can get your data sorted. There are two options.
Sorted Container
If you keep your numbers in a sorted container, this is pretty easy. Instead of using a HashMap, put the data in a TreeMap, then you can efficiently find the next lower or next higher element. Java even has methods to do exactly what you want:
higherKey(K)
lowerKey(K)
This is efficient because TreeMap uses a red-black tree (a kind of balanced binary search tree) internally. higherKey and lowerKey simply start at the root and traverse the tree to find where your element should go.
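A small sketch of how this looks with the date-keyed data from the question, using floorEntry/ceilingEntry (also TreeMap methods, mentioned further below), which additionally handle the case where the query date is an exact key; all names here are illustrative:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.TreeMap;

public class NearestDates {
    public static void main(String[] args) {
        TreeMap<LocalDate, Double> data = new TreeMap<>();
        data.put(LocalDate.of(2000, 10, 1), 60.0);
        data.put(LocalDate.of(2000, 10, 5), 68.0);

        LocalDate query = LocalDate.of(2000, 10, 2);
        Map.Entry<LocalDate, Double> below = data.floorEntry(query);   // nearest key <= query
        Map.Entry<LocalDate, Double> above = data.ceilingEntry(query); // nearest key >= query
        // (Assumes the query lies inside the key range; otherwise one of these is null.
        //  If query is an exact key, below and above coincide and data.get(query) suffices.)

        // Linear interpolation between the two surrounding data points.
        double span = ChronoUnit.DAYS.between(below.getKey(), above.getKey());
        double offset = ChronoUnit.DAYS.between(below.getKey(), query);
        double value = below.getValue()
                + (above.getValue() - below.getValue()) * (offset / span);
        System.out.println(value); // 62.0
    }
}
```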
I'm not sure what language you're using, but in C++ you would use std::map, and the analogous methods are:
iterator lower_bound(const key_type& k)
iterator upper_bound(const key_type& k)
Array + Sorting
If you don't want to keep your data sorted all the time, you can always dump your data into an array (or any random access container), use sort, and then use the STL's binary search routines on the array:
lower_bound
upper_bound
In Java the analog would be to dump things into an ArrayList, call Java's sort(), then use binarySearch().
All the search routines here are O(log n) time. The cost of keeping your data sorted is O(n log n) with either a sorted container or with the array. With a sorted container, the cost is amortized over n insertions; with the array you pay it all at once when you call sort().
If you don't want to sort things at all, you can always use a linear search, but you will pay if you use this a lot, as it's an O(n) algorithm.
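A small sketch of the array + sorting approach in Java, using the insertion point that binarySearch encodes for missing keys (the names are illustrative, and it assumes the query lies strictly between the smallest and largest keys):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedArraySearch {
    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>(Arrays.asList(17, 3, 42, 8, 25));
        Collections.sort(keys);                       // [3, 8, 17, 25, 42]

        int query = 20;
        int pos = Collections.binarySearch(keys, query);
        // For a missing key, binarySearch returns (-(insertion point) - 1).
        int insertionPoint = pos >= 0 ? pos : -pos - 1;

        int nextSmallest = keys.get(insertionPoint - 1); // 17
        int nextLargest  = keys.get(insertionPoint);     // 25
        System.out.println(nextSmallest + " < " + query + " < " + nextLargest);
    }
}
```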
Put your data items into a tree, like an AVL tree, a red-black tree, or a B+/B- tree. Then you can search the ordered values.
Sort the numbers, then perform binary search on each key to bisect the set. You can then find which numbers are on either side of your missing key.
Convert the set to a list and sort it, then run a binary search for the number not in the set. The result will be the insertion point, i.e. the position at which the number would be inserted if it were there. If you call that n, then the element at index n-1 of the sorted list is the next smallest number and the element at index n of the sorted list is the next largest number.
You can also do this by keeping the set in sorted order as you construct it, then it becomes an easy matter to search for the insertion point. This approach is used by e.g. the floorEntry() and ceilingEntry() methods of Java's TreeMap.
Keep your set as a sorted list/array and perform bisection-search: e.g., in Python, a sorted list and the bisect module from the standard Python library match your needs to the hilt.
If you get the keys in an array, you can sort the array and find the index of the last element that is less than the desired element. Then you know the index of the key directly before your desired point, and the next element after that is the one directly after.
That should give you enough to interpolate.
(The data structure used need not be an array, anything that will sort is fine. A balanced binary tree, as suggested by others, would be ideal, especially if you plan to reuse the data later).
Finding the n-th smallest element in an unsorted set is O(n) (the selection algorithm). Although here you can boil it down to a simpler, less general algorithm if you always want the smallest and next smallest elements. But in general, finding the smallest, second smallest, etc. element within an unsorted list is O(n). (You should have been taught this in your algorithms class...)
Sorting a set, and then indexing the element is O(n log n)
Finding an element in a sorted set is O(log n) (binary search)
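For reference, a minimal sketch of the expected-linear selection idea (quickselect with a random pivot); this is a generic illustration, not code from the thread:

```java
import java.util.Random;

public class QuickSelect {
    private static final Random RNG = new Random();

    // Returns the k-th smallest element (k is 0-based) in expected O(n) time.
    static int select(int[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) return a[p];
            if (p < k) lo = p + 1; else hi = p - 1;
        }
        return a[lo];
    }

    // Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int x, int y) { int t = a[x]; a[x] = a[y]; a[y] = t; }

    public static void main(String[] args) {
        int[] a = {9, 1, 8, 2, 7, 3};
        System.out.println(select(a.clone(), 0)); // 1 (smallest)
        System.out.println(select(a.clone(), 1)); // 2 (second smallest)
    }
}
```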
If you know that there will always be a data point for, say, each week, then keep your HashMap as it is and do what you suggest... That will be a constant time operation since you will be doing 14 hash table lookups (probing 7 days on each side of your search date), each taking O(1) primitive operations.
If you don't know how dense your data is and you can keep it in RAM, then put it into a balanced tree structure as suggested by many others. But this can be costly if you have very many dates and if you have to load the data over the network from a database.

Resources