Generating data to test sorting algorithms

I would like to generate data to test sorting algorithms with. This accomplishes two things:
Find bugs: the output can easily be checked to verify that it really is sorted.
Profile the code and find which situations take longer for which parts.
I asked the question How do you test speed of sorting algorithm? a while ago, but this question focuses particularly on generating the data.
I am thinking of
sorted
reverse sorted
random
sorted but then make n inversions in randomly selected elements and see how changing n affects the run time
Any suggestions? Do any frameworks exist that would make this easier? I'm thinking JUnit could be useful.
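A rough sketch of generating the inputs listed above (Python purely for illustration; the function names are mine, and the same idea ports directly to a JUnit data provider):

import random

def sorted_data(n):
    return list(range(n))

def reverse_sorted_data(n):
    return list(range(n - 1, -1, -1))

def random_data(n):
    data = list(range(n))
    random.shuffle(data)
    return data

def nearly_sorted_data(n, swaps):
    # Start sorted, then swap `swaps` randomly chosen pairs. Note that a single
    # swap of far-apart elements creates more than one inversion in the formal
    # sense, so `swaps` only roughly controls the amount of disorder.
    data = list(range(n))
    for _ in range(swaps):
        i, j = random.randrange(n), random.randrange(n)
        data[i], data[j] = data[j], data[i]
    return data

Checking correctness of a sorter is then just comparing its output against sorted(input).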
In this question on the Computer Science Stack Exchange, an answer suggests that adding inversions and counting them doesn't mean much:
The number of inversions might work for some cases, but is sometimes
insufficient. An example given in [3] is the sequence
$$\langle \lfloor n/2 \rfloor + 1, \lfloor n/2 \rfloor + 2, \ldots, n,
1, \ldots, \lfloor n/2 \rfloor \rangle$$
that has a quadratic number of inversions, but only consists of two
ascending runs. It is nearly sorted, but this is not captured by
inversions.
I'm not particularly strong in math and don't understand how the example illustrates what's wrong with counting the number of inversions. Is it just academic? How does it make sense to say "a quadratic number of inversions"?

Using integer math, the quoted sequence can be written out as an array:

index:  1,     2,     ..., n/2, n/2+1, ..., n
value:  n/2+1, n/2+2, ..., n,   1,     ..., n/2

So, as stated, it is just two ascending sequences.
By the definition of an inversion, two elements a[i] and a[j] form an inversion if a[i] > a[j] and i < j. Here, every one of the first n/2 elements of a (a[1] through a[n/2]) is greater than every one of the last n/2 elements (a[n/2 + 1] through a[n]). That gives (n/2)^2 = n^2/4 inversions, which is quadratic in n.
The relationship between inversion count and sort time complexity depends on the sorting algorithm. Bubble sort on the example array would take O(n^2) time. Generic merge sort would take O(n log(n)) time, with a near-best-case comparison count. Natural merge sort would find the two sorted runs and do a single merge pass, for a time complexity of O(n).
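To make the count concrete, here is a small check (Python, purely illustrative) that builds the example array for a given n, counts its inversions by brute force, and counts its ascending runs:

def example_array(n):
    half = n // 2
    return list(range(half + 1, n + 1)) + list(range(1, half + 1))

def count_inversions(a):
    # O(n^2) brute force is fine for a small sanity check.
    return sum(1 for i in range(len(a)) for j in range(i + 1, len(a)) if a[i] > a[j])

def count_runs(a):
    # An ascending run ends wherever the next element is smaller.
    return 1 + sum(1 for i in range(1, len(a)) if a[i] < a[i - 1])

a = example_array(8)        # [5, 6, 7, 8, 1, 2, 3, 4]
print(count_inversions(a))  # 16, i.e. (8/2)^2
print(count_runs(a))        # 2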

Related

Difference between O(logn) and O(nlogn)

I am preparing for software development interviews, and I always have trouble distinguishing between O(log n) and O(n log n). Can anyone explain the difference with some examples or point me to a resource? I don't have any code to show. I understand O(log n), but I haven't understood O(n log n).
Think of it as O(n*log(n)), i.e. "doing log(n) work n times". For example, searching for an element in a sorted list of length n is O(log(n)). Searching for the element in n different sorted lists, each of length n, is O(n*log(n)).
Remember that O(n) is defined relative to some real quantity n. This might be the size of a list, or the number of distinct elements in a collection. Therefore, every variable that appears inside O(...) represents something that interacts to increase the runtime. O(n*m) could be written as O(n_1 + n_2 + ... + n_m) and represents the same thing: "doing n work, m times".
Let's take a concrete example of this, mergesort. For n input elements: On the very last iteration of our sort, we have two halves of the input, each half size n/2, and each half is sorted. All we have to do is merge them together, which takes n operations. On the next-to-last iteration, we have twice as many pieces (4) each of size n/4. For each of our two pairs of size n/4, we merge the pair together, which takes n/2 operations for a pair (one for each element in the pair, just like before), i.e. n operations for the two pairs.
From here, we can extrapolate that every level of our mergesort takes n operations to merge. The big-O complexity is therefore n times the number of levels. On the last level, the size of the chunks we're merging is n/2. Before that, it's n/4, before that n/8, etc. all the way to size 1. How many times must you divide n by 2 to get 1? log(n). So we have log(n) levels. Therefore, our total runtime is O(n (work per level) * log(n) (number of levels)), n work log(n) times.
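The same reasoning can also be written as the usual recurrence (a standard derivation, assuming merging at each level costs about n operations):
$$T(n) = 2T(n/2) + n = 2^k\,T(n/2^k) + kn$$
Setting 2^k = n, i.e. k = log2(n), gives T(n) = n*T(1) + n*log2(n) = O(n log n): n work per level, log(n) levels.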

Merge Sort with Random Split

In the merge sort algorithm, suppose that instead of splitting the array into two equal halves, we split it at a random point in each call. I want to calculate the average running time of this variant.
Our notes analyse it the same way as normal merge sort. Is there a more formal argument?
Here is a (not very formal) proof that its time complexity is O(n log n).
Let's call a split "good" if the size of the largest part is at most 3/4 of the initial subarray (for an array with 8 elements the split positions look like this: bad bad good good good good bad bad). The probability that a split is good is 1/2, which means that among any two splits we expect one to be "good".
Let's draw a tree of recursive merge sort calls:
[a_1, a_2, a_3, ..., a_n]                        --- level 1
        /            \
[a_1, ..., a_k]    [a_{k+1}, ..., a_n]           --- level 2
   /     \             /     \
  ...    ...          ...    ...                 --- level 3
...
                                                 --- level m
It is clear that there are at most n elements at each level, so the time complexity is O(n * m).
But the claim about good splits implies that the expected number of levels is about 2 * log(n, 4/3), where log(a, b) is the logarithm of a to base b, which is O(log n).
Thus, the time complexity is O(n * log n).
I assume you're talking about recursive merge sort.
In standard merge sort, you split the array at the midpoint, so you end up with (mostly) same-sized subarrays at each level. But if you split somewhere else then, except in pathological cases, you still end up with nearly the same number of subarrays.
Look at it this way: the divide and conquer approach of standard merge sort results in log n "levels" of sorting, with each level containing all n items. You do n comparisons at each level to sort the subarrays. That's where the n log n comes from.
If you randomly split your array, then you're bound to have more levels, but not all items are at all levels. That is, smaller subarrays reach single-item arrays before the longer ones do, so not all items are compared at every level of the algorithm. This means some items are compared more often than others, but on average each item is compared about log n times.
So what you're really asking is, given a total number of items N split into k sorted arrays, is it faster to merge if each of the k arrays is the same length, rather than the k arrays being of varying lengths.
The answer is no. Merging N items from k sorted arrays takes the same amount of time regardless of the lengths of the individual arrays. See How to sort K sorted arrays, with MERGE SORT for an example.
So the answer to your question is that the average case (and the best case) of doing a recursive merge sort with a random split is O(n log n), with O(log n) stack space. The worst case, which would occur only if the random split always produced one subarray containing a single item and another containing the remainder, would require O(n) stack space and degrade to O(n^2) time, but the probability of that happening is vanishingly small.
Note that if you use an iterative merge sort, there is no asymptotic difference in time or space usage.
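For reference, a minimal sketch of the variant being discussed (Python, illustrative only); the only change from a textbook recursive merge sort is the choice of split point:

import random

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def random_split_mergesort(a):
    if len(a) <= 1:
        return a
    k = random.randint(1, len(a) - 1)   # random split point instead of len(a) // 2
    return merge(random_split_mergesort(a[:k]), random_split_mergesort(a[k:]))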

Finding the m Largest Numbers

This is a problem from the Cormen text, but I'd like to see if there are any other solutions.
Given an array with n distinct numbers, you need to find the m largest ones in the array, and have
them in sorted order. Assume n and m are large, but grow differently. In particular, you need
to consider below the situations where m = t*n, where t is a small number, say 0.1, and then the
possibility m = √n.
The solution given in the book offers 3 options:
Sort the array and return the top m-long segment
Convert the array to a max-heap and extract the m elements
Select the m-th largest number, partition the array about it, and sort the segment of larger entries.
These all make sense, and they all have their pros and cons, but I'm wondering, is there another way to do it? It doesn't have to be better or faster, I'm just curious to see if this is a common problem with more solutions, or if we are limited to those 3 choices.
The time complexities of the three approaches you have mentioned are as follows.
O(n log n)
O(n + m log n)
O(n + m log m)
So option (3) is definitely better than the others in terms of asymptotic complexity, since m <= n. When m is small, the difference between (2) and (3) is so small it would have little practical impact.
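A rough sketch of option (3), using a randomized quickselect for the selection step, so the O(n) part is expected rather than worst-case (Python, illustrative only; the function names are mine, and the numbers are assumed distinct as in the problem statement):

import random

def kth_largest(a, k):
    # Expected O(n); a median-of-medians pivot would make it worst-case O(n).
    pivot = random.choice(a)
    greater = [x for x in a if x > pivot]
    if k <= len(greater):
        return kth_largest(greater, k)
    equal = [x for x in a if x == pivot]
    if k <= len(greater) + len(equal):
        return pivot
    return kth_largest([x for x in a if x < pivot], k - len(greater) - len(equal))

def m_largest_sorted(a, m):
    threshold = kth_largest(a, m)             # select the m-th largest: O(n) expected
    top = [x for x in a if x > threshold]     # partition out the strictly larger entries
    top.append(threshold)                     # with distinct numbers this gives exactly m entries
    return sorted(top, reverse=True)          # sort only the m-long segment: O(m log m)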
As for other ways to solve the problem, there are infinitely many ways you could go about it, so the question is somewhat open-ended in this regard. Another approach I can think of as being practically simple and performant is the following.
Extract the first m numbers from your list of n into an array, and sort it.
Repeatedly grab the next number from your list and insert it into the correct location in the array, shifting all the lesser numbers over by one and pushing one out.
I would only do this if m was very small though. Option (2) from your original list is also extremely easy to implement if you have a max-heap implementation and will work great.
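A sketch of that insertion approach (Python, illustrative only; bisect keeps the running top-m array sorted, and the shifting on insert is exactly why it only pays off for very small m):

import bisect

def m_largest_by_insertion(a, m):
    top = sorted(a[:m])                # the first m numbers, ascending
    for x in a[m:]:
        if x > top[0]:                 # beats the current smallest of the top m
            bisect.insort(top, x)      # insert into the correct sorted position
            top.pop(0)                 # push the smallest one out
    return top[::-1]                   # largest first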
A different approach.
Take the first m numbers and turn them into a min-heap. Run through the rest of the array; whenever a value exceeds the minimum of the current top m, extract that minimum and insert the new value. When you reach the end of the array, extract the elements into an array and reverse it.
The worst case performance of this version is O(n log(m)) placing it between the first and second methods for efficiency.
The average case is more interesting. On average only O(m log(n/m)) of the elements are going to pass the first comparison test, each time incurring O(log(m)) work, so you get O(n + m log(n/m) log(m)) work, which puts it between the second and third methods. But if n is many orders of magnitude greater than m, then the O(n) piece dominates, and the O(n) median select in the third approach has worse constants than the one comparison per element in this approach, so in that case this is actually the fastest!
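A sketch of this min-heap approach (Python, illustrative only; heapq provides the binary min-heap):

import heapq

def m_largest_min_heap(a, m):
    heap = a[:m]
    heapq.heapify(heap)                 # min-heap of the first m numbers: O(m)
    for x in a[m:]:
        if x > heap[0]:                 # one comparison per remaining element
            heapq.heapreplace(heap, x)  # pop the min, push x: O(log m)
    return sorted(heap, reverse=True)   # extract and order largest-first: O(m log m)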

How to generate random permutations fast

I read a question in an algorithm book:
"Given a positive integer n, choose 100 random permutations of [1,2,...,n],..."
I know how to generate a random permutation with Knuth's algorithm, but is there a fast algorithm for generating a large number of permutations?
Knuth shuffles require you to do n random swaps for a permutation of n elements (see http://en.wikipedia.org/wiki/Random_permutation#Knuth_shuffles), so the complexity is O(n), which is about the best you can expect if you have to produce a permutation of n elements.
Is this causing you a practical problem? If so, perhaps you could look at what you are doing with all those permutations in practice. Apart from simply getting by on fewer, you could think about deferring generating a permutation until you are sure you need it. If you need a permutation on n objects but only look at k of those n objects, perhaps you need a scheme for generating only those k elements. For small k, you could simply generate k random numbers in the range [0, n), redrawing any number that has already come up; for small k, such collisions would be rare.
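A sketch of that rejection idea for small k (Python, illustrative only); the standard library's random.sample(range(n), k) does essentially the same job:

import random

def k_random_positions(n, k):
    # Draw positions in [0, n); redraw on collision. For k much smaller than n,
    # collisions are rare, so this is close to O(k) on average.
    seen = set()
    while len(seen) < k:
        seen.add(random.randrange(n))
    return list(seen)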
There exist N! permutations of the numbers from 1 to N. If you sort them lexicographically, like in a dictionary, it's possible to construct a permutation knowing only its position in that sorted list.
For example, let N = 3; the lexicographically sorted list of permutations is {123, 132, 213, 231, 312, 321}. You generate a number between 1 and 3!, for example 5. The 5th permutation is 312. How do you construct it?
Let's find the first number of the 5th permutation. Divide the permutations into blocks, where the criterion is the first number, i.e. the groups {123, 132}, {213, 231}, {312, 321}. Each group contains (N-1)! elements. The first number of the permutation is the block number: the 5th permutation is in block ceil(5/(3-1)!) = 3. So we've just found the first number of the 5th permutation: it's 3.
Now I'm looking not for the 5th permutation but for the (5 - (N-1)!*(ceil(5/2) - 1)) = 5 - 2*2 = 1st permutation in {312, 321}. The 3 is already determined and is the same for all members of that group, so I'm actually searching for the 1st permutation of {12, 21}, and N is now 2. Again, next_num = ceil(1/(new_N - 1)!) = 1.
Continue this N times.
Hope you get the idea. The complexity is O(N), because you construct the permutation's elements one by one with arithmetic tricks.
UPDATE
When you get the next number by these arithmetic operations, you should also keep an array of the values already used, and instead of X take the X-th unused value. The complexity then becomes O(N log N), because O(log N) is needed to find the X-th unused element (with a suitable data structure).
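A sketch of this unranking idea (Python, illustrative only; the rank is 0-based here to keep the arithmetic simple). It finds the X-th unused element with a plain list, which is O(N) per step and O(N^2) overall; the O(N log N) bound in the update above needs an order-statistics structure instead:

from math import factorial

def unrank_permutation(n, rank):
    unused = list(range(1, n + 1))
    result = []
    for i in range(n, 0, -1):
        block = factorial(i - 1)           # size of each block at this position
        idx, rank = divmod(rank, block)    # which block, and the rank within it
        result.append(unused.pop(idx))     # take the idx-th unused value
    return result

print(unrank_permutation(3, 4))  # [3, 1, 2], the 5th permutation in 1-based counting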

Finding the k smallest odd integers

I am teaching myself algorithms, and I am trying to solve the following problem:
We have an array A of n positive integers in arbitrary order, and an integer k with 1 <= k <= n. The task is to output the k smallest odd integers. If the number of odd integers in A is less than k, we should report all of them. For example, if A = [2, 17, 3, 10, 28, 5, 9, 4, 12, 13, 7] and k = 3, the output should be 3, 5, 7.
I want to solve this problem in O(n) time.
My current solution is to build another array containing only the odd numbers and then apply the following selection algorithm: find the median, partition the list into L (smaller elements), M (elements equal to the median), and R (larger elements), and compare k as follows:
If |L| < k <= |L| + |M|, return the median
else if k <= |L|, solve the problem recursively on L
else solve it recursively on (R, k - (|L| + |M|))
Any help is appreciated.
Assuming the output can be in any order:
Create a separate array with only odd numbers.
Use a selection algorithm to determine the k-th item. One such algorithm is quickselect (which runs in O(n) on average), which is related to quicksort - it partitions the array by some pivot, and then recursively goes to one of the partitioned sides, based on the sizes of each. See this question for more details.
Since quickselect partitions the input, you will be able to output the results directly after running this algorithm (as Karoly mentioned).
Both of the above steps take O(n), thus the overall running time is O(n).
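A sketch of these two steps (Python, illustrative only; a randomized quickselect is used, so the O(n) bound is expected rather than worst-case, and the output order is arbitrary as assumed above):

import random

def k_smallest_odds(a, k):
    odds = [x for x in a if x % 2 == 1]        # step 1: keep only the odd numbers, O(n)
    return select_smallest(odds, min(k, len(odds)))

def select_smallest(a, k):
    # Returns the k smallest elements of a in arbitrary order, expected O(n).
    if k <= 0:
        return []
    if k >= len(a):
        return a
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    if k <= len(less):
        return select_smallest(less, k)
    equal = [x for x in a if x == pivot]
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    greater = [x for x in a if x > pivot]
    return less + equal + select_smallest(greater, k - len(less) - len(equal))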
If you need the output in ascending order:
If k = n, and all the numbers are odd, then an O(n) solution to this would be an O(n) sorting algorithm, but no-one knows of such an algorithm.
To anyone considering the objection that some non-comparison-based sort is O(n): it isn't; each of those algorithms has some other factor in its complexity, such as the size of the numbers.
The best you can do here, with unbounded numbers, is to use the approach suggested in Proger's answer (O(n + k log n)), or iterate through the input, maintaining a heap of the k smallest odd numbers (O(n log k)).
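A sketch of the O(n log k) heap variant for sorted output (Python, illustrative only); heapq.nsmallest maintains a bounded heap internally and returns its result in ascending order:

import heapq

def k_smallest_odds_sorted(a, k):
    return heapq.nsmallest(k, (x for x in a if x % 2 == 1))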
