Testing sorting methods

I have an assignment that requires several sorting methods: bubble sort, insertion sort, quick sort, and heap sort. We have to write the code to perform each sort on data sets given to us. I wrote the code for each sort without any problems, but my question is about the second part of the assignment. We are asked to identify the conditions we should test to verify that our code works for all cases of input data. I am very lost as to what to do, because I can only think of two cases we would deal with:
1) The data is not sorted
2) The data is already sorted
The only conditions I can think of are the size of the data being tested and how "un-sorted" it is. Am I just thinking about this the wrong way, or what?
Note: I am working in C++, although I don't think it matters, since the methodology behind sorting is the same across all languages.

Empty data
1 value of data
values repeated in the data
1 value repeated 20 times
various permutations of values
Data sorted in reverse order
What you are trying to find out is whether your implementation has a defect. Normally you would create unit tests using these conditions...
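As a rough illustration of turning the conditions above into unit tests, here is a minimal Python sketch (the question uses C++, but the idea carries over). The names `insertion_sort`, `edge_cases`, and `check_sort` are hypothetical; `insertion_sort` only stands in for whichever implementation is under test, and the built-in `sorted` is used as the reference result.

```python
import random

def insertion_sort(values):
    """Stand-in for the sort under test (hypothetical example implementation)."""
    result = list(values)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot to the right, then drop the key in.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

def edge_cases(seed=0):
    """Inputs matching the conditions listed in the answer."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    return {
        "empty": [],
        "single value": [7],
        "repeated values": [3, 1, 3, 2, 1],
        "one value repeated 20 times": [5] * 20,
        "already sorted": list(range(10)),
        "reverse sorted": list(range(9, -1, -1)),
        "random permutation": [rng.randint(-50, 50) for _ in range(100)],
    }

def check_sort(sort_fn):
    """Return the names of the cases where sort_fn disagrees with sorted()."""
    return [name for name, data in edge_cases().items()
            if sort_fn(list(data)) != sorted(data)]

print(check_sort(insertion_sort))  # an empty list means every case passed
```

Passing a copy of each input (`list(data)`) keeps an in-place sort from corrupting the reference data.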

There is a related question on the Programmers site.
I think its best answer applies to this question as well.
I agree with Jongware.
I also have some scripts to generate test data:
random.awk - random data
tim.awk : n-2, n-1, n-4, n-3, ..... , 2, 3, 0, 1 - nearly reversed
middle.pl : 5, 3, 1, 2, 4 - middle-pivot quicksort killer (Perl script)
median3.awk : n-1, 0, 1, 2, ... , n-2 - median-of-3 quicksort killer
valley.awk : n-1, n-3, ..., 4, 2, 0, 1, 3, ..., n-4, n-2 - bad case for most quicksorts
n111.awk : n, 1, 1, 1, ..., 1, 1 - first is big
n11n.awk : n, 1, 1, 1, ..., 1, n - first and last are big
n1n1.awk : n, 1, n, 1, ..., n, 1 - zigzag
nn11.awk : n, ..., n, 1, ..., 1 - few distinct values
nnnn.awk : n, n, ..., n - all the same
Other n*.awk scripts are on GitHub.
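The awk and Perl scripts themselves are not reproduced here. As a rough sketch, a few of the patterns listed above could be generated in Python as follows. The function names are hypothetical, and each generator is only my reading of the sequence shown next to the script name.

```python
def nearly_reversed(n):
    """n-2, n-1, n-4, n-3, ..., 2, 3, 0, 1 (assumes n is even)."""
    if n % 2:
        raise ValueError("this pattern needs an even n")
    out = []
    for start in range(n - 2, -1, -2):
        out += [start, start + 1]
    return out

def median3_killer(n):
    """n-1, 0, 1, 2, ..., n-2."""
    return [n - 1] + list(range(n - 1))

def valley(n):
    """n-1, n-3, ..., 2, 0, 1, 3, ..., n-2 (assumes n is odd)."""
    return list(range(n - 1, -1, -2)) + list(range(1, n, 2))

def zigzag(n):
    """n, 1, n, 1, ..., n, 1 (n // 2 pairs)."""
    return [n, 1] * (n // 2)

def all_same(n):
    """n, n, ..., n."""
    return [n] * n

print(nearly_reversed(6))
print(valley(7))
```

Feeding these lists to a sort and comparing against a trusted reference sort checks correctness; timing them checks the bad-case behaviour.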


Finding set of products formed by two lists

Given two lists, say A = [1, 3, 2, 7] and B = [2, 3, 6, 3]
Find the set of all products that can be formed by multiplying a number in A with a number in B. (By set, I mean I do not want duplicates.) I am looking for the fastest running time possible. Hash functions are not allowed.
The first approach would be brute force, where we multiply every number from A with every number in B, and if we find a product that is not already in the list, we add it to the list. Finding all possible products costs O(n^2), and verifying whether a product is already present in the list costs O(n^2) each time, because the list can hold up to n^2 products. So the total comes to O(n^4).
I am looking to optimize this solution. The first thing that comes to mind is to remove duplicates in list B. In my example, 3 is a duplicate, and I do not need to compute the product of all elements of A with the duplicate 3 again. But this still doesn't reduce the overall running time.
I am guessing the fastest possible run time can be O(n^2) if all the numbers in A and B combined are unique AND prime. That way it is guaranteed that there will be no duplicates and I do not need to verify if my product is already present in the list. So I am thinking if we can pre-process our input list such that it will guarantee unique product values (One way to pre-process is to remove duplicates in list B like I mentioned above).
Is this possible in O(n^2) time and will it make a difference if I only care about the number of unique possible products instead of the actual products?
for i = 1 to A.length:
    for j = 1 to B.length:
        if (A[i] * B[j]) not already present in list: // takes O(n^2) time to verify this
            Add (A[i] * B[j]) to list
        end if
    end for
end for
print list
Expected result for the above input: 2, 3, 6, 9, 18, 4, 12, 14, 21, 42
I can think of a O(n^2 log n) solution:
1) I generate all possible product values without worrying about duplicates (this is O(n^2))
2) Sort these product values (this will be O(n^2 log n) because we have n^2 numbers to sort)
3) Remove the duplicates in linear time since the elements are now sorted
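The three steps above can be sketched in Python without any hash-based container. The function name `unique_products` is hypothetical, and the result comes out in sorted order rather than in the discovery order used in the expected result above.

```python
def unique_products(a, b):
    """Sort-then-deduplicate: O(n^2 log n) comparisons, no hashing."""
    # Step 1 and 2: generate every product, then sort them.
    products = sorted(x * y for x in a for y in b)
    # Step 3: equal values are now adjacent, so one linear pass removes them.
    unique = []
    for p in products:
        if not unique or unique[-1] != p:
            unique.append(p)
    return unique

print(unique_products([1, 3, 2, 7], [2, 3, 6, 3]))
# [2, 3, 4, 6, 9, 12, 14, 18, 21, 42]
```

If only the number of distinct products is needed, the same pass can count changes between adjacent values instead of storing them.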
Use sets to eliminate duplicates.
A=[3, 6, 6, 8]
B=[7, 8, 56, 3, 2, 8]
setA = set(A)
setB = set(B)
prod = {i*j for i in setA for j in setB}  # set comprehension; duplicates collapse automatically
{64, 448, 6, 168, 9, 42, 12, 16, 48, 18, 336, 21, 24, 56}
Complexity is O(n^2) on average. Note that Python's set is implemented with hashing.
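Another way is the following. It keeps prod as a sorted list and scans it for every product, so no hashing is involved.
Worst-case complexity is O(n^4): each of the n^2 products is compared against a list that can grow to n^2 entries.
prod = []  # kept sorted and free of duplicates
for i in A:
    for j in B:
        if prod == []:
            prod.append(i*j)
            continue
        for k in range(len(prod)):
            if i*j < prod[k]:
                prod.insert(k, i*j)  # insert before the first larger value
                break
            elif i*j == prod[k]:
                break  # duplicate, skip it
            if k == len(prod)-1:
                prod.append(i*j)  # larger than everything seen so far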
Yet another way. This could be using hash functions internally.
from toolz import unique
print(list(unique([i*j for i in A for j in B])))

How can I sort a vector of boolean vectors in this way? ('ranking analysis')

We need to sort a large number of vectors (an array of arrays) containing only true and false (1's and 0's), all the same size.
We have the rules that 1 + 1 = 1 (true + true = true) and 1 + 0 = 1 and 0 + 0 = 0.
The first vector is the one with the most 1's.
The second vector is the one which brings more 1's in addition to the ones we already had in the first vector.
The third vector is the one which brings more 1's in addition to the ones we already had in the previous 2 vectors.
And so on.
For example, let's say we have these 3 vectors:
a. (0, 1, 0, 0, 1, 1, 0)
b. (1, 0, 1, 1, 0, 1, 1)
c. (0, 1, 1, 1, 0, 1, 0)
The first one in our sort is b because it has the most 1's.
The next one is a. Even though c has more 1's than a, a has more 1's in addition to the 1's we had in b.
By now, the sum a + b is (1, 1, 1, 1, 1, 1, 1), so the last one is c, because it brings nothing new to the sorting.
If two vectors bring the same number of extra 1's, their relative order doesn't really matter. I believe there are multiple possible results for this kind of sorting and they are all equally good.
We call this a 'ranking analysis' here, but we don't have a clear term for this kind of sort, and Google doesn't yield very useful info on it.
The easiest method is to just take them one by one, which is O(n^2) in the number of vectors. However, we are working with big data and the software we already have for this is too slow, so we need something really optimized.
How can we achieve this? The programming language doesn't matter; we can use anything. Can this be parallelized (run on multiple CPUs to speed up the process)? Any sources or ideas are welcome.
Edit: I checked; apparently we have a case where the length of these vectors is 103, so they can be longer than 64 slots.
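To pin down the ordering rule described above, here is a minimal Python sketch of the straightforward one-by-one method. It is only the O(n^2) baseline the question mentions, not the optimized solution being asked for. The function name `rank_vectors` is hypothetical. Each vector is packed into a Python integer used as a bitmask, which has no 64-bit limit, so length 103 is not a problem; ties go to the earlier vector.

```python
def rank_vectors(vectors):
    """Return vector indices in greedy order: most new 1s first."""
    # Pack each 0/1 vector into one integer so OR / AND work on whole vectors.
    masks = []
    for vec in vectors:
        mask = 0
        for bit in vec:
            mask = (mask << 1) | (1 if bit else 0)
        masks.append(mask)

    remaining = list(range(len(vectors)))
    covered = 0  # OR of every vector picked so far
    order = []
    while remaining:
        # Number of 1s a vector would add beyond what is already covered.
        best = max(remaining,
                   key=lambda i: bin(masks[i] & ~covered).count("1"))
        order.append(best)
        covered |= masks[best]
        remaining.remove(best)
    return order

a = (0, 1, 0, 0, 1, 1, 0)
b = (1, 0, 1, 1, 0, 1, 1)
c = (0, 1, 1, 1, 0, 1, 0)
print(rank_vectors([a, b, c]))  # [1, 0, 2], i.e. b, a, c
```

The gain computation for each remaining vector is independent of the others, so that inner step is the natural place to split work across CPUs.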

Get kth group of unsorted result list with arbitrary number of results per group

Okay, so I have a huge array of unsorted elements of an unknown data type. All elements are of the same type, obviously; I just can't make assumptions about it, as they could be numbers, strings, or any type of object that overloads the < and > operators. The only assumption I can make about those objects is that no two of them are the same, and that comparing them (A < B) tells me which one should show up first if the array were sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question, so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or fewer if the group requested is the last one. (Examples: 17 results with a groupSize of 5 would only return two of them if you ask for the fourth group. Also, the fourth group is group number 3 because it's a zero-indexed array.)
Received array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received groupSize: 3
Received groupNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (groupNumber 2 in a 0-indexed array): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill up all the other groups, and skips a pretty big part of the steps shown in the example above, to create only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s in the group using the standard C++ std::nth_element function, which runs in linear time on average (you know its index in the sorted array: groupSize * groupNumber). You can find the largest element S in the group in the same way. After that, you need a linear pass to find all elements x such that s <= x <= S and return them. Since no two elements are equal, this pass picks out exactly the elements of the group. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistics in linear time.
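As a language-neutral illustration of the answer above, here is a minimal Python sketch. Python has no direct equivalent of std::nth_element, so a simple randomized quickselect (expected linear time) is swapped in. The names `quickselect` and `get_group` are hypothetical. The sketch relies on the question's guarantee that all elements are distinct and uses only the < comparison. The final `sorted` call on the group is my addition so the output matches the example; it costs O(groupSize log groupSize) and can be dropped if order within the group does not matter.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed), expected O(n) time."""
    pool = list(items)
    while True:
        pivot = random.choice(pool)
        smaller = [x for x in pool if x < pivot]
        larger = [x for x in pool if pivot < x]
        n_equal = len(pool) - len(smaller) - len(larger)
        if k < len(smaller):
            pool = smaller                 # answer lies left of the pivot
        elif k < len(smaller) + n_equal:
            return pivot                   # the pivot itself is the answer
        else:
            k -= len(smaller) + n_equal    # answer lies right of the pivot
            pool = larger

def get_group(items, group_size, group_number):
    """Return the requested group without sorting the whole array."""
    start = group_size * group_number
    if group_size <= 0 or start >= len(items):
        return []
    end = min(start + group_size, len(items)) - 1
    low = quickselect(items, start)    # smallest element of the group
    high = quickselect(items, end)     # largest element of the group
    # One linear pass: keep every x with low <= x <= high.
    group = [x for x in items if not (x < low) and not (high < x)]
    return sorted(group)

data = [1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20]
print(get_group(data, 3, 2))  # [6.5, 8, 19]
```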

Importance of order of the operation in backtracking algorithms

How important is the order of operations in each recursive step of a backtracking algorithm for the efficiency of that algorithm?
For example, in the Knight’s Tour problem:
The knight is placed on the first block of an empty board and, moving
according to the rules of chess, must visit each square exactly once.
In each step there are 8 possible (in general) ways to move.
int xMove[8] = { 2, 1, -1, -2, -2, -1, 1, 2 };
int yMove[8] = { 1, 2, 2, 1, -1, -2, -2, -1 };
If I change this order like...
int xmove[8] = { -2, -2, 2, 2, -1, -1, 1, 1};
int ymove[8] = { -1, 1,-1, 1, -2, 2, -2, 2};
For an n*n board with n up to 6, the two move orders show no visible difference in execution time.
But for n >= 7, the first move order runs much faster than the second.
In such cases it is not feasible to generate all O(m!) operation orders and test the algorithm with each one. So how can I predict the performance of such an algorithm for a specific movement order, or better, how can I arrive at one operation order (or a set of them) that makes the algorithm more efficient in terms of execution time?
This is an interesting problem from a Math/CS perspective. There definitely exists a permutation (or set of permutations) that would be most efficient for a given n . I don't know if there is a permutation that is most efficient among all n. I would guess not. There could be a permutation that is better 'on average' (however you define that) across all n.
If I was tasked to find an efficient permutation I might try doing the following: I would generate a fixed number x of randomly generated move orders. Measure their efficiency. For every one of the randomly generated movesets, randomly create a fixed number of permutations that are near the original. Compute their efficiencies. Now you have many more permutations than you started with. Take top x performing ones and repeat. This will provide some locally maxed algorithms, but I don't know if it leads up to the globally maxed algorithm(s).
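The procedure described above can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: "efficiency" is measured as the number of recursive calls of a plain backtracking Knight's Tour started at (0, 0), capped by a hypothetical `limit` so bad orders cannot run forever; a "nearby" permutation is one with two moves swapped; and the defaults use a small 5x5 board to keep the run short. The names `tour_cost`, `mutate`, and `search_move_orders` are hypothetical.

```python
import random

# The eight knight moves; a "move order" is any permutation of this list.
MOVES = [(2, 1), (1, 2), (-1, 2), (-2, 1), (-2, -1), (-1, -2), (1, -2), (2, -1)]

def tour_cost(order, n=5, limit=20_000):
    """Count recursive calls made while searching for a tour, capped at limit."""
    board = [[False] * n for _ in range(n)]
    calls = 0

    def solve(x, y, visited):
        nonlocal calls
        calls += 1
        if visited == n * n or calls >= limit:
            return True  # tour found, or call budget exhausted
        for dx, dy in order:
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and not board[nx][ny]:
                board[nx][ny] = True
                if solve(nx, ny, visited + 1):
                    return True
                board[nx][ny] = False  # backtrack
        return False

    board[0][0] = True
    solve(0, 0, 1)
    return calls

def mutate(order, rng):
    """Return a nearby permutation: the same order with two moves swapped."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child

def search_move_orders(pop_size=4, children=2, rounds=3, seed=0, n=5, limit=20_000):
    """Keep the best pop_size orders, add mutated neighbours, repeat."""
    rng = random.Random(seed)
    population = [rng.sample(MOVES, len(MOVES)) for _ in range(pop_size)]
    for _ in range(rounds):
        candidates = population + [mutate(o, rng)
                                   for o in population for _ in range(children)]
        candidates.sort(key=lambda o: tour_cost(o, n, limit))
        population = candidates[:pop_size]
    best = population[0]
    return best, tour_cost(best, n, limit)

best_order, cost = search_move_orders()
print(best_order, cost)
```

As the answer says, this kind of search only finds locally good orders. An order that looks good on one board size or start square may not carry over to another.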

Algorithm for merging sets that share at least 2 elements

Given a list of sets:
S_1 : [ 1, 2, 3, 4 ]
S_2 : [ 3, 4, 5, 6, 7 ]
S_3 : [ 8, 9, 10, 11 ]
S_4 : [ 1, 8, 12, 13 ]
S_5 : [ 6, 7, 14, 15, 16, 17 ]
What is the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected components problem. So the result would be:
[ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17] (S_1 UNION S_2 UNION S_5)
[ 8, 9, 10, 11 ]
[ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)
The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.
Let there be a list of many Sets named (S).
Perform a pass through all elements of S, to determine the range (LOW .. HIGH).
Create an array of pointers to Set, of dimensions (LOW, HIGH), named (M).
do
    Init all elements of M to NULL.
    Iterate through S, processing them one Set at a time, named (Si).
        Enumerate all ordered pairs in Si: (P1, P2) where P1 < P2.
        For each pair examine M(P1, P2)
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si into the Set pointed to by M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1)
        If Si was not merged,
            Enumerate the pairs of Si again
            For each pair, make M(P1, P2) point to Si.
while at least one set was merged during the pass.
My head is saying this is about Order (2N ln N).
Take that with a grain of salt.
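The pair-index idea above can be sketched in Python as follows. This is a hedged illustration, not a tuned implementation: a dictionary keyed by element pairs replaces the two-dimensional array M (an assumption on my part, to avoid allocating the full LOW..HIGH grid), the function name `merge_sets_sharing_two` is hypothetical, and the elements are assumed to be sortable. Each set of size k contributes k*(k-1)/2 pairs, so large sets make this expensive.

```python
from itertools import combinations

def merge_sets_sharing_two(sets):
    """Merge sets that share at least two elements, repeating until stable."""
    sets = [set(s) for s in sets]
    merged_any = True
    while merged_any:
        merged_any = False
        owner = {}       # (p1, p2) with p1 < p2 -> set that registered the pair
        survivors = []
        for current in sets:
            pairs = list(combinations(sorted(current), 2))
            # Look for an earlier set in this pass that already owns a pair.
            target = next((owner[p] for p in pairs if p in owner), None)
            if target is not None:
                target |= current          # merge into the earlier set
                merged_any = True
            else:
                for p in pairs:
                    owner[p] = current     # register every pair of this set
                survivors.append(current)
        sets = survivors  # merged sets grew, so another pass re-indexes them
    return sets

example = [
    [1, 2, 3, 4],
    [3, 4, 5, 6, 7],
    [8, 9, 10, 11],
    [1, 8, 12, 13],
    [6, 7, 14, 15, 16, 17],
]
for s in merge_sets_sharing_two(example):
    print(sorted(s))
```

On the example from the question this takes three passes: S_2 joins S_1 in the first, S_5 joins the grown set in the second, and the third confirms that nothing else merges.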
If you can order the elements in the set, you can look into using Mergesort on the sets. The only modification needed is to check for duplicates during the merge phase. If one is found, just discard the duplicate. Since mergesort is O(n*log(n)), this will offer improved speed when compared to the naive O(n^2) algorithm.
However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.
I don't see how this can be done in less than O(n^2).
Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.
In sorting, the naive implementation is O(n^2), but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it has already been compared to the pivot). This is what results in sorting being O(n * log n).
This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.
One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.
If your elements are numerical in nature, or can be naturally ordered (i.e. you can assign a value such as 1, 2, 42, etc.), I would suggest using a radix sort on the merged sets, followed by a second pass to pick out the unique elements.
This algorithm should be O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.
