Time complexity of sort by x1, then x2?

The standard sorting literature will tell you that a comparison sort (merge sort, quicksort) can generally be done in N log N.
So for instance if I have this list: [1,6,3,3,4,2]
it will be sorted in N log N time to this: [1,2,3,3,4,6]
What if I have a list where I sort by the first property, then the second?
Like this: [(1,1),(6,3),(3,4),(3,1),(2,8)] to this: [(1,1),(2,8),(3,1),(3,4),(6,3)]
What's the time complexity of that?
What I've come up with is that if all the first indices are the same, you're just doing an N log N again, so the same. If there's a bunch of different first indices though, you are re-sorting a bunch of little sets.

Merge sort (or quicksort) performs O(N log N) comparisons, so its time complexity is O(N log N * time_to_compare_two_elements). Comparing a pair of elements lexicographically takes constant time (assuming comparing two individual elements takes constant time). Thus, the time complexity of sorting an array of pairs is O(N log N), too.
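For illustration, Python's built-in sort already compares tuples lexicographically (by first element, then by second on ties), so sorting a list of pairs is a single O(N log N) sort with constant-time comparisons. A quick check on the list from the question:
L = [(1, 1), (6, 3), (3, 4), (3, 1), (2, 8)]
# Tuples compare lexicographically: first elements, then second elements on ties.
print(sorted(L))  # [(1, 1), (2, 8), (3, 1), (3, 4), (6, 3)]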

First you compare the first elements of each pair and sort; that takes N log N. Then you compare the second elements where the first elements are equal, which takes at most another N log N. The total is 2 N log N, which is still just N log N.
Hope this helps!

If you don't know anything more about your data, you can't guarantee better than O(N log(N)): all first elements could be the same, and then you'd be stuck sorting the second elements as normal.
That's what the O(N log(N)) bound means: if you have to generic-comparison-sort your data, you can't improve on this. Full stop.
If you do have further information, then (as you intuitively reasoned) you might be able to improve on this. As an example, say that for any x that occurs at least once as the first element in a pair, there are roughly log(N) pairs having x as their first element. In this case, your first pass can be more efficient:
d = {}
for x, y in L:
    xL = d.setdefault(x, [])  # bucket the second elements by their first element
    xL.append(y)
xs_sorted = sorted(d.keys())
This is (roughly) O(N), since d.keys() has N / log(N) elements. Next, you can sort each of the N / log(N) sublists, which each have size log(N):
L_sorted = []
for x in xs_sorted:
    ys_sorted = sorted(d[x])  # each bucket has roughly log(N) elements
    for y in ys_sorted:
        L_sorted.append((x, y))
This is O(N log(log(N))), which dominates the runtime - but is better than O(N log(N))!
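Putting the two snippets above together as one function (a sketch; the function name is mine) and running it on the example list from the question:
def sort_pairs_bucketed(L):
    # Bucket second elements by their first element, then sort keys and each bucket.
    d = {}
    for x, y in L:
        d.setdefault(x, []).append(y)
    return [(x, y) for x in sorted(d) for y in sorted(d[x])]

print(sort_pairs_bucketed([(1, 1), (6, 3), (3, 4), (3, 1), (2, 8)]))
# [(1, 1), (2, 8), (3, 1), (3, 4), (6, 3)]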

There is a constant factor hidden in the N log(N) complexity that depends on how the computer compares two elements before deciding their order. For example, if the sort compares strings character by character, the N log(N) becomes
N log(N) * (maximum string length)

Related

Is this sub-quadratic time?

After analyzing an algorithm I was working on, this is the running time:
N        Time (s)
1000     0.019123315811157227
10000    0.11949563026428223
100000   1.4074015617370605
1000000  16.07071304321289
The algorithm simply returns the common points in two 2D arrays ('a' and 'b').
This is the code that was used:
import time

def common_points(a, b):
    start = time.time()
    specialSuperSorting(a)  # insertion sort - ~1/4 N^2
    specialSuperSorting(b)  # insertion sort - ~1/4 N^2
    common = []
    for i in range(len(a)):
        x = a[i]
        # specialBinarySearch: returns True if coordinates (x, y) are found, False if not
        y = specialBinarySearch(b, x)
        if y:
            common.append(x)
    end = time.time()
    print(len(a), ' ', end - start)
    return common
I know I could have used a faster sorting algorithm, but I just took the easier path to save time, as this was just an exercise.
So is this sub-quadratic time? And how can I decide that based only on the table of N against T(N), without having the algorithm itself?
The runtime of an algorithm will be dominated by its slowest component. Insertion sort is already O(n^2) or quadratic, so your algorithm will be at least this slow. Assuming specialBinarySearch is O(log n), and you run it n times, this part of the algorithm will be O(n log n).
In summary, your algorithm runs in O(1/4 n^2 + 1/4 n^2 + n log n) = O(n^2). It's quadratic. The 1/4 doesn't change that. You can see this trend in your data, which tends upwards much faster than linear or n log n, if you were to plot it in a graph.
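To address the "from the table of N against T(N) only" part: one rough heuristic is to estimate the exponent k in T(N) ~ c * N^k from pairs of measurements (k near 1 suggests linear, near 2 quadratic, while N log N shows up as slightly above 1). This is only a sketch and only a heuristic (the function name is mine): measured timings reflect the actual inputs rather than the worst case (insertion sort, for example, runs in near-linear time on nearly-sorted input), and small-N timings are easily distorted by constant factors.
import math

def growth_exponent(n1, t1, n2, t2):
    # Fit T(N) ~ c * N^k through two measurements and return k.
    return math.log(t2 / t1) / math.log(n2 / n1)

# Apply this to consecutive rows of the timing table and compare the resulting
# exponents against 1 (linear), roughly 1.1 (N log N) and 2 (quadratic).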

Verifying complexity of sorting N integers, M at a time

I am asking about the complexity of only the first part (the sort part) of a bigger problem called external sort.
N - number of integers (too many to fit in memory at once)
M - number of integers that can be sorted in memory using merge sort.
Complexity of merge sort:
O (M log M)
But we need to sort a total of N elements.
Thus, we need to sort N / M chunks in total,
giving
O((N / M) * M log M)
which finally derives to
O(N log M)
Is this the correct complexity? If not, please correct my calculations.
Yes, this is the correct complexity for the first stage of sorting N integers M at a time. You could express the same number differently if you say that the number of M-sized sets is k. Then you could say it's
O(N*Log(N/k))
If you want to end up with all N elements being sorted this is insufficient. You split your total set of N numbers into N/M subsets of M numbers each. Then you sort each subset. As you correctly found out, this has a complexity of O(N log M), and if your goal is to end up with a couple of sorted subsets you are done and everything is fine. However, at this point you don't have the entire set of N sorted, you just have a couple of sorted subsets and still some work to do. What's still missing is the operation of merging those subsets together. Assuming your merge operations always merge two subsets into one, you still have Log2( N/M ) merge operations to do, each with a complexity O(N). So the final complexity is O(N log M + N Log(N/M) ) = O( N Log N ). As it's supposed to be.
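A minimal in-memory sketch of the two phases (the function name is mine; in a real external sort the sorted chunks would be written to disk and merged from disk, but the complexity accounting is the same): sorting N/M chunks of size M costs O((N/M) * M log M) = O(N log M), and the k-way merge of the chunks costs O(N log(N/M)).
import heapq

def external_sort_sketch(nums, M):
    # Phase 1: sort each chunk of at most M elements - O(N log M) in total.
    chunks = [sorted(nums[i:i + M]) for i in range(0, len(nums), M)]
    # Phase 2: k-way merge of the N/M sorted chunks - O(N log(N/M)).
    return list(heapq.merge(*chunks))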

How to find the first m smallest integers in an array in O(n)?

When the array has length of n and 1 <= m <= n^0.5
I think you can use a selection algorithm to find the mth smallest integer (there is a complicated one called BFPRT in http://en.wikipedia.org/wiki/Selection_algorithm that is O(n)) and then use that as a pivot to partition the array to get the first m smallest integers.
But, is there a way to do this using a data structure such as a min-heap? And how can I know if it's O(n)?
You can create a min-heap in linear time. Then you just need to remove the minimum element m times with cost log(n) for each removal. That's O(n) + m*O(log(n)) which is O(n) + O(sqrt(n)*log(n)) which is O(n).
Edit: I originally said O(n) + O(sqrt(n)*log(n)) is O(sqrt(n)*log(n)), which is wrong: sqrt(n)*log(n) is o(n), so the O(n) term dominates and the total is O(n), not O(sqrt(n)*log(n)).
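A sketch of that approach in Python (the function name is mine): heapq.heapify builds the min-heap in O(n), and each heappop costs O(log n).
import heapq

def m_smallest(a, m):
    heap = list(a)       # copy so the input array is left untouched
    heapq.heapify(heap)  # build a min-heap in O(n)
    return [heapq.heappop(heap) for _ in range(m)]  # m pops, O(m log n)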
Simply use radix sort to sort the array in O(n) time.
The Build_Heap(A) method can create a min-heap or max-heap from a random array in O(n) time. If we create a min-heap, it then takes O(1) time to get the smallest element,
so the total time to get the smallest element is O(n) + O(1),
that is, O(n).

What is the fastest algorithm to find an element with highest frequency in an array

I have two input arrays X and Y. I want to return that element of array X which occurs with highest frequency in array Y.
The naive way of doing this requires that for each element x of array X, I linearly search array Y for its number of occurrences and then return that element x which has highest frequency. Here is the pseudo algorithm:
max_frequency = 0
max_x = -1 // -1 indicates no element found
For each x in X
    frequency = 0
    For each y in Y
        if y == x
            frequency++
    End For
    If frequency > max_frequency
        max_frequency = frequency
        max_x = x
    End If
End For
return max_x
As there are two nested loops, time complexity for this algorithm would be O(n^2). Can I do this in O(nlogn) or faster ?
Use a hash table mapping keys to counts. For each element in the array, do like counts[element] = counts[element] + 1 or your language's equivalent.
At the end, loop through the mappings in the hash table and find the max.
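A compact way to do this in Python, using collections.Counter as the hash table (the function name is mine; elements of X that never occur in Y simply get count 0):
from collections import Counter

def most_frequent_in_y(X, Y):
    counts = Counter(Y)                     # O(len(Y)): frequency of each y
    return max(X, key=lambda x: counts[x])  # O(len(X)): x with the highest count in Y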
Alternatively, if you can have additional data structures, you walk the array Y, updating the frequency of each number in a hash table. This takes O(N(Y)) time. Then walk X, finding which element of X has the highest frequency. This takes O(N(X)) time. Overall: linear time, and since you have to look at each element of both X and Y at least once in any implementation (EDIT: this is not strictly speaking true in all cases/all implementations, as jwpat7 points out, though it is true in the worst case), you can't do it any faster than that.
The time complexity of common algorithms is listed below:
Algorithm     | Best      | Worst     | Average
--------------+-----------+-----------+----------
MergeSort     | O(n lg n) | O(n lg n) | O(n lg n)
InsertionSort | O(n)      | O(n^2)    | O(n^2)
QuickSort     | O(n lg n) | O(n^2)    | O(n lg n)
HeapSort      | O(n lg n) | O(n lg n) | O(n lg n)
BinarySearch  | O(1)      | O(lg n)   | O(lg n)
In general, when traversing a list to check a certain criterion, you really can't do any better than linear time. If you are required to sort the array, I would say stick with Mergesort (very dependable) to find the element with the highest frequency in an array.
Note: This is under the assumption that you want to use a sorting algorithm. Otherwise, if you are allowed to use any data structure, I would go with a hashmap/hashtable type structure with constant lookup time. That way, you just match keys and update the frequency key-value pair. Hope this helps.
1st step: Sort both X and Y. Assuming their corresponding lengths are m and n, the complexity of this step is O(n log n) + O(m log m).
2nd step: Count each Xi in Y and track the maximum count so far. Searching for Xi in sorted Y is O(log n), so the total 2nd-step complexity is O(m log n).
Total complexity: O(n log n) + O(m log m) + O(m log n), or simplified: O(k log k), where k = max(n, m).
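A sketch of this sorting-plus-binary-search approach (names are mine; here only Y is sorted, and the occurrences of each Xi in sorted Y are counted with two bisect calls rather than a linear scan):
from bisect import bisect_left, bisect_right

def most_frequent_via_sorted_y(X, Y):
    Y_sorted = sorted(Y)                 # O(n log n)
    best_x, best_count = None, -1
    for x in X:                          # m binary searches: O(m log n)
        count = bisect_right(Y_sorted, x) - bisect_left(Y_sorted, x)
        if count > best_count:
            best_x, best_count = x, count
    return best_x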
Merge sort, based on the divide-and-conquer approach, gives you O(n log n) complexity.
Your suggested approach will be O(n^2) if both lists are length n. What's more likely is that the lists can be different lengths, so the time complexity could be expressed as O(mn).
You can separate your problem into two phases:
1. Order the unique elements from Y by their frequency
2. Find the first item from this list that exists in X
As this sounds like a homework question I'll let you think about how fast you can make these individual steps. The sum of these costs will give you the overall cost of the algorithm. There are many approaches that will be cheaper than the product of the two list lengths that you currently have.
Sort X and Y. Then do a merge-style pass over both, counting the frequency in Y each time you encounter a matching element of X.
So the complexity is O(n log n) + O(m log m) + O(m + n) = O(k log k), where n, m = lengths of X, Y and k = max(m, n).
You could do a quicksort and then traverse the result with a variable that counts how many of a number are in a row, plus what that number is. That should give you n log n.

Prove 3-Way Quicksort Big-O Bound

For 3-way Quicksort (dual-pivot quicksort), how would I go about finding the Big-O bound? Could anyone show me how to derive it?
There's a subtle difference between finding the complexity of an algorithm and proving it.
To find the complexity of this algorithm, you can do as amit said in the other answer: you know that on average you split your problem of size n into three smaller problems of size n/3, so you will reach problems of size 1 in log_3(n) steps on average. With experience, you will start getting a feeling for this approach and be able to deduce the complexity of algorithms just by thinking about them in terms of the subproblems involved.
To prove that this algorithm runs in O(nlogn) in the average case, you use the Master Theorem. To use it, you have to write the recursion formula giving the time spent sorting your array. As we said, sorting an array of size n can be decomposed into sorting three arrays of sizes n/3 plus the time spent building them. This can be written as follows:
T(n) = 3T(n/3) + f(n)
Where T(n) is a function giving the resolution "time" for an input of size n (actually the number of elementary operations needed), and f(n) gives the "time" needed to split the problem into subproblems.
For 3-way quicksort, f(n) = c*n because you go through the array, check where to place each item and eventually make a swap. This places us in Case 2 of the Master Theorem, which states that if f(n) = Θ(n^(log_b(a)) * log^k(n)) for some k >= 0 (in our case k = 0), then
T(n) = Θ(n^(log_b(a)) * log^(k+1)(n))
As a = 3 and b = 3 (we get these from the recurrence relation, T(n) = aT(n/b)), this simplifies to
T(n) = O(n log n)
And that's a proof.
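For reference, here is a minimal Python sketch of a dual-pivot quicksort that splits each subarray into three parts, matching the T(n) = 3T(n/3) + cn model in the balanced case. This particular partition scheme is just one common variant, chosen for illustration, not necessarily the exact one your course or library uses.
def dual_pivot_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]              # the two pivots, p <= q
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p:                 # belongs in the left part
            a[i], a[lt] = a[lt], a[i]
            lt += 1
            i += 1
        elif a[i] > q:               # belongs in the right part
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:                        # stays in the middle part
            i += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]      # move the pivots into place
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)

data = [5, 3, 8, 1, 9, 2]
dual_pivot_quicksort(data)
print(data)  # [1, 2, 3, 5, 8, 9]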
Well, the same proof actually holds.
Each iteration splits the array into 3 sublists; on average, each of these sublists has size n/3.
Thus the number of iterations needed is log_3(n), because you need to find how many times you can do (((n/3)/3)/3)... until you get down to one. This gives you the formula:
n/(3^i) = 1
Which is satisfied for i = log_3(n).
Each iteration is still going over all the input (but in a different sublist) - same as quicksort, which gives you O(n*log_3(n)).
Since log_3(n) = log(n)/log(3) = log(n) * CONSTANT, you get that the run time is O(nlogn) on average.
Note that even if you take a more pessimistic approach to the sizes of the sublists, by taking the minimum and maximum of a uniform distribution - giving a first sublist of size n/4, a second of size n/2, and a last of size n/4 - the recursion still decays in log_k(n) iterations (with a different constant k > 1), which again yields O(nlogn) overall.
Formally, the proof will be something like:
Each iteration takes at most c_1 * n ops to run, for each n > N_1, for some constants c_1, N_1. (Definition of big O notation, and the claim that each iteration is O(n) excluding recursion. Convince yourself why this is true. Note that here, "iteration" means all the work done by the algorithm at a certain "level" of the recursion, not a single recursive invocation.)
As seen above, you have log_3(n) = log(n)/log(3) iterations in the average case (taking the optimistic version here; the same principles can be used for the pessimistic one).
Now, we get that the running time T(n) of the algorithm is:
for each n > N_1:
T(n) <= c_1 * n * log(n)/log(3)
T(n) <= c_1 * nlogn
By definition of big O notation, it means T(n) is in O(nlogn) with M = c_1 and N = N_1.
QED
