I have an array of n integers that can only take log n distinct values (though the values themselves can be arbitrary). For example, in S = [349, 12, 12, 283, 349, 283, 283, 12], there are only 3 different numbers (log 8 = 3).
I have to sort this array in less than O(nlogn) time. Which algorithm should I use? Maybe Radix Sort with Counting Sort? What about its analysis?
Since it is given that there will be only log(n) unique elements, you can get the sorted list in O(n) time using the algorithm below:
Build a mapping of the distinct items in the list to their counts (basically a dictionary/hashmap of key → count).
This is a single pass over the input list, so O(n) steps.
Sort this list of (key, count) tuples, which has size log(n), by key.
Say we use merge sort; its time complexity is O(k log k), where k is the size of its input.
Substituting k = log(n), the complexity of this step is O(log(n) * log(log(n))).
Since O(log(n) * log(log(n))) < O(n), the overall complexity up to this step is O(n) + O(log(n) * log(log(n))), which is still O(n).
Iterate over the sorted keys and generate the sorted list with a single loop, repeating each value according to its count. This takes at most O(n) iterations.
So overall, the algorithm above runs in O(n) time.
Count the number of each element in the list (using a hashtable) (time: O(n)).
De-duplicate the list (time: O(n)).
Sort the now-de-duped items (time: O(log n * log log n)).
Build a list with the right number of copies of each item (time: O(n)).
Overall, this is O(n) and easy to implement.
Here's some python that implements this:
import collections

def duplicates_sort(xs):
    keys = collections.Counter(xs)
    result = []
    for k in sorted(keys):
        result.extend([k] * keys[k])
    return result
Radix Sort complexity is O(dn) with d as the number of digits in a number.
The algorithm runs in linear time only when d is constant! In your case d = 3 log(n), so your algorithm will run in O(n log n).
I'm honestly not sure how to solve this problem in linear time. I'm wondering if there is some other piece of information missing about the nature of the numbers...
Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
import std.algorithm : swap;

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    // First pass: move 'A's to the front and 'T's to the back.
    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    // Second pass: within the middle section, move 'C's before 'G's.
    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }

    // Recurse on the four buckets using the next base position.
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit: I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
Let's look at an example with two-digit decimal numbers:
49, 25, 19, 27, 87, 67, 22, 90, 47, 91
Sorting by the first digit yields
19, 25, 27, 22, 49, 47, 67, 87, 90, 91
Next, you sort by the second digit, yielding
90, 91, 22, 25, 27, 47, 67, 87, 19, 49
Seems wrong, doesn't it? Or isn't this what you are doing? Maybe you can show us the code if I got you wrong.
If you are doing the second bucket sort on all groups with the same first digit(s), your algorithm would be equivalent to the recursive version. It would be stable as well. The only difference is that you'd do the bucket sorts breadth-first instead of depth-first.
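For reference, the standard LSD (least-significant-digit-first) order with a stable per-digit bucket sort does produce the correct result on this example. Here is a minimal Python sketch (my own illustration, not from the original discussion):

def lsd_radix_sort(nums, digits=2):
    # Stable bucket sort per digit, least significant digit first.
    for d in range(digits):
        buckets = [[] for _ in range(10)]
        for x in nums:
            buckets[(x // 10**d) % 10].append(x)
        nums = [x for bucket in buckets for x in bucket]
    return nums

# lsd_radix_sort([49, 25, 19, 27, 87, 67, 22, 90, 47, 91])
# -> [19, 22, 25, 27, 47, 49, 67, 87, 90, 91]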
UPDATE
Check this answer : sort O(nlogn)
I know that if a loop variable grows by being raised to a constant power k on each iteration, the loop has a complexity of O(log(log n)). But I couldn't wrap my head around analyzing this function.
void foo(int n) {
    int a = 0;
    for(int i = 2; i <= n; i++) {
        if(i % 2 == 0) {
            a++;
        } else {
            i = (i - 1) * i;
        }
    }
}
I found a sequence that seems to fit the problem at hand.
Let's look at the values that i takes on over the iterations of the loop.
Before the first iteration, i = 2.
Before the second iteration, we have i = 3, since we skip the “else” branch (2 is even) and then add one.
Before the third iteration, we have i = 7, since we compute i = 3 * (3 - 1) and then add one.
Before the fourth iteration, we have i = 43, since we compute i = 7 * (7 - 1) and then add one.
Before the fifth iteration, we have i = 1807, since we compute i = 43 * (43 - 1) and then add one.
In what I can only describe as a magical coincidence, I happen to recognize the sequence of numbers 2, 3, 7, 43, 1807 as the start of Sylvester's sequence, which I just learned about a few weeks ago. Checking Wikipedia confirms that, indeed, Sylvester's sequence is given by the recurrence relation
S(0) = 2
S(n+1) = S(n) · (S(n) - 1) + 1.
It's known that this sequence grows doubly-exponentially, with S(n) approximately equal to E^(2^n) for E = 1.264084735... . So this indeed means that the number of iterations of this algorithm as a function of n will be O(log log n), since i grows doubly-exponentially quickly.
But let's say that you didn't have the dumb luck of having found this exact sequence on a random Wikipedia search. Could you still show that the runtime is Θ(log log n)? The answer is yes, and here's one way to do it.
The first observation we can make is that this procedure causes i to grow more slowly than if we instead set i = i^2. You can check this by comparing the corresponding terms of the two sequences:
i = i(i - 1) + 1:  2, 3, 7, 43, ...
i = i^2:           2, 4, 16, 256, ...
If we were to grow i at this faster rate, then after k iterations of the loop we'd have i = 2^(2^k). (Notice that 2, 4, 16, 256, ... = 2^(2^0), 2^(2^1), 2^(2^2), ...). We stop once 2^(2^k) exceeds n, which happens after k = Θ(log log n) iterations. This provides a lower bound on how many steps this algorithm will take.
To get an upper bound, imagine that we instead initialized i = 1.2 and then updated i = i^2 per iteration. Then we'd get these numbers:
i = i(i - 1) + 1:  2, 3, 7, 43, ...
i = i^2:           1.2, 1.44, 2.07, 4.30, 18.5, ...
You can prove, if you'd like, that the sequence you've come up with indeed continues to outpace this smaller sequence. And that smaller sequence has the nice property that the value of i on iteration k is (1.2)^(2^k), which means that we (again) stop after Θ(log log n) iterations.
In other words, our sequence is sandwiched between two other sequences, each of which grows doubly-exponentially, which means that this sequence grows doubly-exponentially as well. Therefore, this loop stops after Θ(log log n) steps.
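As a quick sanity check (my own sketch, not part of the original analysis), you can count the loop's iterations for growing n and watch the count rise by roughly one each time n is squared:

def iterations(n):
    # Mirror the loop in foo(): count how many times the body runs.
    count, i = 0, 2
    while i <= n:
        count += 1
        if i % 2 == 0:
            i += 1                  # "if" branch, then the loop's i++
        else:
            i = (i - 1) * i + 1     # "else" branch, then the loop's i++
    return count

for n in [10, 10**3, 10**6, 10**12, 10**24]:
    print(n, iterations(n))        # counts grow like log log n: 3, 4, 5, 6, 7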
Hope this helps!
I had a job interview a few weeks ago and I was asked to design a divide and conquer algorithm. I could not solve the problem, but they just called me for a second interview! Here is the question:
We are given as input two n-element arrays A[0..n − 1] and B[0..n − 1] (which are not necessarily sorted) of integers, and an integer value. Give an O(n log n) divide and conquer algorithm that determines if there exist distinct values i, j (that is, i != j) such that A[i] + B[j] = value. Your algorithm should return True if such i, j exist, and return False otherwise. You may assume that the elements in A are distinct, and the elements in B are distinct.
can anybody solve the problem? Thanks
My approach is:
Sort one of the arrays, say A, using Merge Sort, which is a Divide and Conquer algorithm.
Then, for each element of B, binary search array A for (Required Value - Element of B). Binary search is again a Divide and Conquer algorithm.
If you find the element (Required Value - Element of B) in array A, then the two elements form a pair such that Element of A + Element of B = Required Value.
For the time complexity: A has N elements, so Merge Sort takes O(N log N), and we do a binary search for each of the N elements of B, which also takes O(N log N). So the total time complexity is O(N log N).
As you mentioned, you need to check i != j when A[i] + B[j] = value. For that, you can keep a 2D array of size N * 2, pairing each element with its original index as the second element of each row, and sort by the first element. Then, when you find a matching element, compare the original indexes of the two elements and return the result accordingly.
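Here is a minimal Python sketch of this sort-plus-binary-search approach (the function name has_pair and the use of the bisect module are my own choices for illustration):

import bisect

def has_pair(A, B, value):
    # Sort A together with the original indices: O(n log n).
    pairs = sorted((a, idx) for idx, a in enumerate(A))
    keys = [a for a, _ in pairs]
    for j, b in enumerate(B):
        need = value - b
        pos = bisect.bisect_left(keys, need)   # binary search: O(log n)
        # Elements of A are distinct, so there is at most one candidate.
        if pos < len(keys) and keys[pos] == need and pairs[pos][1] != j:
            return True
    return False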
The following algorithm does not use Divide and Conquer, but it is one of the solutions.
You need to sort both arrays while maintaining the indexes of the elements, e.g. by sorting an array of pairs (elem, index). This takes O(n log n) time.
Then you can apply the merge algorithm to check whether there are two elements such that A[i] + B[j] = value. This takes O(n).
The overall time complexity will be O(n log n).
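A possible Python sketch of that merge-style scan (my own illustration; the index bookkeeping handles the i != j requirement and relies on the elements of each array being distinct):

def has_pair_merge(A, B, value):
    a = sorted((x, i) for i, x in enumerate(A))   # (value, original index)
    b = sorted((x, j) for j, x in enumerate(B))
    lo, hi = 0, len(b) - 1
    while lo < len(a) and hi >= 0:
        s = a[lo][0] + b[hi][0]
        if s < value:
            lo += 1
        elif s > value:
            hi -= 1
        elif a[lo][1] != b[hi][1]:
            return True
        else:
            # Same original index: since all values are distinct, neither
            # a[lo] nor b[hi] can be part of any other matching pair.
            lo += 1
            hi -= 1
    return False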
I suggest using hashing. Even if it's not the way you are supposed to solve the problem, it's worth mentioning, since hashing has a better time complexity, O(n) vs. O(n log n), and is therefore more efficient.
Turn A into a hash set (or a dictionary if we want the index i) - O(n)
Scan B and check if value - B[j] is in the hash set (dictionary) - O(n)
So you have an O(n) + O(n) = O(n) algorithm (which is better than the required O(n log n); however, the solution is NOT Divide and Conquer):
Sample C# implementation
// requires: using System; using System.Linq;
int[] A = new int[] { 7, 9, 5, 3, 47, 89, 1 };
int[] B = new int[] { 5, 7, 3, 4, 21, 59, 0 };

int value = 106; // 47 + 59 = A[4] + B[5]

// Turn A into a dictionary: key = item's value; value = item's index
var dict = A
    .Select((val, index) => new { v = val, i = index })
    .ToDictionary(item => item.v, item => item.i);

int i = -1;
int j = -1;

// Scan B array
for (int k = 0; k < B.Length; ++k) {
    if (dict.TryGetValue(value - B[k], out i) && i != k) { // require distinct indices (i != j)
        // Solution found: {i, j}
        j = k;

        // if you want any solution then break
        break;

        // scan further (comment out "break") if you want all pairs
    }
}

Console.Write(j >= 0 ? $"{i} {j}" : "No solution");
Seems hard to achieve without sorting.
If you leave the arrays unsorted, checking for existence of A[i]+B[j] = Value takes time Ω(n) for fixed i, then checking for all i takes Θ(n²), unless you find a trick to put some order in B.
Balanced Divide & Conquer on the unsorted arrays doesn't seem any better: if you divide A and B in two halves, the solution can lie in one of Al/Bl, Al/Br, Ar/Bl, Ar/Br and this yields a recurrence T(n) = 4 T(n/2), which has a quadratic solution.
If sorting is allowed, the solution by Sanket Makani is a possibility, but you can do better in terms of time complexity for the search phase.
Indeed, assume A and B are now sorted and consider the 2D function A[i]+B[j], which is monotonic in both directions i and j. Then the domain A[i]+B[j] ≤ Value is bounded by a monotonic curve j = f(i), or equivalently i = g(j). But strict equality A[i]+B[j] = Value must be checked exhaustively for all points of the curve, and one cannot avoid evaluating f everywhere in the worst case.
Starting from i = 0, you obtain f(0) by dichotomic search. Then you can follow the border curve incrementally. You will perform n steps in the i direction and at most n steps in the j direction, so the complexity of the search phase remains bounded by O(n), which is optimal.
[Figure: the regions where the sum is below vs. above the target value; there are two matches.]
This optimal solution has little to do with Divide & Conquer. It may be possible to design a variant based on evaluating the sum at a central point, which allows a whole quadrant to be discarded, but that would be pretty artificial.
I have a list A whose elements are sorted from smallest to biggest. For example:
A = 1, 5, 9, 11, 14, 20, 46, 99
I want to insert the elements of a non-sorted list of numbers into A while keeping A sorted. For example:
B = 0, 77, 88, 10, 4
is going to be inserted in A as follows:
A = 0, 1, 4, 5, 9, 10, 11, 14, 20, 46, 77, 88, 99
What is the best possible solution to this problem?
“Best possible” is too subjective; it depends on the definition of best. From a big-O point of view, with an array A of length n1 and an array B of length n2, you can achieve it in O(max(n2 * log(n2), n1 + n2)).
This can be achieved by sorting array B in O(n2 log n2) and then merging the two sorted arrays in O(n1 + n2).
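A minimal Python sketch of the sort-then-merge approach (the function name merge_into_sorted is my own):

def merge_into_sorted(A, B):
    B = sorted(B)                        # O(n2 log n2)
    out, i, j = [], 0, 0
    while i < len(A) and j < len(B):     # single merge pass: O(n1 + n2)
        if A[i] <= B[j]:
            out.append(A[i]); i += 1
        else:
            out.append(B[j]); j += 1
    out.extend(A[i:])
    out.extend(B[j:])
    return out

# merge_into_sorted([1, 5, 9, 11, 14, 20, 46, 99], [0, 77, 88, 10, 4])
# -> [0, 1, 4, 5, 9, 10, 11, 14, 20, 46, 77, 88, 99]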
The best solution depends on how you define “best”.
Even for time complexity alone, it still depends on the input sizes of A and B. Assume the size of A is m and the size of B is n.
As Salvador mentioned, sorting B in O(n log n) and merging with A in O(m + n) is a good solution. Notice that you can sort B in O(n) if you adopt a non-comparison-based sorting algorithm such as counting sort or radix sort.
I provide another solution here: loop over every element of B and do a binary search in A to find the insertion position, then insert. The time complexity is O(n log(m + n)).
Edit 1: As #moreON points out, the binary-search-and-insert approach assumes your list implementation supports at least amortized O(1) insert and random access. I also found that the time complexity should be O(n log(m + n)) rather than O(n log m), since the binary search takes longer as more elements are added.
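For illustration, a short Python sketch of the binary-search-and-insert idea (note that Python's list.insert is O(length) per call, so a plain list does not actually give O(n log(m + n)); the code only shows the approach):

import bisect

def insert_all(A, B):
    A = list(A)
    for x in B:
        bisect.insort(A, x)   # binary search for the position, then insert
    return A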
Given an array as input, find the output array whose i-th element is the median of the subarray A[0..i] (for i = 0, 1, ..., array.length - 1).
So basically, given input array A[], produce output array B[], where B[i] is the median of A[0] ... A[i].
I am thinking about using dynamic programming to store the two numbers before and after the median of each subarray, but it somehow gets complicated. Is there an easier solution?
Basically what we need here is a data structure that supports two operations: adding an arbitrary element and finding the median of all elements added to it so far.
The most conceptually simple solution is a balanced binary search tree that stores the sizes of all subtrees (the add operation just adds an element to the tree, and finding the median is a single traversal from the root, because knowing the subtree sizes tells us where to go at each node). But this might be a little tedious to implement (binary search trees from standard libraries usually don't support the "get k-th element" operation efficiently).
Here is another solution (it is O(N log N), too) which uses two heaps. It is easier implementation-wise because a priority queue from a standard library works fine.
Let's maintain two heaps: low (a max-heap) and high (a min-heap). The invariants are: every element of low is less than or equal to every element of high, and their sizes differ by at most one.
Initially, they are empty.
When we add a new number, we do the following: if it is less than or equal to the largest element in low (or low is empty), we add it to low; otherwise, we add it to high. It is easy to see that the first invariant still holds. How do we maintain the second invariant? If the sizes differ by 2 after the insertion, we just pop the top element from the larger heap and push it into the other one. Now their sizes differ by at most one. Thus, we can restore both invariants in O(log N) time whenever we add a new element.
These two invariants imply the following property: if size(low) != size(high), the median is the top element of the larger heap. If their sizes are equal, the median is the top of one of them (which one exactly depends on how you define the median of an array with an even number of elements).
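A compact Python sketch of this two-heap scheme using heapq (the max-heap is simulated by negating values; when the sizes are equal this sketch takes the top of low, one of the two acceptable conventions mentioned above):

import heapq

def running_medians(A):
    low, high = [], []          # low: max-heap (negated), high: min-heap
    result = []
    for x in A:
        if not low or x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)
        # Restore the size invariant: |size(low) - size(high)| <= 1.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low) + 1:
            heapq.heappush(low, -heapq.heappop(high))
        # Median: top of the larger heap (ties go to low here).
        result.append(-low[0] if len(low) >= len(high) else high[0])
    return result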
Did I misunderstand the question or what? Why use heaps and queues?
The median of a set of numbers is the value in the middle of the sorted set.
e.g.,
{1, 2, 3} median is 2
{1, 2, 3, 4} median is (2+3) / 2 = 2 (with integer division; 2.5 otherwise)
Assume the array is sorted (if not, just sort the array, which is O(n lg n))
Time: O(n)
Space: O(1)
int[] output = new int[input.length];

for (int i = 0; i < input.length; i++) {
    if (i % 2 == 1) {
        int midPoint = i / 2;
        output[i] = (input[midPoint] + input[midPoint + 1]) / 2;
    } else {
        output[i] = input[(i + 1) / 2];
    }
}

return output;
Test
input {24, 29, 33, 40, 40, 42, 45, 47, 48, 49}
output {24, 26, 29, 31, 33, 36, 40, 40, 40, 41}
input {12, 14, 22, 30, 33, 38, 39, 41, 43, 45}
output {12, 13, 14, 18, 22, 26, 30, 31, 33, 35}
You can solve the problem in O(n log n) by using a binary search tree and augmenting it to be able to find the k-th element in O(log n) time, as described in my answer here.
For each element at index i in your array, first insert it, then query:
insert_into_bst(A[i])
B[i] = find_k_in_bst(i / 2)
(Here find_k_in_bst is 0-indexed, so i / 2 is the rank of the lower median of A[0..i].)
Make sure to use a balanced search tree.
If you have access to a library heap, then the heap solution described above would be the easiest. The easiest to implement yourself for this particular problem (in my opinion) is actually a segment tree: each node tells you how many elements you have inserted in its associated interval. You can use these update and query methods:
Update: when inserting a value x, go down to the leaf node associated with x. Increment all counts down to it.
Query: use a similar algorithm to the one I linked to for k-th element: when at a node, if count(left_child) == k - 1, then you have your answer: it must be the first element in the interval associated with the right node, and so on.
Note that this solution is O(n log V) where V is the max value in your array. To get O(n log n) you must scale the array to [1, n]: 1 100 1000 => 1 2 3. You can use a sort to do that.
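As an illustration of the counting idea, here is a sketch of my own that uses a Fenwick (binary indexed) tree over the ranks of the values instead of an explicit segment tree; the coordinate compression plays the role of the scaling step mentioned above:

def running_medians_bit(A):
    vals = sorted(set(A))
    rank = {v: r + 1 for r, v in enumerate(vals)}   # ranks are 1..n
    n = len(vals)
    tree = [0] * (n + 1)

    def add(pos):                     # count one more element of this rank
        while pos <= n:
            tree[pos] += 1
            pos += pos & -pos

    def kth(k):                       # k-th smallest inserted value (1-indexed)
        pos, bit = 0, 1 << (n.bit_length() - 1)
        while bit:
            if pos + bit <= n and tree[pos + bit] < k:
                pos += bit
                k -= tree[pos]
            bit >>= 1
        return vals[pos]              # rank pos + 1 maps to vals[pos]

    result = []
    for i, x in enumerate(A):
        add(rank[x])
        result.append(kth(i // 2 + 1))   # lower median of A[0..i]
    return result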
What would be the best-case and worst-case complexity in Big-Theta (Θ) notation of the selection sort algorithm when the array grows by repeatedly appending a 19?
For instance:
[ 19, 13, 7, 19, 12, 16, 19 ],
[ 19, 13, 7, 19, 12, 16, 19, 19 ],
[ 19, 13, 7, 19, 12, 16, 19, 19, 19 ]
Et cetera. n is used to represent the length of the array.
So we're adding the same number to the end of the array, but this number also happens to be the largest number, so it would stay at the end of the array. Does this mean that it really doesn't have any effect on the efficiency? I'm very confused.
For ascending order:
Find the minimum value in the list
Swap it with the value in the first position
Repeat the steps above for the remainder of the list (starting at
the second position and advancing each time)
So, rain or shine, we have to examine all of the remaining elements to find the minimum, checking every one up to the last element, even if equal elements are present.
A - an array containing the list of numbers
numItems - the number of numbers in the list

for i = 0 to numItems - 2
    minIndex = i
    for j = i + 1 to numItems - 1
        if A[j] < A[minIndex]
            minIndex = j
        End If
    Next j
    if minIndex != i
        // Swap the entries
        Temp = A[i]
        A[i] = A[minIndex]
        A[minIndex] = Temp
    End If
Next i
As you can see in the pseudo-code above, both loops run to the end of their ranges unconditionally, so the number of comparisons is Θ(n²) in every case. The number of swaps, however, is Θ(n) in the worst and average cases and Θ(1) in the best case (when the minimum is already in place, no swap is performed).
No, it has no effect on the efficiency.
In selection sort, the performance is N^2 because of the nested loops. Even though the elements at the end do not need to be swapped, the loops will still compare them.
So, to answer your exact question: the best-case, worst-case, and average-case performance are all Θ(N²); since the appended 19s have no effect on efficiency, there is no change in the performance.
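A small Python sketch (my own illustration) that counts the comparisons selection sort makes on the growing arrays from the question; the count is always n * (n - 1) / 2, no matter how many 19s are appended:

def selection_sort_comparisons(a):
    a = list(a)
    comparisons = 0
    for i in range(len(a) - 1):
        min_idx = i
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[min_idx]:
                min_idx = j
        if min_idx != i:
            a[i], a[min_idx] = a[min_idx], a[i]
    return comparisons

base = [19, 13, 7, 19, 12, 16, 19]
for extra in range(3):
    arr = base + [19] * extra
    n = len(arr)
    print(n, selection_sort_comparisons(arr), n * (n - 1) // 2)
# prints: 7 21 21 / 8 28 28 / 9 36 36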