algorithm - Sort an array with LogLogN distinct elements

algorithm - Sort an array with LogLogN distinct elements - algorithm

This is not my school home work. This is my own home work and I am self-learning algorithms.
In Algorithm Design Manual, there is such an excise
4-25 Assume that the array A[1..n] only has numbers from {1, . . . , n^2} but that at most log log n of these numbers ever appear. Devise an algorithm that sorts A in substantially less than O(n log n).
I have two approaches:
The first approach:
Basically I want to do counting sort for this problem. I can first scan the whole array (O(N)) and put all distinct numbers into a loglogN size array (int[] K).
Then apply counting sort. However, when setting up the counting array (int[] C), I don't need to set its size as N^2, instead, I set the size as loglogN too.
But in this way, when counting the frequencies of each distinct number, I have to scan array K to get that element's index (O(NloglogN) and then update array C.
The second approach:
Again, I have to scan the whole array to get a distinct number array K with size loglogN.
Then I just do a kind of quicksort like, but the partition is based on median of K array (i.e., each time the pivot is an element of K array), recursively.
I think this approach will be best, with O(NlogloglogN).
Am I right? or there are better solutions?
Similar excises exist in Algorithm Design Manual, such as
4-22 Show that n positive integers in the range 1 to k can be sorted in O(n log k) time. The interesting case is when k << n.
4-23 We seek to sort a sequence S of n integers with many duplications, such that the number of distinct integers in S is O(log n). Give an O(n log log n) worst-case time algorithm to sort such sequences.
But basically for all these excises, my intuitive was always thinking of counting sort as we can know the range of the elements and the range is short enough comparing to the length of the whole array. But after more deeply thinking, I guess what the excises are looking for is the 2nd approach, right?
Thanks

We can just create a hash map storing each element as key and its frequency as value.
Sort this map in log(n)*log(log(n)) time i.e (klogk) using any sorting algorithm.
Now scan the hash map and add elements to the new array frequency number of times. Like so:
total time = 2n+log(n)*log(log(n)) = O(n)

Counting sort is one of possible ways:
I will demonstrate this solution on example 2, 8, 1, 5, 7, 1, 6 and all number are <= 3^2 = 9. I use more elements to make my idea more clearer.
First for each number A[i] compute A[i] / N. Lets call this number first_part_of_number.
Sort this array using counting sort by first_part_of_number.
Results are in form (example for N = 3)
(0, 2)
(0, 1)
(0, 1)
(2, 8)
(2, 6)
(2, 7)
(2, 6)
Divide them into groups by first_part_of_number.
In this example you will have groups
(0, 2)
(0, 1)
(0, 1)
and
(2, 8)
(2, 6)
(2, 7)
(2, 6)
For each number compute X modulo N. Lets call it second_part_of_number. Add this number to each element
(0, 2, 2)
(0, 1, 1)
(0, 1, 1)
and
(2, 8, 2)
(2, 6, 0)
(2, 7, 1)
(2, 6, 0)
Sort each group using counting sort by second_part_of_number
(0, 1, 1)
(0, 1, 1)
(0, 2, 2)
and
(2, 6, 0)
(2, 6, 0)
(2, 7, 1)
(2, 8, 2)
Now combine all groups and you have result 1, 1, 2, 6, 6, 7, 8.
Complexity:
You were using only counting sort on elements <= N.
Each element took part in exactly 2 "sorts". So overall complexity is O(N).

I'm going to betray my limited knowledge of algorithmic complexity here, but:
Wouldn't it make sense to scan the array once and build something like a self-balancing tree? As we know the number of nodes in the tree will only grow to (log log n) it is relatively cheap (?) to find a number each time. If a repeat number is found (likely) a counter in that node is incremented.
Then to construct the sorted array, read the tree in order.
Maybe someone can comment on the complexity of this and any flaws.

Update: After I wrote the answer below, #Nabb showed me why it was incorrect. For more information, see Wikipedia's brief entry on Õ, and the links therefrom. At least because it is still needed to lend context to #Nabb's and #Blueshift's comments, and because the whole discussion remains interesting, my original answer is retained, as follows.
ORIGINAL ANSWER (INCORRECT)
Let me offer an unconventional answer: though there is indeed a difference between O(n*n) and O(n), there is no difference between O(n) and O(n*log(n)).
Now, of course, we all know that what I just said is wrong, don't we? After all, various authors concur that O(n) and O(n*log(n)) differ.
Except that they don't differ.
So radical-seeming a position naturally demands justification, so consider the following, then make up your own mind.
Mathematically, essentially, the order m of a function f(z) is such that f(z)/(z^(m+epsilon)) converges while f(z)/(z^(m-epsilon)) diverges for z of large magnitude and real, positive epsilon of arbitrarily small magnitude. The z can be real or complex, though as we said epsilon must be real. With this understanding, apply L'Hospital's rule to a function of O(n*log(n)) to see that it does not differ in order from a function of O(n).
I would contend that the accepted computer-science literature at the present time is slightly mistaken on this point. This literature will eventually refine its position in the matter, but it hasn't done, yet.
Now, I do not expect you to agree with me today. This, after all, is merely an answer on Stackoverflow -- and what is that compared to an edited, formally peer-reviewed, published computer-science book -- not to mention a shelffull of such books? You should not agree with me today, only take what I have written under advisement, mull it over in your mind these coming weeks, consult one or two of the aforementioned computer-science books that take the other position, and make up your own mind.
Incidentally, a counterintuitive implication of this answer's position is that one can access a balanced binary tree in O(1) time. Again, we all know that that's false, right? It's supposed to be O(log(n)). But remember: the O() notation was never meant to give a precise measure of computational demands. Unless n is very large, other factors can be more important than a function's order. But, even for n = 1 million, log(n) is only 20, compared, say, to sqrt(n), which is 1000. And I could go on in this vein.
Anyway, give it some thought. Even if, eventually, you decide that you disagree with me, you may find the position interesting nonetheless. For my part, I am not sure how useful the O() notation really is when it comes to O(log something).
#Blueshift asks some interesting questions and raises some valid points in the comments below. I recommend that you read his words. I don't really have a lot to add to what he has to say, except to observe that, because few programmers have (or need) a solid grounding in the mathematical theory of the complex variable, the O(log(n)) notation has misled probably, literally hundreds of thousands of programmers to believe that they were achieving mostly illusory gains in computational efficiency. Seldom in practice does reducing O(n*log(n)) to O(n) really buy you what you might think that it buys you, unless you have a clear mental image of how incredibly slow a function the logarithm truly is -- whereas reducing O(n) even to O(sqrt(n)) can buy you a lot. A mathematician would have told the computer scientist this decades ago, but the computer scientist wasn't listening, was in a hurry, or didn't understand the point. And that's all right. I don't mind. There are lots and lots of points on other subjects I don't understand, even when the points are carefully explained to me. But this is a point I believe that I do happen to understand. Fundamentally, it is a mathematical point not a computer point, and it is a point on which I happen to side with Lebedev and the mathematicians rather than with Knuth and the computer scientists. This is all.

Related

Difficulty in thinking a divide and conquer approach

I am self-learning algorithms. As we know Divide and Conquer is one of the algorithm design paradigms. I have studied mergeSort, QuickSort, Karatsuba Multiplication, counting inversions of an array as examples of this particular design pattern. Although it sounds very simple, divides the problems into subproblems, solves each subproblem recursively, and merges the result of each of them, I found it very difficult to develop an idea of how to apply that logic to a new problem. To my understanding, all those above-mentioned canonical examples come up with a very clever trick to solve the problem. For example, I am trying to solve the following problem:
Given a sequence of n numbers such that the difference between two consecutive numbers is constant, find the missing term in logarithmic time.
Example: [5, 7, 9, 11, 15]
Answer: 13
First, I came up with the idea that it can be solved using the divide and conquer approach as the naive approach will take O(n) time. From my understanding of divide and conquer, this is how I approached:
The original problem can be divided into two independent subproblems. I can search for the missing term in the two subproblems recursively. So, I first divide the problem.
leftArray = [5,7,9]
rightArray = [11, 15]
Now it says, I need to solve the subproblems recursively until it becomes trivial to solve. In this case, the subproblem becomes of size 1. If there is only one element, there are 0 missing elements. Now to combine the result. But I am not sure how to do it or how it will solve my original problem.
Definitely, I am missing something crucial here. My question is how to approach when solving this type of divide and conquer problem. Should I come up with a trick like a mergeSort or QuickSort? The more I see the solution to this kind of problem, it feels I am memorizing the approach to solve, not understanding and each problem solves it differently. Any help or suggestion regarding the mindset when solving divide and conquer would be greatly appreciated. I have been trying for a long time to develop my algorithmic skill but I improved very little. Thanks in advance.

You have the right approach. The only missing part is an O(1) way to decide which side you are discarding.
First, note that the numbers in your problem must be ordered, otherwise you can't do better than O(n). There also needs to be at least three numbers, otherwise you wouldn't figure out the "step".
With this understanding in place, you can determine the "step" in O(1) time by examining the initial three terms, and see what's the difference between the consecutive ones. Two outcomes are possible:
Both differences are the same, and
One difference is twice as big as the other.
Case 2 hands you a solution by luck, so we will consider only the first case from now on. With the step in hand, you can determine if the range has a gap in it by subtracting the endpoints, and comparing the result to the number of gaps times the step. If you arrive at the same result, the range does not have a missing term, and can be discarded. When both halves can be discarded, the gap is between them.

As #Sergey Kalinichenko points out, this assumes the incoming set is ordered
However, if you're certain the input is ordered (which is likely in this case) observe the nth position's value to be start + jumpsize * index; this allows you to bisect to find where it shifts
Example: [5, 7, 9, 11, 15]
Answer: 13
start = 5
jumpsize = 2
check midpoint: 5 * 2 * 2 -> 9
this is valid, so the shift must be after the midpoint
recurse
You can find the jumpsize by checking the first 3 values
a, b, c = (language-dependent retrieval)
gap1 = b - a
gap2 = c - b
if gap1 != gap2:
if (value at 4th index) - c == gap1:
missing value is b + gap1 # 2nd gap doesn't match
else:
missing value is a + gap2 # 1st gap doesn't match
bisect remaining values

Sorting algorithm based on subset inversion

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can just invert any subset you want. Length of the subset doesn't matter.
Like this:
http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png
So we can't simply swap numbers without inverting everything in between.
We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.
This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):
[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];
We're looking at two separate problems:
1) To sort a list of 25 numbers with the least amount of inversions
2) To sort a list of 25 numbers with the least amount of numbers moved
We also want to establish both upper and lower bounds for both.
We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!
UPDATE: the method we currently use is based on the above method
but instead works both ways. It looks at the next elements needed
for both ends (e.g. 1 and 25 at the beginning) and then calculates
which inversion would be cheapest. All values at the ends can be
ignored for the rest of the algorithm because they get put into the
correct place immediately. Our first method took 18/19 steps and 148
genes, and this one does it in 17 steps and 101 genes. For both
optimalisation tactics (the two mentioned above), this is a better
method. It is however not cheaper in terms of code and processing.
Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!
Thanks in advance!

Regarding the first question: Do you know (and care about) which of the two strands the genes are on?
If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.
If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.
Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)

I think what you're looking for for the second question is the minimum number of swaps of adjacent elements to sort a sequence, which is equal to the number of inversions in the sequence (where a[i] > a[j] and i < j).
The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,
5,6,1,2,3,4,7,8
we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),
2,1,6,5,3,4,7,8
and then [2,3] with [4,5] (inverting [2,3,4,5]),
2,1,4,3,5,6,7,8
and arrive at a sequence that now has significantly less element inversions, meaning less single adjacent swaps are needed to now complete the sort.
So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

Dynamic algorithm to multiply elements in a sequence two at a time and find the total

I am trying to find a dynamic approach to multiply each element in a linear sequence to the following element, and do the same with the pair of elements, etc. and find the sum of all of the products. Note that any two elements cannot be multiplied. It must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that there are an even amount of elements.
I assume I have to store the numbers being multiplied, and their product each time, then check some other "multipliable" pair of elements to see if the product has already been calculated (perhaps they possess opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must be increasing or decreasing by the same amount each time. But since there are an even amount of numbers, I don't believe it is possible to have two "multipliable" pairs be the same (with potentially opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there are an even amount of pairs, the only case in which the same multiplication may occur more than once is if the elements are increasing/decreasing by 0 each time.
I fail to see how this is a dynamic programming question, and if anyone could clarify, it would be greatly appreciated!

A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However this is not necessarily to be solved by dynamic programming. You can just iterate over the numbers and store the multiplication of two numbers in a set(as a set contains unique numbers) and then find the sum.

Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Note the first number by a, the difference between the numbers d. We then count for a total of 2n numbers in the whole sequence. Then the sum you defined is:
sum = na^2 + n(2n-1)ad + (4n^2 - 3n - 1)nd^2/3
That aside, I also failed to see how this is a dynamic problem, or at least this seems to be a problem where dynamic programming approach really doesn't do much. It is not likely that the sequence will go from negative to positive at all, and even then the chance that you will see repeated entries decreases the bigger your difference between two numbers is. Furthermore, multiplication is so fast the overhead from fetching them from a data structure might be more expensive. (mul instruction is probably faster than lw).

How do I efficiently count a list of intervals against a list of points?

So I have a list of intervals, let's say on a real line,
let intervals = [(1, 12), (2, 5), (3, 24), (7, 8)]
Note that the I use parentheses only because I store them as pairs, the intervals are actually inclusive (closed).
And I have a list of points,
let points = [13, 2, 7, 3, 14]
I am trying to count the number of points that fall into each interval, this should be an [Integer] that has length length intervals,
counts == [3, 2, 4, 1]
Now the problem is in reality both intervals and points are really long, so using a iterative algorithm that takes O(length intervals * length points) would take forever. Thus I'm consider using some kind of segment tree to make it O(log (length intervals) * length points). Currently I'm looking at the package SegmentTree. However, my limited Haskell knowledge wasn't enough for me to come up with a complete solution.
I understand that if the objective were to count the number of intervals that cover each point then then solution is straight forward:
import qualified Data.SegmentTree as S
map (S.countingQuery $ S.fromList intervals) points
But I can't think of a way to do the reverse. To me it seems that in order to do it efficiently a mutable data structure has to be used, and that is just going to open a Pandora Box.
What could be a solution?

If you can sort the points list first, you can do it pretty quickly: for each interval, find the index in the points list of its lower bound and of its upper bound, and subtract. Those lookups take log(nPoints) time, and you are doing nRanges of them, so the overall performance will be governed either by the initial sort (n log n) or by the lookups (m log n).
This comes out to O(log(nPoints) * max(nPoints, nRanges)), which is certainly better than quadratic time. It's as good as I'd expect to be able to get, too: I don't see any clever way to get down to linear time, and a log factor is pretty small.
The main drawback is that it requires having the entire points list in memory at once, for the sort, whereas you can imagine that a lazy solution could exist that would use less space.

Sorted Event form transformation puzzle

I came across an algorithmic puzzle as following:
Given an array of events in the form of (name, start time, end time)
e.g.
(a, 1, 6)
(b, 2, 4)
(c, 7, 8)
...
The events are sorted based on their start time. I was asked to transform the event into another form (name, time),
e.g.
(a, 1)
(b, 2)
(b, 4)
(a, 6)
(c, 7)
(c, 8)
Notice that each event is now broken into two events, and they are required to be sorted by time.
The most naive way is O(n log n), and I thought of several other ways, but non of them is faster than O(n log n).
Anybody know the most time and space efficient way of solving this?

Sweep time from beginning to end, maintaining a priority queue of the end times of active events, whose top element is compared repeatedly to the begin time of the next event. This is O(n log k), where k is the maximum number of simultaneous events, with extra space usage of O(k) on top of the input and output. I implemented something similar in C++ for this answer: https://stackoverflow.com/a/25694591/2144669 .

This can be proven to be just as time consuming as regular sorting.
For example, suppose I want to sort N positive numbers. I could convert that to this problem of sorting N tuples of the form (a1, 0, num1), (a2, 0, num2), ..., (aN, 0, numN). This would yield a sorted result (a1, 0), (a2, 0), ..., (aN, 0), (aSorted1, numSorted1), ..., (aSortedN, numSortedN). Hence we would get {numSorted1, ..., numSortedN}. It is proven that the last sort should take at least O(N*Log(N)), so you can't get any better than that in the general case.
However, if you say that the start times are unique, there may be some other optimizations to the problem.
EDIT: We are using additional space of O(N) here but I think there can be an argument made with that case. However it is not as rigorous an answer though.

Really efficient implementation depends on some limitation, applied to your data.
But, in general, following method can sort your list for O(N), and memory 2N:
1. Create struct like:
struct data {
int timestamp; // Event timestamp
int orig_index; // Index of original event in the input array
}
2. Copy data from the input array into array of structures [1].
Each original event is copied into two structs "data". This is O(N).
3. Sort resulting array with RadixSort, by field "timestamp".
This is again O(N): http://en.wikipedia.org/wiki/Radix_sort
4. Create output array, and restore names from original array by
index "orig_index", if needed. Again O(N).

No algorithm is more time efficient, since the sorting problem, which has a tight lower bound of omega(nlogn) is reducible to this problem (to sort an array, pick any group of start times etc.)
As for space complexity, Heapsort, which uses O(1) auxillary space and O(nlogn) time, is probably the best worst-case algorithm.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio