How do you find multiple k-th smallest elements in an array?

I am struggling with my homework and need a little push. The question is to design an algorithm that, in O(n log m) time, finds the elements of m given ranks 1 <= k_1 < k_2 < ... < k_m. I know that a simple selection algorithm takes O(n) time to find the k-th element, but how do you reduce the m in the recurrence? I thought of handling both k_1 and k_m in each run, but that only takes out 2 of the ranks, not m/2.
Would appreciate some directions.
Thanks

If I understand the question correctly, you have a vector K containing m indices, and you want to find the k'th ranked element of A for each k in K. If K contains the smallest m indices (i.e. k=1,2,...,m) then this can be done easily in linear time T=O(n) by using quickselect to find the element k_m (since all the smaller elements will be on the left at the end of quickselect). So I'm assuming that K can contain any set of m indices.
One way to accomplish this is by running quickselect on all of K at the same time. Here is the algorithm:
QuickselectMulti(A, K):
    if K is empty, return an empty result set
    pick a pivot p from A at random
    partition A into A0 = {x < p} and A1 = {x > p}
    i = A0.size + 1
    if K contains i, remove i from K and add (i => p) to the result set
    partition K into K0 = {k < i} and K1 = {k > i}
    add QuickselectMulti(A0, K0) to the result set
    subtract i from each k in K1
    call QuickselectMulti(A1, K1), add i to each index of the output, and add this to the result set
    return the result set
If K contains just one element, this is the same as randomized quickselect. To see why the running time is O(n log m) on average, first consider what happens when each pivot exactly splits both A and K in half. In this case, you get two recursive calls, so you have
T = n + 2T(n/2,m/2)
= n + n + 4T(n/4,m/4)
= n + n + n + 8T(n/8,m/8)
Since m is halved each time, n shows up log m times in this summation. To actually derive the expected running time requires a little more work, because you can't assume that the pivot will split both arrays in half, but if you work through the calculations, you will see that the running time is in fact O(n log m) on average.
On edit: The analysis of this algorithm can be made simpler by choosing the pivot as p = Quickselect(A, k_i), where k_i is the middle element of K, rather than choosing p at random. This guarantees that K is split in half each time, so the number of recursive calls is exactly log m, and since quickselect runs in linear time, the result is still O(n log m).
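For concreteness, here is a small Python sketch along the lines of the pseudocode above (the handling of elements equal to the pivot is an extra assumption on my part, since the pseudocode treats values as distinct):

import random

def quickselect_multi(A, K):
    """Return {k: k-th smallest element of A} for every 1-based rank k in K."""
    if not K:
        return {}
    p = random.choice(A)
    A0 = [x for x in A if x < p]          # elements smaller than the pivot
    A1 = [x for x in A if x > p]          # elements larger than the pivot
    eq = len(A) - len(A0) - len(A1)       # copies of the pivot itself
    result = {k: p for k in K if len(A0) < k <= len(A0) + eq}
    K0 = [k for k in K if k <= len(A0)]
    K1 = [k - len(A0) - eq for k in K if k > len(A0) + eq]
    result.update(quickselect_multi(A0, K0))
    shifted = quickselect_multi(A1, K1)   # ranks are relative to A1, shift them back
    result.update({k + len(A0) + eq: v for k, v in shifted.items()})
    return result

# e.g. quickselect_multi([5, 3, 8, 1, 9, 2], [1, 3, 5]) -> {1: 1, 3: 3, 5: 8}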

Related

Greedy Attempt for covering all the numbers with the given intervals

Let S be a set of n intervals over the natural numbers, possibly overlapping, and let N be a list of n numbers.
I want to find the smallest subset (let's call P) of S such that for each number
in our list N, there exists at least one interval in P that contains it. The intervals in P are allowed to overlap.
Trivial example:
S = {[1..4], [2..7], [3..5], [8..15], [9..13]}
N = [1, 4, 5]
// so P = {[1..4], [2..7]}
I think a dynamic-programming algorithm might not always work, so if anybody knows of a solution to this problem (or a similar problem it can be converted into), that would be great. I am trying to find an O(n^2) solution.
Here is one greedy approach:
P = {}
for each q in N:                               // O(n)
    if q is covered by some interval in P:     // O(n)
        continue
    for each i in S:                           // O(n)
        if q in i:                             // O(n)
            P.add(i)
            break
But that is O(n^4). Any help with creating a greedy approach that is O(n^2) would be great!
Thanks!
Update: I've been hammering away at this problem and I think I have an O(n^2) solution!
Let me know if you think it's right!
N = MergeSort(N)
lower, upper = +infinity, -infinity            // empty range, so the first q fails the check below
P = empty set
for each q in N do
    if (q >= lower and q <= upper) = False
        max_interval = [-infinity, -infinity]  // sentinel with the smallest possible right endpoint
        for each r in S do
            if q in r then
                if r.rightEndPoint > max_interval.rightEndPoint
                    max_interval = r
        P.append(max_interval)
        lower = max_interval.leftEndPoint
        upper = max_interval.rightEndPoint
        S.remove(max_interval)
I think this should work! I'm still trying to find a counterexample, but so far so good.
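For what it's worth, here is a runnable Python sketch of the updated pseudocode above, representing intervals as (left, right) tuples (an assumption on my part):

import math

def cover_points(S, N):
    """Greedy cover following the updated pseudocode above.
    S: list of (left, right) intervals, N: list of points to cover.
    For each uncovered point in sorted order, pick the containing interval
    that extends furthest to the right. Roughly O(n^2)."""
    P = []
    lower, upper = math.inf, -math.inf          # empty range: nothing covered yet
    for q in sorted(N):
        if lower <= q <= upper:
            continue                            # q already covered by the last pick
        best = None
        for left, right in S:
            if left <= q <= right and (best is None or right > best[1]):
                best = (left, right)
        if best is None:
            raise ValueError(f"point {q} is not covered by any interval in S")
        P.append(best)
        lower, upper = best
    return P

# e.g. cover_points([(1, 4), (2, 7), (3, 5), (8, 15), (9, 13)], [1, 4, 5])
# -> [(1, 4), (2, 7)], matching the trivial example above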
This problem is similar to the set cover problem, which is NP-complete (i.e., no solution substantially faster than exponential is known). What makes it different is that intervals always cover contiguous runs of elements (not arbitrary subsets of N), which opens the way to faster solutions.
http://en.wikipedia.org/wiki/Set_cover_problem
I think the solution proposed by Mike is good enough, but here is a fairly straightforward O(N^2) greedy algorithm. It starts like Mike's (and I believe Mike's solution can be improved in a similar way):
Sort your N numbers and place them into an array ELEM; complexity O(N lg N).
Using binary search, for each interval S[i] identify the starting and ending indices of the elements in ELEM covered by S[i]. Place this pair of indices into an array COVER; the difference between the two indices tells you how many elements the interval covers, which for convenience we store in an array COVER_COUNT. Complexity O(N lg N).
Introduce an index pointer p that shows up to which element of ELEM your N is already covered. Set p = 0, meaning that all elements before the 0-th (i.e., no elements) are initially covered; complexity O(1). Also introduce a boolean array IS_INCLUDED that records whether interval S[i] has already been added to your coverage set; complexity O(N).
Then start from the 0-th element of ELEM and find the interval that contains ELEM[0] and has the greatest coverage COVER_COUNT[i]. Say it is the i-th interval. Mark it as included by setting IS_INCLUDED[i] to true, and set p to end[i] + 1, where end[i] is the ending index in the COVER[i] pair (all elements up to end[i] are now covered). Then, knowing p, update COVER_COUNT so that each entry reflects how many not-yet-covered elements each interval covers (this can easily be done in O(N) time). Perform the same step for ELEM[p] and continue until p >= ELEM.length. The overall complexity is O(N^2).
You finish in O(N^2), and IS_INCLUDED is true exactly for the intervals of S included in the optimal cover set.
Let me know if this solution seems reasonable to you and if I calculated everything well.
P.S. Just wanted to add that the optimality of the solution found by the algorithm can be proved by induction and contradiction. By contradiction, it is easy to show that at least one optimal solution includes the longest interval among those covering ELEM[0]. By induction, we can then show that at each subsequent step we can keep following the strategy of selecting the interval that covers the most remaining elements and that covers the leftmost not-yet-covered element.
I am not sure, but maybe something like this:
1) For each interval S[i], create a list Q[i] of the elements of N contained in the interval; this takes O(n^2).
2) Sort S by the length of Q[i]; O(n lg n).
3) Go through this array, removing the elements of the chosen Q[i] from N (O(n)) and from Q[i+1]...Q[n] (O(n^2)).
4) Repeat from step 2 while N is not empty.
It's not O(n^2), it's O(n^3), but if you can use a hashmap, I think you can improve it.

Find pairs with given difference

Given n, k, and a list of n integers, how would you find the pairs of integers whose difference is k?
There is an O(n log n) solution, but I cannot figure it out.
You can do it like this:
Sort the array
For each item data[i], determine its two target pairs, i.e. data[i]+k and data[i]-k
Run a binary search on the sorted array for these two targets; if found, add both data[i] and data[targetPos] to the output.
Sorting is done in O(n log n). Each of the n search steps takes 2 log n time to look for the two targets, for an overall time of O(n log n).
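For illustration, a Python sketch of this approach; it assumes k > 0 and searches only for data[i] + k, since the complementary pair is then found when the smaller element is processed:

from bisect import bisect_left

def find_pairs_binary_search(data, k):
    """Sort, then binary-search data[i] + k for each i; O(n log n) overall."""
    data = sorted(data)
    out = []
    for x in data:
        j = bisect_left(data, x + k)           # binary search for the target
        if j < len(data) and data[j] == x + k:
            out.append((x, x + k))
    return out

# e.g. find_pairs_binary_search([1, 7, 5, 9, 2, 12, 3], 2)
# -> [(1, 3), (3, 5), (5, 7), (7, 9)]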
For this problem there exists a linear solution! Just ask yourself one question: if you have a, what other number should be in the array? Of course a+k or a-k (special case: k = 0 requires an alternative solution). So, what now?
Create a hash set (for example unordered_set in C++11) with all values from the array: O(1) average per element, so O(n) overall.
Iterate through the array and check for each element x whether x+k or x-k is present. Each lookup in the set is O(1) on average and each element is checked once, so this step is linear, O(n).
If you find an x whose pair (x+k or x-k) is present, that is what you are looking for.
So it's linear, O(n). If you really want O(n lg n), use a tree-based set with O(lg n) membership checks instead, and you get an O(n lg n) algorithm.
Addendum: there is no need to check both x+k and x-k; just x+k is sufficient, because if (a, b) is a valid pair then:
if a < b then
    a + k == b
else
    b + k == a
Improvement: if you know the value range in advance, you can guarantee linear worst-case complexity by using a boolean table (set_tab[i] == true when i is in the array).
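For illustration, a Python sketch of the hash-set idea (using Python's built-in set instead of C++'s unordered_set, and assuming k > 0, per the addendum above):

def find_pairs_hashset(arr, k):
    """Expected O(n): build a set of values, then look up x + k for each x."""
    values = set(arr)                                    # O(n) expected build time
    return [(x, x + k) for x in values if x + k in values]

# e.g. find_pairs_hashset([1, 7, 5, 9, 2, 12, 3], 2)
# -> pairs such as (1, 3), (3, 5), (5, 7), (7, 9) (set order is arbitrary)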
A solution similar to the one above:
Sort the array
Set i = 0, j = 1
Check the difference between array[i] and array[j]:
if the difference is too small, increase j
if the difference is too big, increase i
if the difference is the one you're looking for, add the pair to the results and increase j
Repeat the difference check until j reaches the end of the array
Sorting is O(n lg n); the scan is, if I'm correct, O(n) (at most 2n comparisons), so the whole algorithm is O(n lg n)
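For illustration, a Python sketch of the two-pointer scan (assuming k > 0; the extra guard keeps j ahead of i):

def find_pairs_two_pointers(arr, k):
    """Sort once, then sweep two indices; O(n log n) from the sort, O(n) scan."""
    data = sorted(arr)
    i, j, out = 0, 1, []
    while j < len(data):
        diff = data[j] - data[i]
        if diff == k:                  # found a pair
            out.append((data[i], data[j]))
            j += 1
        elif diff < k:                 # difference too small: move the right pointer
            j += 1
        else:                          # difference too big: move the left pointer
            i += 1
            if i == j:
                j += 1
    return out

# e.g. find_pairs_two_pointers([1, 7, 5, 9, 2, 12, 3], 2)
# -> [(1, 3), (3, 5), (5, 7), (7, 9)]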

Why is merge sort worst case run time O (n log n)?

Can someone explain it to me in simple English, or in an easy-to-understand way?
Merge sort uses the divide-and-conquer approach to solve the sorting problem. First, it divides the input in half using recursion. After dividing, it sorts the two halves and merges them into one sorted output.
This means it is better to sort the two halves of your problem first and then do a simple merge subroutine, so it is important to know the complexity of the merge subroutine and how many times it will be called in the recursion.
The pseudo-code for the merge subroutine is really simple:
# C = output [length = N]
# A = 1st sorted half [N/2]
# B = 2nd sorted half [N/2]
i = j = 1
for k = 1 to N
    if A[i] < B[j]
        C[k] = A[i]
        i++
    else
        C[k] = B[j]
        j++
It is easy to see that every iteration of the loop performs 4 operations: k++, i++ or j++, the if comparison, and the assignment to C[k]. So you have at most 4N + 2 operations, giving O(N) complexity. For the sake of the proof, 4N + 2 will be bounded by 6N, which holds for N >= 1 (4N + 2 <= 6N).
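For concreteness, here is a small runnable Python version of the merge step (plus the surrounding recursion); unlike the pseudocode above, it also copies the leftover tail when one half runs out:

def merge(A, B):
    """Merge two already-sorted lists into one sorted list in O(len(A) + len(B))."""
    C, i, j = [], 0, 0
    while i < len(A) and j < len(B):
        if A[i] < B[j]:
            C.append(A[i]); i += 1
        else:
            C.append(B[j]); j += 1
    C.extend(A[i:])        # at most one of these
    C.extend(B[j:])        # two tails is non-empty
    return C

def merge_sort(xs):
    """Split in half, sort each half recursively, then merge."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

# e.g. merge_sort([5, 1, 4, 2, 3]) -> [1, 2, 3, 4, 5]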
Now assume you have an input of N elements and that N is a power of 2. At every level you have twice as many subproblems, each with an input half the size of the previous one. This means that at level j = 0, 1, 2, ..., lg N there are 2^j subproblems, each with an input of length N / 2^j. The number of operations at each level j is at most
2^j * 6(N / 2^j) = 6N
Observe that no matter the level, you always have at most 6N operations.
Since there are lg N + 1 levels, the complexity is
O(6N * (lg N + 1)) = O(6N lg N + 6N) = O(N lg N)
References:
Coursera course Algorithms: Design and Analysis, Part 1
On a "traditional" merge sort, each pass through the data doubles the size of the sorted subsections. After the first pass, the file will be sorted into sections of length two. After the second pass, length four. Then eight, sixteen, etc. up to the size of the file.
It's necessary to keep doubling the size of the sorted sections until there's one section comprising the whole file. It will take lg(N) doublings of the section size to reach the file size, and each pass of the data will take time proportional to the number of records.
After splitting the array down to the stage where you have single elements (call them sublists), at each stage we compare the elements of each sublist with those of its adjacent sublist:
At Stage-1 each element is compared with its adjacent one, so n/2 comparisons.
At Stage-2, each sublist is compared with its adjacent sublist. Since each sublist is sorted, the maximum number of comparisons made between two sublists is at most the length of a sublist, i.e. 2 at Stage-2, 4 at Stage-3, and 8 at Stage-4, since the sublists keep doubling in length. This means the maximum number of comparisons at each stage = (length of sublist) * (number of sublists / 2) = n/2.
As you've observed, the total number of stages is log(n) base 2.
So the total complexity is (max number of comparisons at each stage) * (number of stages) = O((n/2) * log(n)) = O(n log(n)).
Algorithm merge-sort sorts a sequence S of size n in O(n log n)
time, assuming two elements of S can be compared in O(1) time.
This is because, whether in the worst case or the average case, merge sort just divides the array into two halves at each stage, which gives the lg(n) component, while the other N component comes from the comparisons made at each stage. Combining the two gives roughly O(N lg N). The lg(N) factor is always present regardless of the case; the N factor depends on the number of comparisons made at each stage. The worst case is the one in which N comparisons happen for an input of size N at each stage, so it is O(N lg N).
Many of the other answers are great, but I didn't see any mention of height and depth related to the "merge-sort tree" examples. Here is another way of approaching the question, with a lot of focus on the tree.
Just a recap: as other answers have pointed out we know that the work of merging two sorted slices of the sequence runs in linear time (the merge helper function that we call from the main sorting function).
Now looking at this tree, where we can think of each descendant of the root (other than the root) as a recursive call to the sorting function, let's try to assess how much time we spend on each node... Since the slicing of the sequence and merging (both together) take linear time, the running time of any node is linear with respect to the length of the sequence at that node.
Here's where tree depth comes in. If n is the total size of the original sequence, the size of the sequence at any node is n/2^i, where i is the depth. Putting this together with the linear amount of work for each slice, we have a running time of O(n/2^i) for every node in the tree. Now we just have to sum that up over all the nodes. One way to do this is to recognize that there are 2^i nodes at each level of depth in the tree. So for any level, we have O(2^i * n/2^i), which is O(n) because the 2^i factors cancel out. If each level is O(n), we just have to multiply that by the height of this binary tree, which is log n. Answer: O(n log n)
reference: Data Structures and Algorithms in Python
The recursive tree will have depth log(N), and at each level in that tree you will do a combined N work to merge two sorted arrays.
Merging sorted arrays
To merge two sorted arrays A[1,5] and B[3,4] you simply iterate both starting at the beginning, picking the lowest element between the two arrays and incrementing the pointer for that array. You're done when both pointers reach the end of their respective arrays.
[1,5] [3,4] --> []
 ^     ^
[1,5] [3,4] --> [1]
   ^   ^
[1,5] [3,4] --> [1,3]
   ^     ^
[1,5] [3,4] --> [1,3,4]
   ^      x
[1,5] [3,4] --> [1,3,4,5]
    x     x
Runtime = O(A + B)
Merge sort illustration
Your recursive call stack will look like this. The work starts at the bottom leaf nodes and bubbles up.
beginning with [1,5,3,4], N = 4, depth k = log(4) = 2
[1,5] [3,4] depth = k-1 (2^1 nodes) * (N/2^1 values to merge per node) == N
[1] [5] [3] [4] depth = k (2^2 nodes) * (N/2^2 values to merge per node) == N
Thus you do N work at each of k levels in the tree, where k = log(N)
N * k = N * log(N)
The MergeSort algorithm takes three steps:
The divide step computes the mid position of the sub-array and takes constant time, O(1).
The conquer step recursively sorts two sub-arrays of approximately n/2 elements each.
The combine step merges a total of n elements at each pass, requiring at most n comparisons, so it takes O(n).
The algorithm requires approximately log n passes to sort an array of n elements, so the total time complexity is O(n log n).
Let's take an example of 8 elements {1,2,3,4,5,6,7,8}.
1. First divide it in half, n/2 = 4: ({1,2,3,4} {5,6,7,8}). These two sections take O(n/2) and O(n/2) time, so the first step takes O(n/2 + n/2) = O(n) time.
2. The next step divides into pieces of size n/2^2: (({1,2} {3,4}) ({5,6} {7,8})), which take O(n/4), O(n/4), O(n/4), O(n/4) respectively, so this step takes O(n/4 + n/4 + n/4 + n/4) = O(n) time in total.
3. Similarly, the next step divides by 2 again into pieces of size n/2^3: ((({1},{2},{3},{4}) ({5},{6},{7},{8}))), whose time is O(n/8 + n/8 + n/8 + n/8 + n/8 + n/8 + n/8 + n/8) = O(n).
So every step takes O(n) time. If there are a steps, the time taken by merge sort is O(a*n), and a must be log(n) because each step divides the size by 2. So finally the time complexity of merge sort is O(n log(n)).

What is the order of the run time for an algorithm with this desired output?

There are N sets A_1 to A_N, each with string entries. The average size of a set is K.
For each A_i we wish to return a list (or a better data structure?) of the other N-1 sets, ordered by how many elements each has in common with A_i.
Please don't be shy to give a detailed response with nice mathematical arguments...:)
Also is this a standard problem and can I read more about it somewhere?
Basically, you generate each result list element by performing intersections of 2 sets. You have N-1 intersections per result list element, which boils down to (N-1) * IntersectTime. For the N list elements in the result this sums to N(N-1) * IntersectTime. Afterwards you have to order N-1 sets, N times, so just for the ordering you have O(N² log N).
IntersectTime depends on the implementation of the set; for a typical hash set it is O(K).
So finally we have O(N²K) + O(N² log N) = O(N²(K + log N)), which (if we assume K > log N) is O(N²K).
EDIT: if you actually implement this, it is good to know that each pairwise intersection can be reused for 2 of the result list elements. That means A_1 must be intersected with N-1 other sets, A_2 with N-2 (the intersection with A_1 was already done for the first element), A_3 with N-3, and finally A_N with none. But this does not change the big-O time; it just halves the runtime.
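For illustration, a brute-force Python sketch matching this analysis (the names and the list-of-sets representation are assumptions; it does not include the reuse trick from the edit):

def rank_by_overlap(sets):
    """For each set A_i, return the indices of the other sets ordered by the size
    of the intersection A_i & A_j (largest overlap first).
    Roughly O(N^2 * K) for the intersections plus O(N^2 log N) for the sorting."""
    n = len(sets)
    ranking = []
    for i in range(n):
        overlaps = [(len(sets[i] & sets[j]), j) for j in range(n) if j != i]
        overlaps.sort(key=lambda t: -t[0])     # most elements in common first
        ranking.append([j for _, j in overlaps])
    return ranking

# e.g. rank_by_overlap([{"a", "b"}, {"b", "c"}, {"x"}]) -> [[1, 2], [0, 2], [0, 1]]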
Here's my attempt -
I believe you can boil the process down into:
O(N * (C + S))
Where N is the number of sets, C is the amount of time it takes to compare N-1 sets to set Ai, and S is the amount of time it takes to sort the N-1 sets.
The comparison is K items to K items N-1 times, so (N-1)K^2 time to compare
Sorting should take log(n - 1) time with an efficient algorithm
For simplicity, we can shorten N-1 into just N
So, the whole thing should run in O(N(NK^2 + log(N)))
You should take this with a grain of salt, I haven't done anything with algorithms for quite a while. There may also be a more efficient way to compare the sets.

Problem k-subvector using dynamic programming

Given a vector V of n integers and an integer k, k <= n, you want a subvector (a sequence of consecutive elements of the vector) of maximum length containing at most k distinct elements.
The technique that I use for the resolution of the problem is dynamic programming.
The complexity of this algorithm must be O(n*k).
The main problem is how to count the distinct elements of the vector. How would you solve it?
How do you write the recurrence equation?
Thank you!
I don't know why you would insist on O(n*k); this can be solved in O(n) with a 'sliding window' approach.
Maintain current 'window' [left..right]
At each step, if we can increase right by 1 (without violating the 'at most k distinct elements' requirement), do it
Otherwise, increase left by 1
Check whether current window is the longest and go back to #2
Checking whether we can increase right in step 2 is a little tricky. We can use a hashtable that stores, for each element inside the window, how many times it occurs there.
So, the condition to allow right increase would look like
hash.size < k || hash.contains(V[right + 1])
And each time left or right is increased, we'll need to update hash (decrease or increase number of occurrences of the given element).
I'm pretty sure any DP solution here would be longer and more complicated.
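For illustration, here is a Python sketch of the sliding-window idea, with a dict playing the role of the hashtable of per-element counts (the function name and return value are my own choices):

from collections import defaultdict

def longest_with_k_distinct(V, k):
    """Longest contiguous subvector of V with at most k distinct values; O(n)."""
    counts = defaultdict(int)            # element -> occurrences inside the window
    best_len, best_start, left = 0, 0, 0
    for right, v in enumerate(V):
        counts[v] += 1
        while len(counts) > k:           # too many distinct values: shrink from the left
            counts[V[left]] -= 1
            if counts[V[left]] == 0:
                del counts[V[left]]
            left += 1
        if right - left + 1 > best_len:
            best_len, best_start = right - left + 1, left
    return V[best_start:best_start + best_len]

# e.g. longest_with_k_distinct([1, 2, 1, 3, 3, 3, 2], 2) -> [1, 3, 3, 3]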
"The main problem is how to count distinct elements of the vector. How would you solve it?"
If you are allowed to use hashing, you could do the following:
init Hashtable h
distinct_count := 0
for each element v of the vector V
    if h does not contain v          (O(1) time on average)
        insert v into h              (O(1) time on average)
        distinct_count := distinct_count + 1
return distinct_count
This is O(n) time on average.
If not, here is an O(n log n) solution - this time worst case:
sort V (O(n log n) comparisons)
Then it should be easy to determine the number of different elements in O(n) time ;-)
I could also tell you an algorithm to sort V in O(n*b) where b is the bit count of the integers - if this helps you.
Here is the algorithm:
sort(vector, begin_index, end_index, currentBit)
    reorder vector[begin_index .. end_index] so that the elements that have a 1 at bit currentBit come after those that have a 0 there (O(end_index - begin_index) time)
    let c be the count of elements that have a 0 at bit currentBit (O(end_index - begin_index) time; can be obtained from the previous step)
    if (currentBit is not 0)
        call sort(vector, begin_index, begin_index + c - 1, currentBit - 1)
        call sort(vector, begin_index + c, end_index, currentBit - 1)
Call it with
vector = V
begin_index = 0
end_index = n-1
currentBit = bit count of the integers (=: b)-1.
This even uses dynamic programming as requested.
As you can easily determine, this is O(n*b) time with a recursion depth of b.
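For illustration, a Python sketch of this bit-by-bit sort for non-negative integers (the stable partition via two temporary lists is a shortcut for the reorder step described above):

def bit_sort(v, begin_index, end_index, current_bit):
    """Sort v[begin_index .. end_index] in place by bits current_bit .. 0."""
    if begin_index >= end_index or current_bit < 0:
        return
    chunk = v[begin_index:end_index + 1]
    zeros = [x for x in chunk if not (x >> current_bit) & 1]   # 0 at current_bit first
    ones = [x for x in chunk if (x >> current_bit) & 1]        # then 1 at current_bit
    v[begin_index:end_index + 1] = zeros + ones
    c = len(zeros)
    bit_sort(v, begin_index, begin_index + c - 1, current_bit - 1)
    bit_sort(v, begin_index + c, end_index, current_bit - 1)

# usage:
# V = [13, 2, 9, 4, 4, 7]
# b = max(V).bit_length()                # bit count of the integers
# bit_sort(V, 0, len(V) - 1, b - 1)      # V becomes [2, 4, 4, 7, 9, 13]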
