Big-O complexity of calculation: Drawing a non-colliding subset of k elements from n total with dumb algorithm - complexity-theory

I'm trying to understand the computational complexity of this pseudocode:
values is a set of n unique elements
subset is an empty set
for 0 ... k
X: randomly select a value from values
if value is in subset
goto X
else
insert value into subset
This is of course a (poor) algorithm for selecting a unique random subset of k elements from n, and I'm aware of the better choices, but I wanted to understand the computational complexity of this.
I can see easily that this is O(n) when duplicates are allowed because the conditional test is eliminated from the pseudocode and k choices are made each time.
When you have to account for duplicates, there is a probability that a re-test will be required which increases with each iteration. Depending on the values of n and k, this is a non-negligible fact, but I'm not certain how it affects the big-O complexity in a generalized way. Could someone explain this to me?

The probability of inserting value into subset for each value of k is (n-k)/n
The number of iterations of each k loop would be inversely proportional to that probability
Therefore Big O notation for each value of k would be O((n/(n-K)) + 1) where 1 would be 'insert value into subset'.
You have to calculate the summation of ((n/(n-K)) + 1) for each value of k ----final answer would be
O(((n/(n-K)) + 1) for k from 1 through k)
Disclaimer - this is assuming if Big(o) is applicable for functions that use random number generating algorithms (since X is random)

Related

How many times variable m is updated

Given the following pseudo-code, the question is how many times on average is the variable m being updated.
A[1...n]: array with n random elements
m = a[1]
for I = 2 to n do
if a[I] < m then m = a[I]
end for
One might answer that since all elements are random, then the variable will be updated on average on half the number of iterations of the for loop plus one for the initialization.
However, I suspect that there must be a better (and possibly the only correct) way to prove it using binomial distribution with p = 1/2. This way, the average number of updates on m would be
M = 1 + Σi=1 to n-1[k.Cn,k.pk.(1-p)(n-k)]
where Cn,k is the binomial coefficient. I have tried to solve this but I have stuck some steps after since I do not know how to continue.
Could someone explain me which of the two answers is correct and if it is the second one, show me how to calculate M?
Thank you for your time
Assuming the elements of the array are distinct, the expected number of updates of m is the nth harmonic number, Hn, which is the sum of 1/k for k ranging from 1 to n.
The summation formula can also be represented by the recursion:
H1 &equals; 1
Hn &equals; Hn−1&plus;1/n (n > 1)
It's easy to see that the recursion corresponds to the problem.
Consider all permutations of n−1 numbers, and assume that the expected number of assignments is Hn−1. Now, every permutation of n numbers consists of a permutation of n−1 numbers, with a new smallest number inserted in one of n possible insertion points: either at the beginning, or after one of the n−1 existing values. Since it is smaller than every number in the existing series, it will only be assigned to m in the case that it was inserted at the beginning. That has a probability of 1/n, and so the expected number of assignments of a permutation of n numbers is Hn−1 + 1/n.
Since the expected number of assignments for a vector of length one is obviously 1, which is H1, we have an inductive proof of the recursion.
Hn is asymptotically equal to ln n &plus; γ where γ is the Euler-Mascheroni constant, approximately 0.577. So it increases without limit, but quite slowly.
The values for which m is updated are called left-to-right maxima, and you'll probably find more information about them by searching for that term.
I liked #rici answer so I decided to elaborate its central argument a little bit more so to make it clearer to me.
Let H[k] be the expected number of assignments needed to compute the min m of an array of length k, as indicated in the algorithm under consideration. We know that
H[1] = 1.
Now assume we have an array of length n > 1. The min can be in the last position of the array or not. It is in the last position with probability 1/n. It is not with probability 1 - 1/n. In the first case the expected number of assignments is H[n-1] + 1. In the second, H[n-1].
If we multiply the expected number of assignments of each case by their probabilities and sum, we get
H[n] = (H[n-1] + 1)*1/n + H[n-1]*(1 - 1/n)
= H[n-1]*1/n + 1/n + H[n-1] - H[n-1]*1/n
= 1/n + H[n-1]
which shows the recursion.
Note that the argument is valid if the min is either in the last position or in any the first n-1, not in both places. Thus we are using that all the elements of the array are different.

How to calculate runtime complexity of the nested loops with variable length

Suppose I have a task to write an algorithm that runs through an array of strings and checks if each value in the array contains s character. The algorithm will have two nested loops, here is the pseudo code:
for (let i=0; i < a.length; i++)
for (let j=0; j < a[i].length; j++)
if (a[i][j] === 'c')
do something
Now, the task is to identify the runtime complexity of the algorithm. Here is my reasoning:
let the number of elements in the array be n, while the maximum length of string values m. So the general formula for complexity is
n x m
Now the possible cases.
If the maximum length of string values is equal to the number of elements, I get the complexity:
n^2
If the maximum length of elements is less than the number of elements by some number a, the complexity is
n x (n - a) = n^2 - na
If the maximum length of elements is more than the number of elements by some number a, the complexity is
n x (n - a) = n^2 + na
Since we discard lower growth functions, it seems that the complexity of the algorithm is n^2. Is my reasoning correct?
Your time complexity is just the total number of characters. Which of the analyses is applicable, depends entirely on which of your assumptions about the relationship between the length of words, and the number of words, holds true. Note in particular, your statement that the time complexity is N x M where M is the largest name in the array, is not correct (it's correct in the sense that it places an upper bound, but that upper bound is not tight, so it's not very interesting; it's correct in the same sense that N^2 x M^2 is correct).
I think certainly in many real cases of interest, your analysis is incorrect. The total number of characters is equal to the number of strings, times the average number of characters per string, i.e. word length (note: average, not maximum!). As the number of strings becomes large, the average sample word length will approach the mean of whatever distribution you are sampling from. So at least for any well behaved distribution where the sampling is iid, the time complexity is simply N.
A good practical example is a database that stores names. It depends of course which people happen to be in the database, but if you are storing names of say American citizens, then as N becomes large, the number of inner operations will approach N times the average number of characters in a name, in the US. The latter quantity just doesn't depend on N at all, so it's linear in N.

How do you find multiple ki smallest elements in array?

I am struggling with my homework and need a little push- the question is to design an algorithm that will in O(nlogm) time find multiple smallest elements 1<k1<k2<...<kn and you have m *k. I know that a simple selection algorithm takes o(n) time to find the kth element, but how do you reduce the m in your recurrence? I though to do both k1 and kn in each run, but that will only take out 2, not m/2.
Would appreciate some directions.
Thanks
If I understand the question correctly, you have a vector K containing m indices, and you want to find the k'th ranked element of A for each k in K. If K contains the smallest m indices (i.e. k=1,2,...,m) then this can be done easily in linear time T=O(n) by using quickselect to find the element k_m (since all the smaller elements will be on the left at the end of quickselect). So I'm assuming that K can contain any set of m indices.
One way to accomplish this is by running quickselect on all of K at the same time. Here is the algorithm
QuickselectMulti(A,K)
If K is empty, then return an empty result set
Pick a pivot p from A at random
Partition A into sets A0<p and A1>p.
i = A0.size + 1
if K contains i, then remove i from K and add (i=>p) to the result set.
Partition K into sets K0<i and K1>i
add QuickselectMulti(A0,K0) to the result set
subtract i from each k in K1
call QuickselectMulti(A1,K1), add i to each index of the output, and add this to the result set
return the result set
If K contains just one element, this is the same as randomized quickselect. To see why the running time is O(n log m) on average, first consider what happens when each pivot exactly splits both A and K in half. In this case, you get two recursive calls, so you have
T = n + 2T(n/2,m/2)
= n + n + 4T(n/4,m/4)
= n + n + n + 8T(n/8,m/8)
Since m drops in half each time, then n will show up log m times in this summation. To actually derive the expected running time requires a little more work, because you can't assume that the pivot will split both arrays in half, but if you work through the calculations, you will see that the running time is in fact O(n log m) on average.
On edit: The analysis of this algorithm can make this simpler by choosing the pivot by running p=Quickselect(A,k_i) where k_i is the middle element of K, rather than choosing p at random. This will guarantee that K gets split in half each time, and so the number of recursive calls will be exactly log m, and since quickselect runs in linear time, the result will still be O(n log m).

Choosing minimum length k of array for merge sort where use of insertion sort to sort the subarrays is more optimal than standard merge sort

This is a question from Introduction to Algorithms By Cormen. But this isn't a homework problem instead self-study.
There is an array of length n. Consider a modification to merge sort in which n/k sublists each of length k are sorted using insertion sort and then merged using merging mechanism, where k is a value to be determined.
The relationship between n and k isn't known. The length of array is n. k sublists of n/k means n * (n/k) equals n elements of the array. Hence k is simply a limit at which the splitting of array for use with merge-sort is stopped and instead insertion-sort is used because of its smaller constant factor.
I was able to do the mathematical proof that the modified algorithm works in Θ(n*k + n*lg(n/k)) worst-case time. Now the book went on to say to
find the largest value of k as a function of n for which this modified algorithm has the same running time as standard merge sort, in terms of Θ notation. How should we choose k in practice?
Now this got me thinking for a lot of time but I couldn't come up with anything. I tried to solve
n*k + n*lg(n/k) = n*lg(n) for a relationship. I thought that finding an equality for the 2 running times would give me the limit and greater can be checked using simple hit-and-trial.
I solved it like this
n k + n lg(n/k) = n lg(n)
k + lg(n/k) = lg(n)
lg(2^k) + lg(n/k) = lg(n)
(2^k * n)/k = n
2^k = k
But it gave me 2 ^ k = k which doesn't show any relationship. What is the relationship? I think I might have taken the wrong equation for finding the relationship.
I can implement the algorithm and I suppose adding an if (length_Array < k) statement in the merge_sort function here(Github link of merge sort implementation) for calling insertion sort would be good enough. But how do I choose k in real life?
Well, this is a mathematical minimization problem, and to solve it, we need some basic calculus.
We need to find the value of k for which d[n*k + n*lg(n/k)] / dk == 0.
We should also check for the edge cases, which are k == n, and k == 1.
The candidate for the value of k that will give the minimal result for n*k + n*lg(n/k) is the minimum in the required range, and is thus the optimal value of k.
Attachment, solving the derivitives equation:
d[n*k + n*lg(n/k)] / dk = d[n*k + nlg(n) - nlg(k)] / dk
= n + 0 - n*1/k = n - n/k
=>
n - n/k = 0 => n = n/k => 1/k = 1 => k = 1
Now, we have the candidates: k=n, k=1. For k=n we get O(n^2), thus we conclude optimal k is k == 1.
Note that we found the derivitives on the function from the big Theta, and not on the exact complexity function that uses the needed constants.
Doing this on the exact complexity function, with all the constants might yield a bit different end result - but the way to solve it is pretty much the same, only take derivitives from a different function.
maybe k should be lg(n)
theta(nk + nlog(n/k)) have two terms, we have the assumption that k>=1, so the second term is less than nlog(n).
only when k=lg(n), the whole result is theta(nlog(n))

Selection i'th smallest number algorithm

I'm reading Introduction to Algorithms book, second edition, the chapter about Medians and Order statistics. And I have a few questions about randomized and non-randomized selection algorithms.
The problem:
Given an unordered array of integers, find i'th smallest element in the array
a. The Randomized_Select algorithm is simple. But I cannot understand the math that explains it's work time. Is it possible to explain that without doing deep math, in more intuitive way? As for me, I'd think that it should work for O(nlog n), and in worst case it should be O(n^2), just like quick sort. In avg randomizedPartition returns near middle of the array, and array is divided into two each call, and the next recursion call process only half of the array. The RandomizedPartition costs (p-r+1)<=n, so we have O(n*log n). In the worst case it would choose every time the max element in the array, and divide the array into two parts - (n-1) and (0) each step. That's O(n^2)
The next one (Select algorithm) is more incomprehensible then previous:
b. What it's difference comparing to previous. Is it faster in avg?
c. The algorithm consists of five steps. In first one we divide the array into n/5 parts each one with 5 elements (beside the last one). Then each part is sorted using insertion sort, and we select 3rd element (median) of each. Because we have sorted these elements, we can be sure that previous two <= this pivot element, and the last two are >= then it. Then we need to select avg element among medians. In the book stated that we recursively call Select algorithm for these medians. How we can do that? In select algorithm we are using insertion sort, and if we are swapping two medians, we need to swap all four (or even more if it is more deeper step) elements that are "children" for each median. Or do we create new array that contain only previously selected medians, and are searching medians among them? If yes, how can we fill them in original array, as we changed their order previously.
The other steps are pretty simple and look like in the randomized_partition algorithm.
The randomized select run in O(n). look at this analysis.
Algorithm :
Randomly choose an element
split the set in "lower than" set L and "bigger than" set B
if the size of "lower than" is j-1 we found it
if the size is bigger, then Lookup in L
or lookup in B
The total cost is the sum of :
The cost of splitting the array of size n
The cost of lookup in L or the cost of looking up in B
Edited: I Tried to restructure my post
You can notice that :
We always go next in the set with greater amount of elements
The amount of elements in this set is n - rank(xj)
1 <= rank(xi) <= n So 1 <= n - rank(xj) <= n
The randomness of the element xj directly affect the randomness of the number of element which
are greater xj(and which are smaller than xj)
if xj is the element chosen , then you know that the cost is O(n) + cost(n - rank(xj)). Let's call rank(xj) = rj.
To give a good estimate we need to take the expected value of the total cost, which is
T(n) = E(cost) = sum {each possible xj}p(xj)(O(n) + T(n - rank(xj)))
xj is random. After this it is pure math.
We obtain :
T(n) = 1/n *( O(n) + sum {all possible values of rj when we continue}(O(n) + T(n - rj))) )
T(n) = 1/n *( O(n) + sum {1 < rj < n, rj != i}(O(n) + T(n - rj))) )
Here you can change variable, vj = n - rj
T(n) = 1/n *( O(n) + sum { 0 <= vj <= n - 1, vj!= n-i}(O(n) + T(vj) ))
We put O(n) outside the sum , gain a factor
T(n) = 1/n *( O(n) + O(n^2) + sum {1 <= vj <= n -1, vj!= n-i}( T(vj) ))
We put O(n) and O(n^2) outside, loose a factor
T(n) = O(1) + O(n) + 1/n *( sum { 0 <= vj <= n -1, vj!= n-i} T(vj) )
Check the link on how this is computed.
For the non-randomized version :
You say yourself:
In avg randomizedPartition returns near middle of the array.
That is exactly why the randomized algorithm works and that is exactly what it is used to construct the deterministic algorithm. Ideally you want to pick the pivot deterministically such that it produces a good split, but the best value for a good split is already the solution! So at each step they want a value which is good enough, "at least 3/10 of the array below the pivot and at least 3/10 of the array above". To achieve this they split the original array in 5 at each step, and again it is a mathematical choice.
I once created an explanation for this (with diagram) on the Wikipedia page for it... http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

Resources