How does my randomly partitioned array look in the general case?

I have an array of n random integers.
I choose a random integer from the array and partition by it: all integers smaller than the chosen integer end up on the left side, all bigger integers on the right side.
What will be the sizes of the left and right sides in the average case, if we assume no duplicates in the array?
I can easily see that there is a 1/n chance that the array is split in half, if we are lucky. Likewise, there is a 1/n chance that the left side has length n/2 − 1 and the right side has length n/2 + 1, and so on.
Could we derive from this observation the "average" case?
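As a quick sanity check of the setup (a sketch of my own, not part of the question): with no duplicates, only the rank of the chosen pivot matters, so the split sizes can be simulated directly. The array size, trial count, and seed below are arbitrary choices of mine.

#include <stdio.h>
#include <stdlib.h>

/* Rough simulation sketch: if the pivot's rank is uniform over 0..n-1, the left
   part has `rank` elements and the right part has n - 1 - rank elements. */
int main(void)
{
    const int n = 1001, trials = 1000000;
    double left_sum = 0.0, right_sum = 0.0;

    srand(42);
    for (int t = 0; t < trials; t++) {
        int rank = rand() % n;          /* rank of the randomly chosen pivot */
        left_sum += rank;               /* elements smaller than the pivot   */
        right_sum += n - 1 - rank;      /* elements larger than the pivot    */
    }
    /* Both averages should come out close to (n - 1) / 2 = 500. */
    printf("average left size  = %.2f\n", left_sum / trials);
    printf("average right size = %.2f\n", right_sum / trials);
    return 0;
}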

You can probably find a better explanation (and certainly the proper citations) in a textbook on randomized algorithms, but here's the gist of average-case QuickSort, in two different ways.
First way
Let C(n) be the expected number of comparisons required for a random permutation of 1...n. Since the expectation of the sum of the comparisons made by the two recursive calls equals the sum of their expectations, we can write a recurrence that averages over the n possible divisions:
C(0) = 0
C(n) = n − 1 + (1/n) sum_{i=0}^{n−1} (C(i) + C(n−1−i))
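If you want to see the recurrence's values concretely, here is a small sketch of my own (all names in it are made up by me, not part of the answer) that tabulates C(n) directly:

#include <stdio.h>

#define MAX_N 2000

/* Evaluate the recurrence directly:
   C(0) = 0, C(n) = n - 1 + (1/n) * sum_{i=0}^{n-1} (C(i) + C(n-1-i)).
   By the symmetry noted below, the sum equals 2 * sum_{i=0}^{n-1} C(i). */
double C[MAX_N + 1];

int main(void)
{
    double prefix = 0.0;                 /* running sum_{i=0}^{n-1} C(i) */
    C[0] = 0.0;
    for (int n = 1; n <= MAX_N; n++) {
        prefix += C[n - 1];
        C[n] = (n - 1) + 2.0 * prefix / n;
    }
    for (int n = 100; n <= MAX_N; n += 500)
        printf("C(%d) = %.1f\n", n, C[n]);
    return 0;
}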
Rather than pull the exact solution out of a hat (or peek at the second way), I'll show you how I'd get an asymptotic bound.
First, I'd guess the asymptotic bound. Obviously I'm familiar with QuickSort and my reasoning here is fabricated, but since the best case is O(n log n) by the Master Theorem, that's a reasonable place to start.
Second, I'd guess an actual bound: 100 n log (n + 1). I use a big constant because why not? It doesn't matter for asymptotic notation and can only make my job easier. I use log (n + 1) instead of log n because log n is undefined for n = 0, and 0 log (0 + 1) = 0 covers the base case.
Third, let's try to verify the inductive step. Assuming that C(i) ≤ 100 i log (i + 1) for all i ∈ {0, ..., n−1},
C(n) = n − 1 + (1/n) sum_{i=0}^{n−1} (C(i) + C(n−1−i))          [by definition]
     = n − 1 + (2/n) sum_{i=0}^{n−1} C(i)                        [by symmetry]
     ≤ n − 1 + (2/n) sum_{i=0}^{n−1} 100 i log(i + 1)            [by the inductive hypothesis]
     ≤ n − 1 + (2/n) integral_{0}^{n} 100 x log(x + 1) dx        [upper Darboux sum]
     = n − 1 + (2/n) (50 (n² − 1) log(n + 1) − 25 (n − 2) n)     [WolframAlpha FTW, I forgot how to integrate]
     = n − 1 + 100 (n − 1/n) log(n + 1) − 50 (n − 2)
     = 100 (n − 1/n) log(n + 1) − 49 n + 100.
Well that's irritating. It's almost what we want, but that + 100 gets in the way of the induction. We can extend the base cases to n = 1 and n = 2 by inspection and then assume that n ≥ 3 to finish the bound:
C(n) ≤ 100 (n − 1/n) log(n + 1) − 49 n + 100
     ≤ 100 n log(n + 1) − 49 n + 100
     ≤ 100 n log(n + 1).                                         [since n ≥ 3 implies 49 n ≥ 100]
Once again, no one would publish such a messy derivation. I wanted to show how one could work it out formally without knowing the answer ahead of time.
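As a quick check of the extended base cases (my own arithmetic, with log meaning the natural log used in the integral above):

C(1) = 0 ≤ 100 · 1 · log 2 ≈ 69.3
C(2) = 1 + (1/2) ((C(0) + C(1)) + (C(1) + C(0))) = 1 ≤ 100 · 2 · log 3 ≈ 219.7

and the finishing step applies from n = 3 onward, since 49 · 3 = 147 ≥ 100.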
Second way
How else can we derive the number of comparisons QuickSort makes in expectation? Another possibility is to exploit linearity of expectation by summing, over each pair of elements, the probability that those elements are compared. What is that probability? Observe that a pair {i, j} (with i < j, identifying elements by rank) is compared if and only if, in the deepest recursive call whose subarray still contains both i and j, either i or j is chosen as the pivot. This happens with probability 2/(j + 1 − i), since that pivot must be i, j, or one of the j − (i + 1) elements whose values lie strictly between them, each equally likely. Therefore,
C(n) = sum_{i=1}^{n} sum_{j=i+1}^{n} 2/(j + 1 − i)
     = sum_{i=1}^{n} sum_{d=2}^{n+1−i} 2/d
     = sum_{i=1}^{n} 2 (H(n + 1 − i) − 1)                        [where H(m) is the m-th harmonic number]
     = 2 sum_{i=1}^{n} H(i) − 2n
     = 2 (n + 1) (H(n + 1) − 1) − 2n.                            [WolframAlpha FTW again]
Since H(n) is Θ(log n), this is Θ(n log n), as expected.
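Here is a quick numeric cross-check of my own (not from the answer): it evaluates the pairwise-probability double sum and the closed form 2 (n + 1) (H(n + 1) − 1) − 2n; the two should agree, and both should match the recurrence from the first way.

#include <stdio.h>

/* Compare the double sum of pairwise comparison probabilities with the
   closed form derived above. */
static double harmonic(int m)
{
    double h = 0.0;
    for (int k = 1; k <= m; k++) h += 1.0 / k;
    return h;
}

int main(void)
{
    for (int n = 10; n <= 1000; n *= 10) {
        double pair_sum = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = i + 1; j <= n; j++)
                pair_sum += 2.0 / (j + 1 - i);     /* Pr[i and j are compared] */
        double closed = 2.0 * (n + 1) * (harmonic(n + 1) - 1.0) - 2.0 * n;
        printf("n = %4d: double sum = %10.3f, closed form = %10.3f\n",
               n, pair_sum, closed);
    }
    return 0;
}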

Related

Proving Big Omega Function

I'm trying to find the n0 (n naught) for a function that is big Omega of n^3 with c = 2.25.
f(n) = 3n^3 − 39n^2 + 360n + 20. In order to prove that f(n) is Ω(n^3), we need constants c, n0 > 0 such that f(n) ≥ c n^3 for every n ≥ n0.
If c = 2.25, how do I find the smallest integer n0 that satisfies this?
My first thought was to plug in n = 1, because n > 0, and if the inequality worked, n = 1 would be the smallest n (and therefore n0). But the inequality has to be satisfied for every n ≥ n0, and if I plug in, for example, n = 15, the inequality doesn't hold.
You can solve this mathematically.
To make sure that I understand what you want, I will summarize what you are asking. You want to find the smallest integer n so that:
3n^3 − 39n^2 + 360n + 20 ≥ 2.25n^3     (1)
Any integer bigger than n must also satisfy inequality (1).
So here is my solution:
(1) <=> 0.75n^3 − 39n^2 + 360n + 20 ≥ 0
Let g(n) = 0.75n^3 − 39n^2 + 360n + 20 (calling it g to avoid clashing with the f(n) above).
g(n) = 0 at n1 ≈ −0.0552, n2 ≈ 12.079, and n3 ≈ 39.976.
If n < n1, g(n) < 0 (try this yourself).
If n1 < n < n2, g(n) > 0 (the sign alternates between consecutive roots).
If n2 < n < n3, g(n) < 0 (the sign alternates again).
If n > n3, g(n) > 0.
So to satisfy your requirement, the minimum value of n0 is 40, the smallest integer greater than n3.
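A small numeric spot check of my own (not part of the answer): it illustrates the sign pattern around the roots and the claim that n0 = 40; checking a finite range only illustrates the result, the sign analysis above is what proves it for every n ≥ 40.

#include <stdio.h>

/* g(n) = f(n) - 2.25 n^3; it changes sign near 12.079 and 39.976. */
static double g(double n)
{
    return 3*n*n*n - 39*n*n + 360*n + 20 - 2.25*n*n*n;
}

int main(void)
{
    printf("g(12) = %.2f, g(13) = %.2f\n", g(12), g(13));  /* positive, then negative */
    printf("g(39) = %.2f, g(40) = %.2f\n", g(39), g(40));  /* negative, then positive */
    for (int n = 40; n <= 100000; n++)
        if (g(n) < 0) { printf("fails at n = %d\n", n); return 1; }
    printf("f(n) >= 2.25 n^3 for every integer n in [40, 100000]\n");
    return 0;
}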
Think about it like this: after a certain point, 3n^3 − 39n^2 + 360n + 20 will always be greater than or equal to 2.25n^3, for the simple reason that the 3n^3 term eventually beats out the −39n^2 term. So f(n) never dips back below 2.25n^3 once n is large enough. You don't have to find the minimum n0; you can just choose a comfortably large value for n0, since the question only asks for some point after which the statement holds forever. Choose n0 to be, for example, some extremely large number X, and then use an inductive proof with X as the base case.

Complexity when generating all combinations

Interview questions where I start with "this might be solved by generating all possible combinations of the array elements" are usually meant to push me toward something better.
Anyway, I would like to add "I would definitely prefer another solution, since this is O(X)". The question is: what is the O(X) complexity of generating all combinations of a given set?
I know that there are n! / ((n−k)! k!) combinations (the binomial coefficient), but how do I get big-O notation from that?
First, there is nothing wrong with using O(n! / ((n−k)! k!)), or any other function f(n) as O(f(n)), but I believe you are looking for a simpler expression that still describes the same bound.
If you are willing to treat the size of the subset, k, as a constant, then for k ≤ n − k:
n! / ((n - k)! k!) = ((n - k + 1) (n - k + 2) (n - k + 3) ... n ) / k!
But the above is actually (n ^ k + O(n ^ (k - 1))) / k!, which is in O(n ^ k)
Similarly, if n - k < k, you get O(n ^ (n - k))
Which gives us O(n ^ min{k, n - k})
I know this is an old question, but it comes up as a top hit on google, and IMHO has an incorrectly marked accepted answer.
C(n,k) = n Choose k = n! / ( (n-k)! * k!)
The above function represents the number of k-element subsets that can be made from an n-element set. Purely from a logical reasoning perspective, C(n, k) has to be smaller than
∑ C(n, k) over all k ∊ {0, ..., n},
as this sum counts the power set. In English, the expression says: add up C(n, k) for every k from 0 to n. We know the power set of an n-element set has 2^n elements.
So C(n, k) has an upper bound of 2^n, which, unlike n^k, does not depend on k (and is the smaller of the two bounds whenever k > n / lg n).
So to answer your question: C(n, k) is certainly bounded above by 2^n, but I don't know whether there is a tighter upper bound that describes it better.
As a follow-up to #amit's answer: min{k, n − k} is at most n / 2.
Therefore, an upper bound for the "n choose k" complexity is O(n^(n/2)).
Case 1: n − k < k.
Suppose n = 11 and k = 8, so n − k = 3. Then
n! / ((n−k)! k!) = 11! / (3! 8!) = (11 × 10 × 9) / 3!
Overestimating each of the three factors by 11, this is at most (11 × 11 × 11) / 6 = O(11^3); since 11 = n, that is O(n^3), and since n − k = 3, it becomes O(n^(n−k)).
Case 2: k < n − k.
Suppose n = 11 and k = 3, so n − k = 8. Then
n! / ((n−k)! k!) = 11! / (8! 3!) = (11 × 10 × 9) / 3!
Again this is at most (11 × 11 × 11) / 6 = O(11^3); since 11 = n, that is O(n^3), and since k = 3, it becomes O(n^k).
Which gives us O(n^min{k, n−k}).
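If you want to see how tight these bounds are, here is a rough comparison sketch of my own (not from any of the answers) that prints C(n, k) next to n^min(k, n−k) and 2^n for a few (n, k) pairs:

#include <math.h>
#include <stdio.h>

/* Compare C(n, k) against the two upper bounds discussed here. */
static double binomial(int n, int k)
{
    double c = 1.0;
    if (k > n - k) k = n - k;                 /* C(n, k) = C(n, n - k) */
    for (int i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

int main(void)
{
    int cases[][2] = { {11, 8}, {11, 3}, {30, 4}, {30, 26}, {30, 15} };
    for (int t = 0; t < 5; t++) {
        int n = cases[t][0], k = cases[t][1];
        int m = k < n - k ? k : n - k;
        printf("C(%2d,%2d) = %-12.4g n^min(k,n-k) = %-12.4g 2^n = %.4g\n",
               n, k, binomial(n, k), pow(n, m), pow(2, n));
    }
    return 0;
}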

Time complexity of iterating over a k-layer deep loop nest always Θ(nᵏ)?

Many algorithms have loops in them that look like this:
for a from 1 to n
for b from 1 to a
for c from 1 to b
for d from 1 to c
for e from 1 to d
...
// Do O(1) work
In other words, the loop nest is k layers deep, the outer layer loops from 1 to n, and each inner layer loops up from 1 to the index above it. This shows up, for example, in code to iterate over all k-tuples of positions inside an array.
Assuming that k is fixed, is the runtime of this code always Θ(n^k)? For the special case where k = 1, the work is Θ(n) because it's just a standard loop over an array, and for the case where k = 2 the work is Θ(n²) because the total work done by the inner loop is given by
1 + 2 + ... + n = n(n + 1)/2 = Θ(n²)
Does this pattern continue when k gets large? Or is it just a coincidence?
Yes, the time complexity will be Θ(n^k). One way to measure the complexity of this code is to look at what values it generates. One particularly useful observation is that these loops iterate over all non-increasing k-tuples drawn from {1, 2, 3, ..., n} (that is, over all k-element multisets) and spend O(1) time producing each one of them, so the runtime is proportional to the number of such tuples. For fixed k, that count, C(n + k − 1, k), differs from the number of k-element subsets, C(n, k), only in lower-order terms, so it suffices to bound the latter. Given an n-element set, the number of k-element subsets is n choose k, which is equal to
n! / (k! (n − k)!)
This is given by
n (n − 1) (n − 2) ... (n − k + 1) / k!
This value is certainly no greater than
n · n · n · ... · n / k!   (with k copies of n)
= n^k / k!
This expression is O(n^k), since the 1/k! term is a fixed constant.
Similarly, when n − k + 1 ≥ n / 2 (which holds for all n ≥ 2(k − 1)), every factor is at least n / 2, so the expression is greater than or equal to
(n / 2) · (n / 2) · ... · (n / 2) / k!   (with k copies of n / 2)
= n^k / (k! 2^k)
This is Ω(n^k), since 1 / (k! 2^k) is a fixed constant.
Since the runtime is O(n^k) and Ω(n^k), the runtime is Θ(n^k).
Hope this helps!
You may also use the following closed form, where c is the number of constant-time operations inside the innermost loop, n is the number of elements, and r is the number of nested loops:
total work = c · C(n + r − 1, r) = c · n (n + 1) ... (n + r − 1) / r!
which is Θ(n^r) for fixed r.
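As a concrete check of my own (not part of the answer), the following counts the iterations of the r = 3 nest explicitly and compares them with C(n + 2, 3) = n (n + 1) (n + 2) / 6:

#include <stdio.h>

/* Count the iterations of the 3-deep nest and compare with the multiset
   count C(n + 2, 3), which is Theta(n^3) for this fixed depth. */
int main(void)
{
    for (int n = 5; n <= 50; n += 15) {
        long long count = 0;
        for (int a = 1; a <= n; a++)
            for (int b = 1; b <= a; b++)
                for (int c = 1; c <= b; c++)
                    count++;                         /* the O(1) body */
        long long formula = (long long)n * (n + 1) * (n + 2) / 6;
        printf("n = %2d: iterations = %6lld, C(n+2,3) = %6lld\n",
               n, count, formula);
    }
    return 0;
}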

complexity of a randomized search algorithm

Consider the following randomized search algorithm on a sorted array a of length n (in increasing order). x can be any element of the array.
#include <stdbool.h>
#include <stddef.h>

typedef int value_t;    /* element type; any totally ordered type works here */

/* rand_between(l, r) is assumed to return a uniformly random index in [l, r]. */
size_t rand_between(size_t l, size_t r);

size_t randomized_search(value_t a[], size_t n, value_t x)
{
    size_t l = 0;
    size_t r = n - 1;
    while (true) {
        size_t j = rand_between(l, r);
        if (a[j] == x) return j;
        if (a[j] < x) l = j + 1;
        if (a[j] > x) r = j - 1;
    }
}
What is the expected running time (as a tight Theta bound) of this function when x is selected uniformly at random from a?
Although this seems to be log(n), I carried out an experiment with instruction counts and found that the result grows a little faster than log(n) (according to my data, (log(n))^1.1 fits the result better).
Someone told me that this algorithm has an exact big-Theta complexity (so obviously log(n)^1.1 is not the answer). Could you please give the time complexity, along with your approach to proving it? Thanks.
Update: the data from my experiment, fitted in Mathematica against log(n) and against log(n)^1.1 (plots omitted).
If you're willing to switch to counting three-way compares, I can tell you the exact complexity.
Suppose that the key is at position i, and I want to know the expected number of compares with position j. I claim that position j is examined if and only if it's the first position between i and j inclusive to be examined. Since the pivot element is selected uniformly at random each time, this happens with probability 1/(|i - j| + 1).
The total complexity is the expectation over i <- {1, ..., n} of sum_{j=1}^n 1/(|i - j| + 1), which is
sum_{i=1}^n 1/n sum_{j=1}^n 1/(|i - j| + 1)
= 1/n sum_{i=1}^n (sum_{j=1}^i 1/(i - j + 1) + sum_{j=i+1}^n 1/(j - i + 1))
= 1/n sum_{i=1}^n (H(i) + H(n + 1 - i) - 1)
= 1/n sum_{i=1}^n H(i) + 1/n sum_{i=1}^n H(n + 1 - i) - 1
= 1/n sum_{i=1}^n H(i) + 1/n sum_{k=1}^n H(k) - 1 (k = n + 1 - i)
= 2 H(n + 1) - 3 + 2 H(n + 1)/n - 2/n
= 2 H(n + 1) - 3 + O(log n / n)
= 2 log n + O(1)
= Theta(log n).
(log means natural log here.) Note the -3 in the low order terms. This makes it look like the number of compares is growing faster than logarithmic at the beginning, but the asymptotic behavior dictates that it levels off. Try excluding small n and refitting your curves.
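If you want to reproduce the constant, here is a simulation sketch of my own (not part of the answer); it uses a small xorshift generator in place of the unspecified rand_between and counts one three-way compare per loop iteration:

#include <stdio.h>

/* Count three-way compares of randomized_search on the sorted array a[i] = i
   with the key drawn uniformly at random, and compare with 2 H(n+1) - 3. */
static unsigned long long rng_state = 88172645463325252ULL;

static unsigned long long xorshift64(void)          /* Marsaglia xorshift PRNG */
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

static size_t rand_between(size_t l, size_t r)      /* roughly uniform in [l, r] */
{
    return l + (size_t)(xorshift64() % (r - l + 1));
}

int main(void)
{
    const long long trials = 20000;
    for (size_t n = 1000; n <= 1000000; n *= 10) {
        long long compares = 0;
        for (long long t = 0; t < trials; t++) {
            size_t x = (size_t)(xorshift64() % n);   /* since a[i] = i, the key is its own index */
            size_t l = 0, r = n - 1;
            for (;;) {
                size_t j = rand_between(l, r);
                compares++;                          /* one three-way compare */
                if (j == x) break;
                if (j < x) l = j + 1; else r = j - 1;
            }
        }
        double h = 0.0;
        for (size_t k = 1; k <= n + 1; k++)
            h += 1.0 / (double)k;
        printf("n = %7zu: average compares = %.3f, 2 H(n+1) - 3 = %.3f\n",
               n, (double)compares / (double)trials, 2.0 * h - 3.0);
    }
    return 0;
}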
Assuming rand_between to implement sampling from a uniform probability distribution in constant time, the expected running time of this algorithm is Θ(lg n). Informal sketch of a proof: the expected value of rand_between(l, r) is (l+r)/2, the midpoint between them. So each iteration is expected to skip half of the array (assuming the size is a power of two), just like a single iteration of binary search would.
More formally, borrowing from an analysis of quickselect: observe that when you pick a probe position uniformly at random, half of the time it lands between ¼n and ¾n, in which case neither the left nor the right subarray has more than ¾n elements. The other half of the time, neither has more than n elements (obviously). That leads to the recurrence relation
T(n) = ½T(¾n) + ½T(n) + f(n)
where f(n) is the amount of work in each iteration. Subtracting ½T(n) from both sides, then doubling both sides, we have
½T(n) = ½T(¾n) + f(n)
T(n) = T(¾n) + 2f(n)
Now, since 2f(n) = Θ(1) = Θ(n^c log⁰ n) where c = log_{4/3} 1 = 0, it follows by the master theorem that T(n) = Θ(n⁰ lg n) = Θ(lg n).
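For a numeric illustration of that recurrence (my own quick sketch, not part of the answer), one can iterate T(n) = T(¾n) + 2f(n) with f(n) = 1 and compare it against log n:

#include <math.h>
#include <stdio.h>

/* Iterate T(n) = T(3n/4) + 2 with T(1) = 0 and compare with 2 log_{4/3}(n),
   the growth rate the master theorem predicts. */
static double T(double n)
{
    double t = 0.0;
    while (n > 1.0) {
        t += 2.0;          /* 2 f(n) with f(n) = 1 */
        n *= 0.75;         /* recurse on three quarters of the range */
    }
    return t;
}

int main(void)
{
    for (double n = 10.0; n <= 1.0e7; n *= 10.0)
        printf("n = %8.0f: T(n) = %4.0f, 2 log_{4/3} n = %7.2f\n",
               n, T(n), 2.0 * log(n) / log(4.0 / 3.0));
    return 0;
}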

Solving recurrences: Substitution method

I'm trying to follow Cormen's book "Introduction to Algorithms" (page 59, I believe) on the substitution method for solving recurrences. I don't get the notation used in the MERGE-SORT substitution:
T(n) ≤ 2(c ⌊n/2⌋lg(⌊n/2⌋)) + n
≤ cn lg(n/2) + n
= cn lg n - cn lg 2 + n
= cn lg n - cn + n
≤ cn lg n
The part I don't understand is how you turn ⌊n/2⌋ into n/2, given that it comes from the recursion. Can you explain the substitution method and its general thought process (especially the mathematical induction part) in a simple and easily understandable way? I know there's a great answer of that sort about big-O notation here on SO.
The idea behind the substitution method is to bound a function defined by a recurrence via strong induction. I'm going to assume that T(n) is an upper bound on the number of comparisons merge sort uses to sort n elements and define it by the following recurrence with boundary condition T(1) = 0.
T(n) = T(floor(n/2)) + T(ceil(n/2)) + n - 1.
Cormen et al. use n instead of n - 1 for simplicity and cheat by using floor twice. Let's not cheat.
Let H(n) be the hypothesis that T(n) ≤ c n lg n. Technically we should choose c right now, so let's set c = 100. Cormen et al. opt to write down statements that hold for every (positive) c until it becomes clear what c should be, which is an optimization.
The base cases are H(1) and H(2), namely T(1) ≤ 0 and T(2) ≤ 2 c. Okay, we don't need any comparisons to sort one element, and T(2) = T(1) + T(1) + 1 = 1 < 200.
Inductively, when n ≥ 3, assume for all 1 ≤ n' < n that H(n') holds. We need to prove H(n).
T(n) = T(floor(n/2)) + T(ceil(n/2)) + n - 1
≤ c floor(n/2) lg floor(n/2) + T(ceil(n/2)) + n - 1
by the inductive hypothesis H(floor(n/2))
≤ c floor(n/2) lg floor(n/2) + c ceil(n/2) lg ceil(n/2) + n - 1
by the inductive hypothesis H(ceil(n/2))
≤ c floor(n/2) lg (n/2) + c ceil(n/2) lg ceil(n/2) + n - 1
since 0 < floor(n/2) ≤ n/2 and lg is increasing
Now we have to deal with the consequences of our honesty and bound lg ceil(n/2).
lg ceil(n/2) = lg (n/2) + lg (ceil(n/2) / (n/2))
< lg (n/2) + lg ((n/2 + 1) / (n/2))
since 0 < ceil(n/2) ≤ n/2 + 1 and lg is increasing
= lg (n/2) + log (1 + 2/n) / log 2
≤ lg (n/2) + 2/(n log 2)
by the inequality log (1 + x) ≤ x, which can be proved with calculus
Okay, back to bounding T(n).
T(n) ≤ c floor(n/2) lg (n/2) + c ceil(n/2) (lg (n/2) + 2/(n log 2)) + n - 1
by the bound on lg ceil(n/2) derived above
= c n lg n - c n + n + 2 c ceil(n/2) / (n log 2) - 1
since floor(n/2) + ceil(n/2) = n and lg (n/2) = lg n - 1
≤ c n lg n - (c - 1) n + 2 c/log 2
since ceil(n/2) ≤ n
≤ c n lg n
since, for all n' ≥ 3, we have (c - 1) n' = 99 n' ≥ 297 > 200/log 2 ≈ 288.539.
Commentary
I guess this doesn't explain the why very well, but (hopefully) at least the derivations are correct in all of the details. People who write proofs like these often skip the base cases and ignore floor and ceil because, well, the details usually are just an annoyance that affects the constant c (which most computer scientists not named Knuth don't care about).
To me, the substitution method is for confirming a guess rather than formulating one. The interesting question is how one comes up with a guess. Personally, if the recurrence is (i) not something that looks like Fibonacci (e.g., linear homogeneous recurrences) and (ii) not covered by Akra–Bazzi, a generalization of the Master Theorem, then I'm going to have some trouble coming up with a good guess.
Also, I should mention the most common failure mode of the substitution method: if one can't quite choose c large enough to swallow the extra terms from the subproblems, then the guessed bound may simply be wrong. On the other hand, sometimes adding more base cases is all that's needed. In the preceding proof, I used two base cases because I couldn't prove the very last inequality unless I knew that n > 2/log 2.
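If you want to convince yourself numerically (a sketch of my own, not part of the answer), the recurrence can be tabulated and checked against the bound directly:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_N 2000000

/* Tabulate T(n) = T(floor(n/2)) + T(ceil(n/2)) + n - 1 with T(1) = 0 and
   confirm T(n) <= 100 n lg n, the bound proved above. (The check even passes
   with c = 1 on this range, which shows how much slack c = 100 leaves.) */
int main(void)
{
    long long *T = malloc((MAX_N + 1) * sizeof *T);
    if (T == NULL) return 1;

    T[1] = 0;
    for (long long n = 2; n <= MAX_N; n++)
        T[n] = T[n / 2] + T[n - n / 2] + n - 1;    /* floor and ceiling halves */

    for (long long n = 2; n <= MAX_N; n++)
        if ((double)T[n] > 100.0 * (double)n * log2((double)n)) {
            printf("bound fails at n = %lld\n", n);
            free(T);
            return 1;
        }

    printf("T(n) <= 100 n lg n for every n in [2, %d]\n", MAX_N);
    free(T);
    return 0;
}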
