Is there a better algorithm to assign numbers to combinations?

It is well known that Pascal's identity can be used to encode a combination of k elements out of n into a number from 0 to (n \choose k) - 1 (let's call this number a combination index) using a combinatorial number system. Assuming constant time for arithmetic operations, this algorithm takes O(n) time.†
I have an application where k ≪ n and an algorithm in O(n) time is infeasible. Is there an algorithm to bijectively assign a number between 0 and (n \choose k) - 1 to a combination of k elements out of n whose runtime is of order O(k) or similar? The algorithm does not need to compute the same mapping as the combinatorial number system, however, the inverse needs to be computable in a similar time complexity.
† More specifically, the algorithm computing the combination from the combination index runs in O(n) time. Computing the combination index from the combination works in O(k) time if you pre-compute the binomial coefficients.
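For reference, here is a minimal Python sketch of the combinatorial number system mapping described above (the function names are mine, and math.comb assumes Python 3.8+). Encoding is O(k) once binomial coefficients are available in O(1); the greedy decode below searches for each digit with a simple upward scan, which is exactly the kind of non-O(k) work the question wants to avoid.

from math import comb  # Python 3.8+

def combination_to_index(combo):
    """Map a sorted k-combination c_1 < c_2 < ... < c_k of {0, ..., n-1}
    to its index in [0, C(n, k)): index = sum_i C(c_i, i)."""
    return sum(comb(c, i) for i, c in enumerate(combo, start=1))

def index_to_combination(index, k):
    """Inverse mapping: for i = k down to 1, take the largest c with
    C(c, i) <= index. The upward scan is what keeps this from being O(k)."""
    combo = []
    for i in range(k, 0, -1):
        c = i - 1
        while comb(c + 1, i) <= index:
            c += 1
        index -= comb(c, i)
        combo.append(c)
    return combo[::-1]

assert combination_to_index([2, 5, 7]) == 47
assert index_to_combination(47, 3) == [2, 5, 7]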

An approach described in a comment:
For a given combination index N, finding the k-th digit means finding c_k such that (c_k \choose k) <= N and ((c_k + 1) \choose k) > N.
Set P(i, k) = i!/(i-k)!. Then
P(i, k) = i * (i-1) * ... * (i-k+1)
Substituting x = i - (k-1)/2 and pairing the factors symmetrically,
P(i, k) = (x+(k-1)/2) * (x+(k-1)/2-1) * ... * (x-(k-1)/2+1) * (x-(k-1)/2)
        = (x^2 - ((k-1)/2)^2) * (x^2 - ((k-1)/2-1)^2) * ...
        = x^k - (sum_i ((k-2i-1)/2)^2) * x^(k-2) + O(x^(k-4))
        = x^k - O(x^(k-2))
so P(i, k) = (i - (k-1)/2)^k - O(i^(k-2)).
From the inequality above:
(c_k \choose k) <= N
P(c_k, k) <= N * k!
c_k ≈ (N * k!)^(1/k) + (k-1)/2
I am not sure how large the O(c_k^(k-2)) part is; I suppose it does not influence the estimate too much. If it is of the order of (c_k+1)/(c_k-k+1), the approximation is very good, because
((c_k+1) \choose k) = (c_k \choose k) * (c_k + 1) / (c_k - k + 1)
I would try an algorithm like this:
For a given k:
    precalculate i! for i = 1, ..., k
For a given N:
    for i in (k, k-1, ..., 1):
        estimate c_i as (N * i!)^(1/i) + (i-1)/2
        (*) compare P(c_i, i) with N * i!
            if it is smaller, also check c_i + 1
            if it is larger, check c_i - 1
        repeat (*) until P(c_i, i) <= N * i! < P(c_i + 1, i)
        N = N - (c_i \choose i)    (that is, N - P(c_i, i)/i!)
If the approximation is good, the number of correction steps is much smaller than k, so finding one digit is O(k); a sketch in code follows below.
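Here is a minimal sketch of that idea (my own hypothetical implementation of the comment's outline, assuming Python 3.8+ for math.comb): each digit starts from the floating-point estimate (N * i!)^(1/i) + (i-1)/2 and is then corrected with exact integer binomial checks, comparing (c \choose i) with N directly rather than P(c, i) with N * i!, which is equivalent. For very large N the floating-point estimate would need more care, but the exact checks keep the result correct.

from math import comb, factorial

def digit_estimate(N, i):
    """Initial guess c_i ~ (N * i!)**(1/i) + (i - 1)/2, as in the comment."""
    if N == 0:
        return i - 1
    return int(round((N * factorial(i)) ** (1.0 / i) + (i - 1) / 2.0))

def decode_index(N, k):
    """Decode a combination index N into digits c_k > ... > c_1, correcting
    each estimate until C(c_i, i) <= N < C(c_i + 1, i)."""
    combo = []
    for i in range(k, 0, -1):
        c = max(digit_estimate(N, i), i - 1)
        while comb(c, i) > N:       # estimate too high: step down
            c -= 1
        while comb(c + 1, i) <= N:  # estimate too low: step up
            c += 1
        N -= comb(c, i)             # remove this digit's contribution
        combo.append(c)
    return combo[::-1]

assert decode_index(47, 3) == [2, 5, 7]  # 47 = C(7,3) + C(5,2) + C(2,1)

With a good estimate the two correction loops run only a handful of times per digit, which is the behaviour the comment is counting on.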

Related

Why is m + k log m = O(m + k log k)?

Paredes and Navarro state that
m + k log m = O(m + k log k)
This gives an immediate "tighter looking" bound for incremental sorting. That is, if a partial or incremental sorting algorithm is O(m + k log m), then it is automatically O(m + k log k), where the k smallest elements are sorted from a set of size m. Unfortunately, their explanation is rather difficult for me to understand. Why does it hold?
Specifically, they state
Note that m + k log m = O(m + k log k), as they can differ only
if k = o(m^α) for any α > 0, and then m dominates k log m.
This seems to suggest they're talking about k as a function of m along some path, but it's very hard to see how k = o(m^α) plays into things, or where to place the quantifiers in their statement.
There are various ways to define big-O notation for multi-variable functions, which would seem to make the question difficult to approach. Fortunately, it doesn't actually matter exactly which definition you pick, as long as you make the entirely reasonable assumption that m > 0 and k >= 1. That is, in the incremental sorting context, you assume that you need to obtain at least the first element from a set with at least one element.
Theorem
If m and k are real numbers, m > 0, and k >= 1, then m + k log m <= 2(m + k log k).
Proof
Suppose for the sake of contradiction that
m + k log m > 2(m + k log k)
Rearranging terms,
k log m - 2k log k > m
By the product property for logarithms,
k log m - k (log (k^2)) > m
By the quotient property for logarithms,
k (log (m / k^2)) > m
Dividing by k (which is positive),
log (m / k^2) > m/k
Since k >= 1, k^2 >= k, so (since m >= 0) m / k >= m / k^2. Thus
log (m / k^2) > m / k^2
The logarithm of a number can never exceed that number, so we have reached a contradiction.
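Not part of the proof, but as a quick numerical sanity check, here is a small Python sketch that spot-checks the inequality for random m > 0 and k >= 1. Base-2 logarithms are my assumption; any base for which log(x) <= x (base 2, base e, ...) works the same way.

import math
import random

# Spot-check m + k*log(m) <= 2*(m + k*log(k)) for m > 0 and k >= 1.
random.seed(0)
for _ in range(100000):
    m = random.uniform(1e-3, 1e6)
    k = random.uniform(1.0, 1e6)
    lhs = m + k * math.log2(m)
    rhs = 2 * (m + k * math.log2(k))
    assert lhs <= rhs + 1e-6, (m, k)
print("no counterexample found")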

How does my randomly partitioned array look in the general case?

I have an array of n random integers
I choose a random integer and partition by the chosen random integer (all integers smaller than the chosen integer will be on the left side, all bigger integers will be on the right side)
What will be the size of my left and right side in the average case, if we assume no duplicates in the array?
I can easily see that there is a 1/n chance that the array is split in half, if we are lucky. Additionally, there is a 1/n chance that the array is split so that the left side has length n/2 − 1 and the right side n/2 + 1, and so on.
Could we derive from this observation the "average" case?
You can probably find a better explanation (and certainly the proper citations) in a textbook on randomized algorithms, but here's the gist of average-case QuickSort, in two different ways.
First way
Let C(n) be the expected number of comparisons QuickSort performs on a uniformly random permutation of 1...n. Since the expectation of the sum of the comparisons made by the two recursive calls equals the sum of their expectations, we can write a recurrence that averages over the n possible splits:
C(0) = 0
C(n) = n − 1 + (1/n) * sum_{i=0}^{n−1} (C(i) + C(n−1−i))
Rather than pull the exact solution out of a hat (or peek at the second way), I'll show you how I'd get an asymptotic bound.
First, I'd guess the asymptotic bound. Obviously I'm familiar with QuickSort and my reasoning here is fabricated, but since the best case is O(n log n) by the Master Theorem, that's a reasonable place to start.
Second, I'd guess an actual bound: 100 n log (n + 1). I use a big constant because why not? It doesn't matter for asymptotic notation and can only make my job easier. I use log (n + 1) instead of log n because log n is undefined for n = 0, and 0 log (0 + 1) = 0 covers the base case.
Third, let's try to verify the inductive step. Assuming that C(i) ≤ 100 i log (i + 1) for all i ∈ {0, ..., n−1},
C(n) = n − 1 + (1/n) * sum_{i=0}^{n−1} (C(i) + C(n−1−i))        [by definition]
     = n − 1 + (2/n) * sum_{i=0}^{n−1} C(i)                      [by symmetry]
     ≤ n − 1 + (2/n) * sum_{i=0}^{n−1} 100 i log(i + 1)          [by the inductive hypothesis]
     ≤ n − 1 + (2/n) * ∫_0^n 100 x log(x + 1) dx                 [the summand is increasing, so the sum is at most the integral]
     = n − 1 + (2/n) * (50 (n² − 1) log (n + 1) − 25 (n − 2) n)  [WolframAlpha FTW, I forgot how to integrate]
     = n − 1 + 100 (n − 1/n) log (n + 1) − 50 (n − 2)
     = 100 (n − 1/n) log (n + 1) − 49 n + 99.
Well, that's irritating. It's almost what we want, but that + 99 messes things up a little. We can check the base cases n = 1 and n = 2 by inspection and then assume n ≥ 3 to finish the bound:
C(n) = 100 (n − 1/n) log (n + 1) − 49 n + 99
     ≤ 100 n log (n + 1) − 49 n + 99
     ≤ 100 n log (n + 1). [since n ≥ 3 implies 49 n ≥ 99]
Once again, no one would publish such a messy derivation. I wanted to show how one could work it out formally without knowing the answer ahead of time.
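To see the induction land numerically, here is a small sketch (mine, not from the answer) that computes C(n) directly from the recurrence and compares it with the guessed bound 100 * n * log(n + 1), using natural logarithms to match the integral above.

import math

def expected_comparisons(limit):
    """C(n) = n - 1 + (2/n) * sum_{i=0}^{n-1} C(i), computed bottom-up."""
    C = [0.0] * (limit + 1)
    prefix = 0.0  # running value of C(0) + ... + C(n-1)
    for n in range(1, limit + 1):
        C[n] = n - 1 + 2.0 * prefix / n
        prefix += C[n]
    return C

C = expected_comparisons(1000)
for n in (1, 10, 100, 1000):
    bound = 100 * n * math.log(n + 1)
    print(n, round(C[n], 2), "<=", round(bound, 2), C[n] <= bound)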
Second way
How else can we derive the expected number of comparisons QuickSort makes? Another possibility is to exploit the linearity of expectation by summing, over each pair of elements, the probability that the pair is compared. What is that probability? We observe that a pair {i, j} is compared if and only if, in the deepest recursive call whose subarray contains both i and j, either i or j is chosen as the pivot. This happens with probability 2/(j + 1 − i), since the pivot must be i, j, or one of the j − i − 1 elements that lie between them. Therefore,
C(n) = sum_{i=1}^n sum_{j=i+1}^n 2/(j + 1 − i)
     = sum_{i=1}^n sum_{d=2}^{n+1−i} 2/d
     = sum_{i=1}^n 2 (H(n+1−i) − 1)        [where H(m) is the m-th harmonic number]
     = 2 sum_{i=1}^n H(i) − 2n
     = 2 (n + 1) (H(n+1) − 1) − 2n.        [WolframAlpha FTW again]
Since H(n) is Θ(log n), this is Θ(n log n), as expected.
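As a cross-check (again my own sketch, not part of the answer), the closed form 2 (n + 1) (H(n+1) − 1) − 2n can be compared against the recurrence from the first way using exact rational arithmetic; it also equals the textbook value 2 (n + 1) H(n) − 4n.

from fractions import Fraction

def check(limit):
    """Verify 2*(n+1)*(H(n+1) - 1) - 2*n against C(n) = n-1 + (2/n)*sum_{i<n} C(i)."""
    C = [Fraction(0)] * (limit + 1)
    prefix = Fraction(0)   # C(0) + ... + C(n-1)
    H = Fraction(0)        # harmonic number H(n), built incrementally
    for n in range(1, limit + 1):
        C[n] = n - 1 + 2 * prefix / n
        prefix += C[n]
        H += Fraction(1, n)
        closed = 2 * (n + 1) * (H + Fraction(1, n + 1) - 1) - 2 * n
        assert C[n] == closed, n
    print("closed form matches the recurrence up to n =", limit)

check(200)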

Time complexity: theory vs reality

I'm currently doing an assignment that requires us to discuss time complexities of different algorithms.
Specifically sum1 and sum2
def sum1(a):
    """Return the sum of the elements in the list a."""
    n = len(a)
    if n == 0:
        return 0
    if n == 1:
        return a[0]
    return sum1(a[:n/2]) + sum1(a[n/2:])

def sum2(a):
    """Return the sum of the elements in the list a."""
    return _sum(a, 0, len(a)-1)

def _sum(a, i, j):
    """Return the sum of the elements from a[i] to a[j]."""
    if i > j:
        return 0
    if i == j:
        return a[i]
    mid = (i+j)/2
    return _sum(a, i, mid) + _sum(a, mid+1, j)
Using the Master theorem, my best guess for both of these is
T(n) = 2*T(n/2)
which, according to Wikipedia, should come out to O(n) if I haven't made any mistakes in my assumptions. However, when I benchmark with arrays of length N filled with random integers in the range 1 to 100, I get the following result.
I've tried running the benchmark multiple times and I get the same result each time: sum2 seems to be about twice as fast as sum1, which baffles me, since they should perform the same number of operations.
My question is: are these algorithms both linear, and if so, why do their run times differ?
If it matters, I'm running these tests on Python 2.7.14.
sum1 looks like O(n) on the surface, but its recurrence is actually T(n) = 2T(n/2) + 2*(n/2), because the list slicing operations are themselves O(n). Using the master theorem, the complexity becomes O(n log n), which causes the difference.
Thus, for sum1, the time taken is t1 = k1 * n log n; for sum2, it is t2 = k2 * n.
Since you are plotting a time vs log n graph, let x = log n. Then,
t1 = k1 * x * 10^x
t2 = k2 * 10^x
With suitable values for k1 and k2, you get a graph very similar to yours. From your data, when x = 6, 0.6 ~ k1 * 6 * 10^6 or k1 ~ 10^(-7) and 0.3 ~ k2 * 10^6 or k2 = 3 * 10^(-7).
Your graph has log10(N) on the x-axis, which means that the right-most data points are for an N value that's ten times the previous ones. And, indeed, they take roughly ten times as long. So that's a linear progression, as you expect.
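If you want to see the gap directly, here is a minimal timing sketch (mine, not from the question). It assumes the sum1/sum2 definitions above and, like the question, Python 2.7, where n/2 on ints is floor division; the ratio t1/t2 should creep upward roughly like log N, reflecting the extra O(n) of slicing per level in sum1.

from __future__ import print_function
import random
import timeit

# Compare sum1 (O(n log n) because of slicing) with sum2 (O(n)) on the same data.
for n in (10**4, 10**5, 10**6):
    a = [random.randint(1, 100) for _ in range(n)]
    t1 = timeit.timeit(lambda: sum1(a), number=3)
    t2 = timeit.timeit(lambda: sum2(a), number=3)
    print(n, round(t1, 3), round(t2, 3), round(t1 / t2, 2))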

Big(O) for this algorithm

What is the big-O of this algorithm?
I know it looks similar to O(log(n)), but instead of i being halved each time, the divisor itself doubles each iteration, so i shrinks much faster than that.
sum = 0
i = n
j = 2
while (i >= 1)
    sum = sum + i
    i = i / j
    j = 2 * j
After the k-th iteration of the loop, i has been divided by
d := 2^(k * (k + 1) / 2)
(the product 2 * 4 * ... * 2^k). Thus you have to solve for when d becomes larger than n, which makes the quotient drop below 1:
2^(k * (k + 1) / 2) > n
for k, with n fixed. Entering
solve 2^(k * (k + 1) / 2) > n for k
in WolframAlpha gives k > (sqrt(8 * log2(n) + 1) − 1) / 2 for n > 1.
Thus, you have a running time of O(sqrt(log n)) for your algorithm, when you remove the irrelevant constants from the formula.
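A small sketch of my own that counts the loop's iterations and compares them with sqrt(2 * log2(n)), which is what k(k + 1)/2 ≈ log2(n) predicts; integer division stands in for i = i/j.

import math

def iterations(n):
    """Count iterations of: while i >= 1: i = i / j; j = 2 * j."""
    i, j, k = n, 2, 0
    while i >= 1:
        i //= j
        j *= 2
        k += 1
    return k

for e in (10, 100, 1000, 10000):
    n = 2 ** e  # so that log2(n) = e exactly
    print(e, iterations(n), round(math.sqrt(2 * e), 1))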

complexity of a randomized search algorithm

Consider the following randomized search algorithm on a sorted array a of length n (in increasing order). x can be any element of the array.
size_t randomized_search(value_t a[], size_t n, value_t x)
{
    size_t l = 0;
    size_t r = n - 1;
    while (true) {
        size_t j = rand_between(l, r);
        if (a[j] == x) return j;
        if (a[j] < x) l = j + 1;
        if (a[j] > x) r = j - 1;
    }
}
What is the expected complexity (in big-Theta terms) of this function when x is selected uniformly at random from a?
Although this seems to be log(n), I carried out an experiment counting instructions, and found that the result grows a little faster than log(n) (according to my data, even (log(n))^1.1 fits the result better).
Someone told me that this algorithm has an exact big-Theta complexity (so obviously log(n)^1.1 is not the answer). So, could you please give the time complexity along with your approach to proving it? Thanks.
Update: the data from my experiment, with Mathematica fits to log(n) and to log(n)^1.1 (plots not reproduced here).
If you're willing to switch to counting three-way compares, I can tell you the exact complexity.
Suppose that the key is at position i, and I want to know the expected number of comparisons involving position j. I claim that position j is examined if and only if it is the first position in the range between i and j, inclusive, to be picked as a pivot. Since the pivot element is selected uniformly at random from the current range each time, this happens with probability 1/(|i − j| + 1).
The total complexity is the expectation over i <- {1, ..., n} of sum_{j=1}^n 1/(|i - j| + 1), which is
sum_{i=1}^n 1/n sum_{j=1}^n 1/(|i - j| + 1)
= 1/n sum_{i=1}^n (sum_{j=1}^i 1/(i - j + 1) + sum_{j=i+1}^n 1/(j - i + 1))
= 1/n sum_{i=1}^n (H(i) + H(n + 1 - i) - 1)
= 1/n sum_{i=1}^n H(i) + 1/n sum_{i=1}^n H(n + 1 - i) - 1
= 1/n sum_{i=1}^n H(i) + 1/n sum_{k=1}^n H(k) - 1 (k = n + 1 - i)
= 2 H(n + 1) - 3 + 2 H(n + 1)/n - 2/n
= 2 H(n + 1) - 3 + O(log n / n)
= 2 log n + O(1)
= Theta(log n).
(log means natural log here.) Note the -3 in the low order terms. This makes it look like the number of compares is growing faster than logarithmic at the beginning, but the asymptotic behavior dictates that it levels off. Try excluding small n and refitting your curves.
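For what it's worth, here is a small Monte Carlo sketch of my own that re-implements the loop in Python, counts one three-way probe per iteration, and compares the average with 2 H(n+1) − 3 + 2 H(n+1)/n − 2/n. On a sorted array of distinct values, comparing a[j] with x is the same as comparing j with the key's position, so only indices are simulated.

import random

def probes(n, target):
    """Number of positions examined until the randomized search hits target."""
    l, r, count = 0, n - 1, 0
    while True:
        j = random.randint(l, r)
        count += 1
        if j == target:
            return count
        if j < target:
            l = j + 1
        else:
            r = j - 1

def expected(n):
    H = lambda m: sum(1.0 / i for i in range(1, m + 1))
    return 2 * H(n + 1) - 3 + 2 * H(n + 1) / n - 2.0 / n

n, trials = 1000, 100000
avg = sum(probes(n, random.randrange(n)) for _ in range(trials)) / float(trials)
print(round(avg, 3), "vs", round(expected(n), 3))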
Assuming rand_between to implement sampling from a uniform probability distribution in constant time, the expected running time of this algorithm is Θ(lg n). Informal sketch of a proof: the expected value of rand_between(l, r) is (l+r)/2, the midpoint between them. So each iteration is expected to skip half of the array (assuming the size is a power of two), just like a single iteration of binary search would.
More formally, borrowing from an analysis of quickselect, observe that when you pick a random pivot position, half of the time it will lie between ¼n and ¾n, in which case neither the left nor the right subarray has more than ¾n elements. The other half of the time, neither has more than n elements (obviously). That leads to the recurrence relation
T(n) = ½T(¾n) + ½T(n) + f(n)
where f(n) is the amount of work in each iteration. Subtracting ½T(n) from both sides, then doubling both sides, we have
½T(n) = ½T(¾n) + f(n)
T(n) = T(¾n) + 2f(n)
Now, since 2f(n) = Θ(1) = Θ(nᶜ log⁰ n) where c = log_{4/3}(1) = 0, it follows by the master theorem that T(n) = Θ(n⁰ lg n) = Θ(lg n).
