"Programming Pearls": Sampling m elements from a sequence - algorithm

From Programming Pearls: Column 12: A Sample Problem:
The input consists of two integers m and n, with m < n. The output is
a sorted list of m random integers in the range 0..n-1 in which no
integer occurs more than once. For probability buffs, we desire a
sorted selection without replacement in which each selection occurs
with equal probability.
The author provides one solution:
    initialize set S to empty
    size = 0
    while size < m do
        t = bigrand() % n
        if t is not in S
            insert t into S
            size++
    print the elements of S in sorted order
In the above pseudocode, bigrand() is a function that returns a large random integer (much larger than m and n).
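For readers who want to experiment, here is a minimal Python sketch of the pseudocode above. It assumes that `random.randrange(n)` is an acceptable stand-in for `bigrand() % n`; the function name `sample_sorted` and the `rng` parameter are made up for this illustration and are not from the book.

```python
import random

def sample_sorted(m, n, rng=random):
    """Return a sorted list of m distinct integers drawn from 0..n-1."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    chosen = set()
    while len(chosen) < m:
        # randrange(n) stands in for bigrand() % n (an assumption);
        # adding a value already in the set changes nothing, so it is redrawn
        chosen.add(rng.randrange(n))
    return sorted(chosen)

print(sample_sorted(3, 10))
```

The set plays the role of S, and `len(chosen)` replaces the explicit `size` counter.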
Can anyone help me prove the correctness of the above algorithm?
As I understand it, every possible output should occur with probability 1/C(n, m).
How can I prove that the algorithm produces each output with probability 1/C(n, m)?

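Each output this algorithm yields is valid: m distinct integers in the range 0..n-1.
How many equally likely outcomes are there?
Before the last line (the sort), the elements are inserted in some order. There are n*(n-1)*(n-2)*...*(n-m+1) = n!/(n-m)! possible insertion orders, and each is equally likely: assuming bigrand() % n is uniform on 0..n-1, every accepted t is uniform over the values not yet in S.
Sorting maps exactly m! insertion orders to the same output, so it reduces the number of distinct results by a factor of m!.
So the number of possible outputs is n!/((n-m)!*m!), each with the same probability, and this is what you asked for:
n!/((n-m)!m!) = C(n,m)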
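The counting argument can be checked by brute force for small n and m. The sketch below is an illustration, not part of the original answer: it enumerates every ordered selection of m distinct values and groups them by their sorted form. The helper name `output_counts` is hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import comb, factorial

def output_counts(n, m):
    """Count how many insertion orders lead to each sorted output."""
    return Counter(tuple(sorted(order)) for order in permutations(range(n), m))

counts = output_counts(5, 3)
# Number of distinct sorted outputs should equal C(n, m)
print(len(counts) == comb(5, 3))
# Every output should be reached by exactly m! insertion orders
print(set(counts.values()) == {factorial(3)})
```

Since every insertion order is equally likely and every output collects the same number of orders, every output is equally likely.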

Related

minimize variance of k integers from n ordered integers

Given a series of n integers and a number k, n>k, how can the variance of k new integers be minimized? You may add up any run of successive integers into a new integer, thus reducing the n integers to k integers.
Here is an example. Given n=4, k=2, and the series 4,4,1,1, the solution is 4,6 rather than 8,2 or 9,1.
I have come up with a greedy algorithm that goes like this: for every possible new integer, minimize the absolute value of the difference between this integer and the average of all the integers. But this fails in some cases. Is there an efficient algorithm that works?
The variance of a random variable X is E[(X - E[X])^2]. Here X is a random element of the output list. We know that E[X] is equal to the sum of the input numbers divided by k, so minimizing the variance is equivalent to minimizing the sum of (x - sum/k)^2 over output values x. This can be accomplished by slightly modifying a word wrap algorithm: Word wrap to X lines instead of maximum width (Least raggedness)
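As an illustration of that objective, here is a sketch of a straightforward dynamic program over cut positions. It is not the linked word wrap algorithm itself, just a plain O(k*n^2) version of the same idea, and the name `min_variance_groups` is made up for this example.

```python
def min_variance_groups(nums, k):
    """Split nums into k non-empty contiguous groups; return the group sums
    that minimize the sum of (group_sum - total/k)^2."""
    n = len(nums)
    prefix = [0]
    for x in nums:
        prefix.append(prefix[-1] + x)
    mean = prefix[n] / k
    inf = float("inf")
    # best[j][i]: minimal cost of splitting the first i numbers into j groups
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):          # last group is nums[s:i]
                cost = best[j - 1][s] + (prefix[i] - prefix[s] - mean) ** 2
                if cost < best[j][i]:
                    best[j][i] = cost
                    cut[j][i] = s
    # Walk the recorded cuts backwards to recover the group sums
    sums = []
    i = n
    for j in range(k, 0, -1):
        s = cut[j][i]
        sums.append(prefix[i] - prefix[s])
        i = s
    sums.reverse()
    return sums

print(min_variance_groups([4, 4, 1, 1], 2))
```

On the example from the question this yields the sums 4 and 6.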

Why is counting sort not used for large inputs? [duplicate]

This question already has answers here:
Why we can not apply counting sort to general arrays?
Counting sort is a sorting algorithm with a time complexity of O(n+K); it assumes that each input element is an integer in the range 0 to K.
Why can't we linear-search for the maximum value in an unsorted array, set K to it, and then apply counting sort?
In the case where your inputs are arrays with maximum - minimum = O(n log n) (i.e. the range of values is reasonably restricted), this actually makes sense. If this is not the case, a standard comparison-based sort algorithm or even an integer sorting algorithm like radix sort is asymptotically better.
To give you an example, the following function generates a family of inputs on which counting sort has runtime complexity Θ(n^2):
    import random

    def generate_input(n):
        # the values 1, 4, 9, ..., n^2 span a range of size Θ(n^2)
        array = []
        for i in range(1, n + 1):
            array.append(i * i)
        random.shuffle(array)
        return array
The title of your question is "Why is counting sort not used for large inputs?"
What do we do in counting sort? We take another array (say b[]) and initialize all its elements to zero. Then, for each element of the given array, we increment the counter at the index equal to that element. Then we loop from the lower limit to the upper limit of the given array's values and check whether b[] at each index is zero. If it is not zero, that index is an element of the given array.
Now, if the difference between these two limits (upper and lower) is very large (like 10^9 or more), then that single loop is enough to kill our PC. :)
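A minimal sketch of the procedure just described may make the cost visible. It assumes integer inputs and finds the lower and upper limits with linear scans, as the question proposes; `counting_sort` is just an illustrative name.

```python
def counting_sort(values):
    """Sort integers using one counter per value in [min, max]."""
    if not values:
        return []
    lo, hi = min(values), max(values)     # linear scans for the limits
    counts = [0] * (hi - lo + 1)          # one slot per possible value: the O(k) part
    for v in values:
        counts[v - lo] += 1
    result = []
    for offset, c in enumerate(counts):   # walks the whole range, empty slots included
        result.extend([lo + offset] * c)
    return result

print(counting_sort([3, 1, 2, 1, -2]))
```

With an input such as `[0, 10**9]` the `counts` list would need a billion and one slots for just two elements, which is the problem the answer points at.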
According to the definition of Big-O notation, f(n) ∈ O(g(n)) means that there exist constants C > 0 and N such that f(n) ≤ C*g(n) for all n ≥ N. Nothing is said about how large C is, nor about the N from which the inequality holds.
In any algorithm analysis, the cost of each operation of the Turing machine must be considered (compare, move, sum, etc.). These costs determine how big (or small) the values of C and N must be for the inequality to hold. Ignoring these costs is a naive assumption I myself used to make during my algorithm analysis course.
The statement "counting sort is O(n+k)" actually means that the running time is bounded by C*(n+k) for some constant C once the input is large enough. Thus other algorithms may perform better on smaller inputs, because the inequality is only guaranteed under those conditions.

Partitioning a list of integers to minimize difference of their sums

Given a list of integers l, how can I partition it into 2 lists a and b such that d(a,b) = abs(sum(a) - sum(b)) is minimum. I know the problem is NP-complete, so I am looking for a pseudo-polynomial time algorithm i.e. O(c*n) where c = sum(l map abs). I looked at Wikipedia but the algorithm there is to partition it into exact halves which is a special case of what I am looking for...
EDIT:
To clarify, I am looking for the exact partitions a and b and not just the resulting minimum difference d(a, b)
To generalize, what is a pseudo-polynomial time algorithm to partition a list of n numbers into k groups g1, g2 ...gk such that (max(S) - min(S)).abs is as small as possible where S = [sum(g1), sum(g2), ... sum(gk)]
A naive, trivial, and still pseudo-polynomial solution would be to use an existing subset-sum solver and try each target from sum(array)/2 down to 0 (returning the first subset found).
Complexity of this solution will be O(W^2*n) where W is the sum of the array.
pseudo code:
    for cand from sum(array)/2 to 0 descending:
        subset <- subsetSumSolver(array,cand)
        if subset != null:
            return subset
The above returns the subset with the largest sum that is less than or equal to sum(array)/2; the other part is the complement of the returned subset.
However, the dynamic programming table for subset-sum is enough on its own.
Recall that the formula is:
f(0,i) = true
f(x,0) = false | x != 0
f(x,i) = f(x-arr[i],i-1) OR f(x,i-1)
Building the matrix for a target x also fills in the entries for every value lower than x, so if you start from sum(array)/2 you get all the values you need.
After you generate the DP matrix, just find the maximal value of x <= sum(array)/2 such that f(x,n) = true; this is the best partition you can get.
The complexity in this case is O(Wn).
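Since the question asks for the actual partitions and not just the difference, here is a sketch of the subset-sum DP with reconstruction. It assumes non-negative integers (negative values would need an offset that this sketch does not handle), uses a one-dimensional table instead of the full matrix, and the name `partition_min_diff` is invented for the example.

```python
def partition_min_diff(nums):
    """Split non-negative integers into two lists with minimal sum difference."""
    total = sum(nums)
    half = total // 2
    # first_item[x]: index of the item that first made sum x reachable,
    # -1 for the empty sum, None while x is still unreachable
    first_item = [None] * (half + 1)
    first_item[0] = -1
    for i, w in enumerate(nums):
        # x runs downwards so that item i is used at most once
        for x in range(half, w - 1, -1):
            if first_item[x] is None and first_item[x - w] is not None:
                first_item[x] = i
    best = max(x for x in range(half + 1) if first_item[x] is not None)
    # Walk back from the best reachable sum to recover the chosen items
    in_a = set()
    x = best
    while x > 0:
        i = first_item[x]
        in_a.add(i)
        x -= nums[i]
    a = [w for i, w in enumerate(nums) if i in in_a]
    b = [w for i, w in enumerate(nums) if i not in in_a]
    return a, b

a, b = partition_min_diff([1, 6, 11, 5])
print(a, b, abs(sum(a) - sum(b)))
```

The table has about W/2 entries and each of the n items scans it once, which matches the O(Wn) bound stated above.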
You can phrase this as a 0/1 integer linear programming optimization problem. Let wi be the ith number, and let xi be a 0/1 variable which indicates whether wi is in the first set or not. Then you want to minimize sum(xi wi) - sum((1 - xi) wi) subject to
sum(xi wi) >= sum((1 - xi) wi)
and also subject to all xi being 0 or 1. There has been a lot of research into optimizing 0/1 linear programming solvers. For large total sum W this may be an improvement over the O(W n) pseudo-polynomial time algorithm presented because the W factor is scary.
My first thought is to:
Sort list of integers
Create two empty lists A and B
While iterating from biggest integer to smallest integer...add next integer to the list with the smallest current sum.
This is, of course, not guaranteed to give you the best result but you can bound the result it will give you by the size of the biggest integer in your list
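The greedy idea above can be sketched in a few lines. Ties are broken in favour of the first list here, which is an arbitrary choice made for this illustration; `greedy_partition` is a hypothetical name.

```python
def greedy_partition(nums):
    """Assign each number, largest first, to the list with the smaller sum."""
    a, b = [], []
    sum_a = sum_b = 0
    for w in sorted(nums, reverse=True):
        if sum_a <= sum_b:
            a.append(w)
            sum_a += w
        else:
            b.append(w)
            sum_b += w
    return a, b

a, b = greedy_partition([8, 7, 6, 5, 4])
print(a, b, abs(sum(a) - sum(b)))
```

On [8, 7, 6, 5, 4] the greedy split has sums 17 and 13, while the split {8, 7} and {6, 5, 4} reaches 15 and 15, so the result is not optimal, though the difference stays below the largest element.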

What is the probability that all priorities are unique for Permute-By-Sorting algorithm?

I hope someone can help me answer the following question. Thanks!
Here is a pseudo code of Permute-By-Sorting algorithm:
Permute-By-Sorting (A)
    n = A.length
    let P[1..n] be a new array
    for i = 1 to n
        P[i] = Random (1,n^3)
    sort A, using P as sort keys
In the above algorithm, the array P represents the priorities of the elements in array A. Line 4 chooses a random number between 1 and n^3.
The question is: what is the probability that all priorities in P are unique, and how do I compute it?
To reconcile the answers already given: for choice i = 0, ..., n - 1, given that no duplicates have been chosen yet, there are n^3 - i non-duplicate choices of n^3 total for the ith value. Thus the probability is the product for i = 0, ..., n - 1 of (1 - i/n^3).
sdcwc is using a union bound to lowerbound this probability by 1 - O(1/n). This estimate turns out to be basically right. The proof sketch is that (1 - i/n^3) is exp(-i/n^3 + O(i^2/n^6)), so the product is exp(-O(n^2)/n^3 + O(n^-3)), which is greater than or equal to 1 - O(n^2)/n^3 + O(n^-3) = 1 - O(1/n). I'm sure the fine folks on math.SE would be happy to do this derivation "properly" for you.
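To see how tight the 1 - O(1/n) estimate is, the product can be evaluated exactly for a few values of n. This is only a numerical illustration using Python's `fractions` module; the function name `prob_all_unique` is made up.

```python
from fractions import Fraction

def prob_all_unique(n):
    """Exact probability that n draws from 1..n^3 are pairwise distinct."""
    p = Fraction(1)
    for i in range(n):
        p *= Fraction(n**3 - i, n**3)   # the factor (1 - i/n^3)
    return p

for n in (2, 10, 100):
    exact = prob_all_unique(n)
    # Compare with the weaker lower bound 1 - 1/n
    print(n, float(exact), float(1 - Fraction(1, n)))
```

For n = 2 the product is (1)(1 - 1/8) = 7/8.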
Others have given you the probability calculation, but I think you may be asking the wrong question.
I assume the reason you're asking about the probability of the priorities being unique, and the reason for choosing n^3 in the first place, is because you're hoping they will be unique, and choosing a large range relative to n seems to be a reasonable way of achieving uniqueness.
It is much easier to ensure that the values are unique. Simply populate the array of priorities with the numbers 1 .. n and then shuffle them with the Fisher-Yates algorithm (aka algorithm P from The Art of Computer Programming, volume 2, Seminumerical Algorithms, by Donald Knuth).
The sort would then be carried out with known unique priority values.
(There are also other ways of going about getting a random permutation. It is possible to generate the nth lexicographic permutation of a sequence using factoradic numbers (or, the factorial number system), and so generate the permutation for a randomly chosen value in [1 .. n!].)
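A sketch of that suggestion: fill the priorities with 1..n, shuffle them with Fisher-Yates, then sort by them. The function names are invented for this example, and `random.randint` is assumed to be an adequate source of randomness.

```python
import random

def random_priorities(n, rng=random):
    """Return the numbers 1..n in uniformly random order (Fisher-Yates)."""
    p = list(range(1, n + 1))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)          # randint is inclusive on both ends
        p[i], p[j] = p[j], p[i]
    return p

def permute_by_sorting(a, rng=random):
    """Reorder a using distinct random priorities as sort keys."""
    priorities = random_priorities(len(a), rng)
    pairs = sorted(zip(priorities, a), key=lambda pair: pair[0])
    return [x for _, x in pairs]

print(permute_by_sorting(["a", "b", "c", "d"]))
```

Once the priorities are a shuffled 1..n, the shuffle has already done the random work, so in practice one could shuffle A directly and skip the sort.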
You are choosing n numbers from 1...n^3 and asking what is the probability that they are all unique.
There are (n^3) P n = (n^3)!/(n^3-n)! ways to choose the n numbers uniquely, and (n^3)^n ways to choose the n-numbers total.
So the probability of the numbers being unique is just the first equation divided by the second, which gives
       (n^3)!
--------------------
(n^3-n)! * (n^3)^n
Let Aij be the event that the i-th and j-th elements collide. Obviously P(Aij) = 1/n^3.
There are at most n^2 pairs, so by the union bound the probability of at least one collision is at most n^2/n^3 = 1/n.
If you are interested in exact thing, see BlueRaja's answer, but in randomized algorithms it is usually enough to give this type of bound.
So the sort part is irrelevant.
Assuming "Random" is truly random, the probability is just
     (n^3)!
----------------
(n^3-n)!n^(3n)

Check if array B is a permutation of A

I tried to find a solution to this but couldn't get much out of my head.
We are given two unsorted integer arrays A and B. We have to check whether array B is a permutation of A. How can this be done? Even XORing the numbers won't work, as there are counterexamples that have the same XOR value but are not permutations of each other.
A solution needs to be O(n) time and with space O(1)
Any help is welcome!!
Thanks.
The question is theoretical, but you can do it in O(n) time and O(1) space. Allocate an array of 2^32 counters (assuming 32-bit integers) and set them all to zero. This is an O(1) step because the array has constant size. Then iterate through the two arrays. For array A, increment the counters corresponding to the integers read. For array B, decrement them. If you run into a negative counter value during iteration of array B, stop: the arrays are not permutations of each other. Otherwise at the end (assuming A and B have the same size, a prerequisite) the counter array is all zero and the two arrays are permutations of each other.
This is an O(1) space and O(n) time solution. It is not practical, but it would easily pass as a solution to the interview question. At least it should.
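To make the idea concrete without allocating 2^32 counters, here is a scaled-down sketch that assumes every value lies in 0..255. The `value_range` parameter and the function name are assumptions for this illustration only.

```python
def is_permutation_bounded(a, b, value_range=256):
    """Check whether b is a permutation of a, for values in 0..value_range-1."""
    if len(a) != len(b):
        return False
    counters = [0] * value_range      # fixed size, independent of len(a)
    for v in a:
        counters[v] += 1
    for v in b:
        counters[v] -= 1
        if counters[v] < 0:           # b has more copies of v than a does
            return False
    return True

print(is_permutation_bounded([1, 2, 2, 3], [2, 3, 1, 2]))
print(is_permutation_bounded([1, 2, 2], [1, 1, 2]))
```

Because the lengths are equal, no counter going negative implies that every counter ends at zero.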
More obscure solutions
Using a nondeterministic model of computation, checking that the two arrays are not permutations of each other can be done in O(1) space, O(n) time by guessing an element that has a differing count in the two arrays, and then counting the instances of that element in both arrays.
In a randomized model of computation, construct a random commutative hash function and calculate the hash values for the two arrays. If the hash values differ, the arrays are not permutations of each other. Otherwise they might be. Repeat many times to bring the probability of error below the desired threshold. This is also an O(1) space, O(n) time approach, but randomized.
In a parallel computation model, let 'n' be the size of the input array. Allocate 'n' threads. Every thread i = 1 .. n reads the ith number from the first array; let that be x. Then the same thread counts the number of occurrences of x in the first array, and then checks for the same count in the second array. Every single thread uses O(1) space and O(n) time.
Interpret an integer array [ a1, ..., an ] as the polynomial x^a1 + x^a2 + ... + x^an, where x is a free variable, and then check numerically for the equivalence of the two polynomials obtained. Use floating-point arithmetic for O(1) space and O(n) time operation. This is not an exact method because of rounding errors and because numerical checking for equivalence is probabilistic. Alternatively, interpret the polynomial over the integers modulo a prime number, and perform the same probabilistic check.
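The modulo-a-prime variant of the polynomial check might look like the sketch below. It assumes non-negative integer elements, picks the Mersenne prime 2^61 - 1 as the modulus purely for convenience, and uses made-up names (`fingerprint`, `probably_permutation`, `trials`).

```python
import random

P = (1 << 61) - 1   # the Mersenne prime 2^61 - 1, an arbitrary choice of modulus

def fingerprint(arr, x, p=P):
    """Evaluate x^a1 + x^a2 + ... + x^an modulo p."""
    return sum(pow(x, a, p) for a in arr) % p

def probably_permutation(a, b, trials=5):
    """False means definitely not permutations; True means probably."""
    if len(a) != len(b):
        return False
    for _ in range(trials):
        x = random.randrange(2, P)
        if fingerprint(a, x) != fingerprint(b, x):
            return False
    return True

print(probably_permutation([5, 1, 3], [3, 5, 1]))
```

Each evaluation keeps only a running sum, so the extra space is constant, although each modular power costs O(log a) multiplications rather than O(1).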
If we are allowed to freely access a large list of primes, you can solve this problem by leveraging properties of prime factorization.
For both arrays, calculate the product of Prime[i] for each integer i, where Prime[i] is the ith prime number. The value of the products of the arrays are equal iff they are permutations of one another.
Prime factorization helps here for two reasons.
Multiplication is commutative and associative, so the ordering of the operands used to calculate the product is irrelevant. (Some alluded to the fact that if the arrays were sorted, this problem would be trivial. By multiplying, we are implicitly sorting.)
Prime numbers multiply losslessly. If we are given a number and told it is the product of only prime numbers, we can calculate exactly which prime numbers were fed into it and exactly how many.
Example:
a = 1,1,3,4
b = 4,1,3,1
Product of ith primes in a = 2 * 2 * 5 * 7 = 140
Product of ith primes in b = 7 * 2 * 5 * 2 = 140
That said, we probably aren't allowed access to a list of primes, but this seems a good solution otherwise, so I thought I'd post it.
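For completeness, a small sketch of the prime-product check that generates the primes it needs by trial division. It assumes the array holds positive integers, and the helper names are invented. Python integers grow without bound, so the product itself does not stay within O(1) space for long arrays.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

def prime_product(arr):
    """Multiply Prime[i] over all i in arr, where Prime[1] = 2."""
    primes = first_primes(max(arr, default=0))
    product = 1
    for i in arr:
        product *= primes[i - 1]
    return product

a = [1, 1, 3, 4]
b = [4, 1, 3, 1]
print(prime_product(a), prime_product(b))
```

Both products come out as 140, matching the worked example above.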
I apologize for posting this as an answer as it should really be a comment on antti.huima's answer, but I don't have the reputation yet to comment.
The size of each counter seems to be O(log(n)) bits, as it depends on the number of instances of a given value in the input array.
For example, let the input array A be all 1's with a length of (2^32) + 1. Counting them requires a 33-bit counter (which, in practice, would double the size of the array, but let's stay with theory). Grow A to a length of (2^64) + 1 (still all 1 values) and you need 65 bits for each counter, and so on.
This is a very nit-picky argument, but these interview questions tend to be very nit-picky.
If we need not sort this in-place, then the following approach might work:
Create a HashMap with the array element as key and the number of occurrences as value. (To handle multiple occurrences of the same number)
Traverse array A.
Insert the array elements in the HashMap.
Next, traverse array B.
Search every element of B in the HashMap. If the corresponding value is 1, delete the entry. Else, decrement the value by 1.
If we are able to process the entire array B and the HashMap is empty at the end, success; otherwise, failure.
The HashMap needs space proportional to the number of distinct values in A (constant only if the value range is bounded), and you traverse each array only once.
Not sure if this is what you are looking for. Let me know if I have missed any constraint about space/time.
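The steps above translate almost directly into Python with a plain dict. This is a sketch, with one addition that the steps leave implicit: an element of B that is missing from the map fails immediately.

```python
def is_permutation_hashmap(a, b):
    """Check whether b is a permutation of a using occurrence counts."""
    counts = {}
    for v in a:                       # traverse A, counting occurrences
        counts[v] = counts.get(v, 0) + 1
    for v in b:                       # traverse B, consuming the counts
        if v not in counts:
            return False              # implicit in the steps: v is unmatched
        if counts[v] == 1:
            del counts[v]
        else:
            counts[v] -= 1
    return not counts                 # success only if the map is empty

print(is_permutation_hashmap([3, 1, 3, 2], [2, 3, 1, 3]))
```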
You're given two constraints: O(n) time, where n is the total length of A and B, and O(1) memory.
If two series A and B are permutations of each other, then there is also a series C that results from permuting either A or B. So the problem becomes permuting A and B into series C_A and C_B and comparing them.
One such permutation would be sorting. There are several sorting algorithms that work in place, so you can sort A and B in place. In the best case, Smoothsort sorts in O(n) time with O(1) memory; in the worst case it takes O(n log n) time with O(1) memory.
The per-element comparison then takes O(n), and since O(2*n) = O(n) in O notation, Smoothsort plus comparison gives an O(n) / O(1) check in the best case. In the worst case, however, it is O(n log n) / O(1).
The solution needs to be O(n) time and with space O(1).
This leaves out sorting and the space O(1) requirement is a hint that you probably should make a hash of the strings and compare them.
If you have access to a prime number list do as cheeken's solution.
Note: If the interviewer says you don't have access to a prime number list. Then generate the prime numbers and store them. This is O(1) because the Alphabet length is a constant.
Else here's my alternative idea. I will define the Alphabet as = {a,b,c,d,e} for simplicity.
The values for the letters are defined as:
a, b, c, d, e
1, 2, 4, 8, 16
note: if the interviewer says this is not allowed, then make a lookup table for the Alphabet, this takes O(1) space because the size of the Alphabet is a constant
Define a function which can find the distinct letters in a string.
// set bit value of char c in variable i and return result
distinct(char c, int i) : int
E.g. distinct('a', 0) returns 1
E.g. distinct('a', 1) returns 1
E.g. distinct('b', 1) returns 3
Thus if you iterate the string "aab" the distinct function should give 3 as the result
Define a function which can calculate the sum of the letters in a string.
// return sum of c and i
sum(char c, int i) : int
E.g. sum('a', 0) returns 1
E.g. sum('a', 1) returns 2
E.g. sum('b', 2) returns 4
Thus if you iterate the string "aab" the sum function should give 4 as the result
Define a function which can calculate the length of the letters in a string.
// return length of string s
length(string s) : int
E.g. length("aab") returns 3
Running the methods on two strings and comparing the results takes O(n) running time. Storing the hash values takes O(1) in space.
e.g.
distinct of "aab" => 3
distinct of "aba" => 3
sum of "aab" => 4
sum of "aba" => 4
length of "aab" => 3
length of "aba" => 3
Since all the values are equal for both strings, they must be a permutation of each other.
EDIT: The solution is not correct with the given alphabet values, as pointed out in the comments.
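One way to see the problem is a concrete collision. The strings below are an example constructed for this illustration (they are not taken from the comments): "abbbbc" and "aaabcc" share the same distinct, sum, and length values without being permutations of each other. The `signature` helper is a made-up name that bundles the three functions.

```python
VALUES = {"a": 1, "b": 2, "c": 4, "d": 8, "e": 16}

def signature(s):
    """Return the (distinct, sum, length) triple described above."""
    distinct = 0
    total = 0
    for c in s:
        distinct |= VALUES[c]   # set the bit for this letter
        total += VALUES[c]      # add the letter's value
    return distinct, total, len(s)

print(signature("abbbbc"), signature("aaabcc"))
print(sorted("abbbbc") == sorted("aaabcc"))
```

Both signatures are (7, 13, 6): replacing three b's (value 6) with two a's and one c (value 6) keeps the sum, the length, and the set of letters unchanged.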
You can convert one of the two arrays into an in-place hashtable. This will not be exactly O(N), but it will come close, in non-pathological cases.
Just use [number % N] as its desired index, or put it in the chain that starts there. If any element has to be displaced, it can be placed at the index where the offending element started. Rinse, wash, repeat.
UPDATE:
This is a similar (N=M) hash table. It used chaining, but it could be downgraded to open addressing.
I'd use a randomized algorithm that has a low chance of error.
The key is to use a universal hash function.
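    def hash_array(array, hash_item):
        # XOR is order-independent, so the element order does not matter
        cur = 0
        for item in array:
            cur ^= hash_item(item)
        return cur

    def are_perm(a1, a2):
        # pick_random_universal_hash_func() is left abstract here
        hash_item = pick_random_universal_hash_func()
        return hash_array(a1, hash_item) == hash_array(a2, hash_item)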
If the arrays are permutations, it will always be right. If they are different, the algorithm might incorrectly say that they are the same, but it will do so with very low probability. Further, you can get an exponential decrease in the chance of error with a linear amount of work by asking many are_perm() questions on the same input; if it ever says no, then they are definitely not permutations of each other.
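One caveat when arrays may contain duplicates: combining with XOR makes any value that appears twice cancel itself, so [1, 1] and [2, 2] both hash to 0 whatever hash function is picked. The sketch below swaps in a different, order-independent fingerprint, the product of (r - a) modulo a prime for a random r, which does not have that weakness. It is a substitute technique rather than the universal-hash scheme described above, it assumes values in 0..P-1, and all names in it are invented.

```python
import random

P = (1 << 61) - 1   # a prime modulus (2^61 - 1), chosen for convenience

def product_fingerprint(arr, r, p=P):
    """Evaluate (r - a1)(r - a2)...(r - an) modulo p."""
    result = 1
    for a in arr:
        result = result * (r - a) % p
    return result

def are_perm_probably(a1, a2, rounds=3):
    """False means definitely not permutations; True means probably."""
    if len(a1) != len(a2):
        return False
    for _ in range(rounds):
        r = random.randrange(P)
        if product_fingerprint(a1, r) != product_fingerprint(a2, r):
            return False
    return True

print(are_perm_probably([4, 4, 7], [7, 4, 4]))
```

Two different multisets of size n give two different degree-n polynomials in r, which can agree on at most n of the P possible values of r, so each round errs with probability at most n/P.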
I just found a counterexample, so the assumption below is incorrect.
I cannot prove it, but I thought this might be true.
Since all elements of the arrays are integers, suppose each array has 2 elements,
and we have
a1 + a2 = s
a1 * a2 = m
b1 + b2 = s
b1 * b2 = m
then {a1, a2} == {b1, b2}
If this is true, it is also true for arrays with n elements.
So we compare the sum and product of each array; if they are equal, one is a permutation
of the other.
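A short brute-force search shows why the idea breaks for three elements. The sketch below is an illustration, not the poster's own counterexample: it looks for two different triples with the same sum and product. One such pair is {2, 6, 6} and {3, 3, 8}, which both have sum 14 and product 72.

```python
from itertools import combinations_with_replacement
from math import prod

def find_sum_product_collision(limit):
    """Find two different triples from 1..limit with equal sum and product."""
    seen = {}
    for triple in combinations_with_replacement(range(1, limit + 1), 3):
        key = (sum(triple), prod(triple))
        if key in seen:
            return seen[key], triple    # two distinct multisets, same key
        seen[key] = triple
    return None

print(find_sum_product_collision(10))
```

For two elements the claim does hold, since a sum and a product determine the two roots of a quadratic; with three or more elements two equations no longer pin down the values.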
