Related
I have a number n and a set of numbers S ∈ [1..n]* with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but the number is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then pick one array element. This will run in O(n) and is not an option.
Another approach may be just generating random numbers ∈[1..n] and rejecting them if they are contained in S. This has no theoretical bound as any number could be sampled multiple times even if it is in the set. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n-s, call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of the elements of s <= k. This takes O(log |s|). Add this to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point we found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100] which is 6. Binary search finds that 1 element is at most 3, so we increment to 4. Now we compare to the next larger element of s which is 4 so increment to 5. Repeating this finds 5 in so we increment to 6. We check s once more, see that 6 isn't in it, so we stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. Binary search finds that 2 elements are at most 4, so we increment to 6. Now we compare to the next larger element of s which is 5 so increment to 7. We check s once more, see that the next number is > 7, so we stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most s times.
If s is not sorted then we can do the following. Create an array of bits of size s. Generate k. Parse s and do two things: 1) count the number of elements < k, call this r. At the same time, set the i'th bit to 1 if k+i is in s (0 indexed so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits is the array with an index <= the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, S(i), where i = [1 .. s] and s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash, H. First set j = 1. Then iterate over S(i), the elements of S. They do not need to be sorted. If S(i) > s, add the key-value pair (S(i), j) to the hash table, unless j ∈ S, in which case increment j until it is not. Finally, increment j.
To find a random value k, first generate a uniform random value in the range s + 1 to n, inclusive. If k is a key in H, then k = H(k). I.e., we do at most one hash lookup to insure k is not in S.
Python code to generate the hash:
def substitute(S):
H = dict()
j = 1
for s in S:
if s > len(S):
while j in S: j += 1
H[s] = j
j += 1
return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to insure the test for membership is O(1) and also move the len(S) loop invariant out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this should have complexity O(s).
The generation of a random value is simply:
def myrand(n, s, H):
k = random.randint(s + 1, n)
return (H[k] if k in H else k)
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
def rand_not_in(n, S):
k = random.randint(len(S) + 1, n);
if k not in S: return k
j = 1
for s in S:
if s > len(S):
while j in S: j += 1
if s == k: return j
j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even more up to s = n/2, this expected number is at most 2.
It would take s almost equal to n to make it infeasible in practice.
Multiply the expected time by log s if you use a tree-set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash-set. So the average time is O(1) or O(log s) depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special way, implicitly and concisely, I don't see how it can be avoided.
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
Find all numbers that are in a forbidden set and less or equal then n-s. Call it array A.
Find all numbers that are not in a forbidden set and greater then n-s. Call it array B. It may be done in O(s) if set is sorted.
Note that lengths of A and B are equal, and create mapping map[A[i]] = B[i]
Generate number t up to n-s. If there is map[t] return it, otherwise return t
It will work in O(s) insertions to a map + 1 lookup which is either O(s) in average or O(s log s)
The problem link is: http://www.spoj.com/problems/ORDERSET/en/
In this problem, you have to maintain a dynamic set of numbers which support the two fundamental operations
INSERT(S,x): if x is not in S, insert x into S
DELETE(S,x): if x is in S, delete x from S
and the two type of queries
K-TH(S) : return the k-th smallest element of S
COUNT(S,x): return the number of elements of S smaller than x
Input
Line 1: Q (1 ≤ Q ≤ 200000), the number of operations
In the next Q lines, the first token of each line is a character I, D, K or C meaning that the corresponding operation is INSERT, DELETE, K-TH or COUNT, respectively, following by a whitespace and an integer which is the parameter for that operation.
If the parameter is a value x, it is guaranteed that 0 ≤ |x| ≤ 109. If the parameter is an index k, it is guaranteed that 1 ≤ k ≤ 109.
Output
For each query, print the corresponding result in a single line. In particular, for the queries K-TH, if k is larger than the number of elements in S, print the word 'invalid'.
I thought of using a set here, since the insertion and deletion in a set can be done in logarithmic time. However, I am not sure if set is the ideal data structure for finding the k'th element and number of elements less than it. What other DS can I use to make it optimal. Thanks!
A set is an abstract data type, not a data structure. An hash table is a data structure that can be used to implements a set, as is a binary tree.
Speaking of binary tree, it seems well suited for your needs here.
If balanced : fast INSERT and DELETE
If you keep track of subnodes count : K-TH and COUNT should be very fast too
I have this formula:
index = (a * k) % M
which maps a number 'k', from an input set K of distinct numbers, into it's position in a hashtable. I was wondering how to write a non-brute force program that finds such 'M' and 'a' so that 'M' is minimal, and there are no collisions for the given set K.
If, instead of a numeric multiplication you could perform a logic computation (and / or /not), I think that the optimal solution (minimum value of M) would be as small as card(K) if you could get a function that related each value of K (once ordered) with its position in the set.
Theoretically, it must be possible to write a truth table for such a relation (bit a bit), and then simplify the minterms through a Karnaugh Table with a proper program. Depending on the desired number of bits, the computational complexity would be affordable... or not.
If a is co-prime to M then a * k = a * k' mod M if, and only if, k = k' mod M, so you might as well use a = 1, which is always co-prime to M. This also covers all the cases in which M is prime, because all the numbers except 0 are then co-prime to M.
If a and M are not co-prime, then they share a common factor, say b, so a = x * b and M = y * b. In this case anything multiplied by a will also be divisible by b mod M, and you might as well by working mod y, not mod M, so there is nothing to be gained by using an a not co-prime to M.
So for the problem you state, you could save some time by leaving a=1 and trying all possible values of M.
If you are e.g. using 32-bit integers and really calculating not (a * k) mod M but ((a * k) mod 2^32) mod M you might be able to find cases where values of a other than 1 do better than a=1 because of what happens in (a * k) mod 2^32.
I have a set of 2D points, and I want to be able to make the following query with arguments x_min and n: what are the n points with largest y which have x > x_min?
To rephrase in Ruby:
class PointsThing
def initialize(points)
#points = points
end
def query(x_min, n)
#points.select { |point| point.x > x_min }.sort_by { |point| point.y }.take(n)
end
end
Ideally, my class would also support an insert and delete operation.
I can't think of a data structure for this which would enable the query to run in less than O(|#points|) time. Does anyone know one?
Sort the points by x descending. For each point in order, insert it into a purely functional red-black tree ordered by y descending. Keep all of the intermediate trees in an array.
To look up a particular x_min, use binary search to find the intermediate tree where exactly the points with x > x_min have been inserted. Traverse this tree to find the first n points.
The preprocessing cost is O(p log p) in time and space, where p is the number of points. The query time is O(log p + n), where n is the number of points to be returned in the query.
If your data are not sorted, then you have no choice but to check every point since you cannot know if there exists another point for which y is greater than that of all other points and for which x > x_min. In short: you can't know if another point should be included if you don't check them all.
In that case, I would assume that it would be impossible to check in sublinear time as you ask for, since you have to check them all. Best case for searching all would be linear.
If your data are sorted, then your best case will be constant time (all n points are those with the greatest y), and worst case would be linear (all n points are those with least y). Average case would be closer to constant I think if your x and x_min are both roughly random within a specific range.
If you want this to scale (that is, you could have large values of n), you will want to keep your resultant set sorted as well since you will need to check new potential points against it and to drop the lowest value when you insert (if size > n). Using a tree, this can be log time.
So, to do the entire thing, worst case is for unsorted points, in which case you're looking at nlog(n) time. Sorted points is better, in which case you're looking at average case of log(n) time (again, assuming roughly randomly distributed values for x and x_min), which yes is sub-linear.
In case it isn't at first obvious why sorted points will have have constant time to search through, I will go over that here quickly.
If the n points with the greatest y values all had x > x_min (the best case) then you are just grabbing what you need off the top, so that case is obvious.
For the average case, assuming roughly randomly distributed x and x_min, the odds that x > x_min are basically half. For any two random numbers a and b, a > b is just as likely to be true as b > a. This is the same thing with x and x_min; x > x_min is equally as likely to be true as x_min > x, meaning 0.5 probability. This means that, for your points, on average every second point checked will meet your x > x_min requirement, so on average you will check 2n points to find the n highest points that meet your criteria. So the best case was c time, average is 2c which is still constant.
Note, however, that for values of n approaching the size of the set this hides the fact that you are going through the entire set, essentially bringing it right back up to linear time. So my assertion that it is constant time does not hold true if you assume random values of n within the range of the size of your set.
If this is not a purely academic question and is prompted by some actual need, then it depends on the situation.
(edit)
I just realized that my constant-time assertions were assuming a data structure where you have direct access to the highest value and can go sequentially to lower values. If the data structure that those are provided to you in does not fit that description, then obviously that will not be the case.
Some precomputation would help in this case.
First partition the set of points taking x_min as pivot element.
Then for set of points lying on right side of x_min build a max_heap based on y co-ordinates.
Now run your query as: Perform n extract_max operations on the built max_heap.
The running time of your query would be log X + log (X-1) + ..... log (X-(n-1))
log X: For the first extract max operation.
log X-1: For the second extract max operation and so on.
X: Size of original Max heap.
Even in the worst case when your n << X , The time taken would be O(n log X).
Notation
Let P be the set of points.
Let top_y ( n, x_min) describe the query to collect the n points from P with the largest y-coordinates among those with x-coordinate greater than or equal to `x_min' .
Let x_0 be the minimum of x coordinates in your point set. Partition the x axis to the right of x_0 into a set of left-hand closed, right-hand open intervals I_i by the set of x coordinates of the point set P such that min(I_i) is the i-th but smallest x coordinate from P. Define the coordinate rank r(x) of x as the index of the interval x is an element of or 0 if x < x_0.
Note that r(x) can be computed in O(log #({I_i})) using a binary search tree.
Simple Solution
Sort your point set by decreasing y-coordinates and save this array A in time O(#P log #P) and space O(#P).
Process each query top_y ( n, x_min ) by traversing this array in order, skipping over items A_i: A_i.x < x_0, counting all other entries until the counter reaches n or you are at the end of A. This processing takes O(n) time and O(1) space.
Note that this may already be sufficient: Queries top_y ( n_0, a_0 ); a_0 < min { p.x | p \in P }, n_0 = c * #P, c = const require step 1 anyway and for n << #P and 'infrequent' queries any further optimizations weren't worth the effort.
Observation
Consider the sequences s_i,s_(i+1)of points with x-coordinates greater than or equal tomin(I_i), min(I_(i+1)), ordered by decreasing y-coordinate.s_(i+1)is a strict subsequence ofs_i`.
If p_1 \in s_(i+1) and p_2.x >= p_1.x then p_2 \in s_(i+1).
Refined Solution
A refined data structure allows for O(n) + O(log #P) query processing time.
Annotate the array A from the simple solution with a 'successor dispatch' for precisely those elements A_i with A_(i+1).x < A_i.x; This dispatch data would consist of an array disp:[r(A_(i+1).x) + 1 .. r(A_i.x)] of A-indexes of the next element in A whose x-coordinate ranks at least as high as the index into disp. The given dispatch indices suffice for processing the query, since ...
... disp[j] = disp[r(A_(i+1).x) + 1] for each j <= r(A_(i+1).x).
... for any x_min with r(x_min) > r(A_i.x), the algorithm wouldn't be here
The proper index to access disp is r(x_min) which remains constant throughout a query and thus takes O(log #P) to compute once per query while the index selection itself is O(1) at each A element.
disp can be precomputed. No two disp entries across all disp arrays are identical (Proof skipped, but it's easy [;-)] to see given the construction). Therefore the construction of disp arrays can be performed stack-based in a single sweep through the point set sorted in A. As there are #P entries, the disp structure takes O(#P) space and O(#P) time to construct, being dominated by space and time requirements for y-sorting. So in a certain sense, this structure comes for free.
Time requirements for query top_y(n,x_min)
Computing r(x_min): O(log #P);
Passage through A: O(n);
I saw this post: Finding even numbers in an array and I was thinking about how you could do it without feedback. Here's what I mean.
Given an array of length n containing at most e even numbers and a
function isEven that returns true if the input is even and false
otherwise, write a function that prints all the even numbers in the
array using the fewest number of calls to isEven.
The answer on the post was to use a binary search, which is neat since it doesn't mean the array has to be in order. The number of times you have to check if a number is even is e log n instead if n because you do a binary search (log n) to find one even number each time (e times).
But that idea means that you divide the array in half, test for evenness, then decide which half to keep based on the result.
My question is whether or not you can beat n calls on a fixed testing scheme where you check all the numbers you want for evenness without knowing the outcome, and then figure out where the even numbers are after you've done all the tests based on the results. So I guess it's no-feedback or blind or some term like that.
I was thinking about this for a while and couldn't come up with anything. The binary search idea doesn't work at all with this constraint, but maybe something else does? Even getting down to n/2 calls instead of n (yes, I know they are the same big-O) would be good.
The technical term for "no-feedback or blind" is "non-adaptive". O(e log n) calls still suffice, but the algorithm is rather more involved.
Instead of testing the evenness of products, we're going to test the evenness of sums. Let E ≠ F be distinct subsets of {1, …, n}. If we have one array x1, …, xn with even numbers at positions E and another array y1, …, yn with even numbers at positions F, how many subsets J of {1, …, n} satisfy
(∑i in J xi) mod 2 ≠ (∑i in J yi) mod 2?
The answer is 2n-1. Let i be an index such that xi mod 2 ≠ yi mod 2. Let S be a subset of {1, …, i - 1, i + 1, … n}. Either J = S is a solution or J = S union {i} is a solution, but not both.
For every possible outcome E, we need to make calls that eliminate every other possible outcome F. Suppose we make 2e log n calls at random. For each pair E ≠ F, the probability that we still cannot distinguish E from F is (2n-1/2n)2e log n = n-2e, because there are 2n possible calls and only 2n-1 fail to distinguish. There are at most ne + 1 choices of E and thus at most (ne + 1)ne/2 pairs. By a union bound, the probability that there exists some indistinguishable pair is at most n-2e(ne + 1)ne/2 < 1 (assuming we're looking at an interesting case where e ≥ 1 and n ≥ 2), so there exists a sequence of 2e log n calls that does the job.
Note that, while I've used randomness to show that a good sequence of calls exists, the resulting algorithm is deterministic (and, of course, non-adaptive, because we chose that sequence without knowledge of the outcomes).
You can use the Chinese Remainder Theorem to do this. I'm going to change your notation a bit.
Suppose you have N numbers of which at most E are even. Choose a sequence of distinct prime powers q1,q2,...,qk such that their product is at least N^E, i.e.
qi = pi^ei
where pi is prime and ei > 0 is an integer and
q1 * q2 * ... * qk >= N^E
Now make a bunch of 0-1 matrices. Let Mi be the qi x N matrix where the entry in row r and column c has a 1 if c = r mod qi and a 0 otherwise. For example, if qi = 3^2, then row 2 has ones in columns 2, 11, 20, ... 2 + 9j and 0 elsewhere.
Now stack these matrices vertically to get a Q x N matrix M, where Q = q1 + q2 + ... + qk. The rows of M tell you which numbers to multiply together (the nonzero positions). This gives a total of Q products that you need to test for evenness. Call each row a "trial", and say that a "trial involves j" if the jth column of that row is nonempty. The theorem you need is the following:
THEOREM: The number in position j is even if and only if all trials involving j are even.
So you do a total of Q trials and then look at the results. If you choose the prime powers intelligently, then Q should be significantly smaller than N. There are asymptotic results that show you can always get Q on the order of
(2E log N)^2 / 2log(2E log N)
This theorem is actually a corollary of the Chinese Remainder Theorem. The only place that I've seen this used is in Combinatorial Group Testing. Apparently the problem originally arose when testing soldiers coming back from WWII for syphilis.
The problem you are facing is a form of group testing, type of a problem with the objective of reducing the cost of identifying certain elements of a set (up to d elements of a set of N elements).
As you've already stated, there are two basic principles via which the testing may be carried out:
Non-adaptive Group Testing, where all the tests to be performed are decided a priori.
Adaptive Group Testing, where we perform several tests, basing each test on the outcome of previous tests. Obviously, adaptive testing has a potential to reduce the cost, compared to non-adaptive testing.
Theoretical bounds for both principles have been studied, and are available in this Wiki article, or this paper.
For adaptive testing, the upper bound is O(d*log(N)) (as already described in this answer).
For non-adaptive testing, it can be shown that the upper bound is O(d*d/log(d)*log(N)), which is obviously larger than the upper bound for adaptive testing by a factor of d/log(d).
This upper bound for non-adaptive testing comes from an algorithm which uses disjunct matrices: matrices of dimension T x N ("number of tests" x "number of elements"), where each item can be either true (if an element was included in a test), or false (if it wasn't), with a property that any subset of d columns must differ from all other columns by at least a single row (test inclusion). This allows linear time of decoding (there are also "d-separable" matrices where fewer test are needed, but the time complexity for their decoding is exponential and not computationaly feasible).
Conclusion:
My question is whether or not you can beat n calls on a fixed testing scheme [...]
For such a scheme and a sufficiently large value of N, a disjunct matrix can be constructed which would have less than K * [d*d/log(d)*log(N)] rows. So, for large values of N, yes, you can beat it.
The underlying question (challenge) is kind of silly. If the binary search answer is acceptable (where it sums sub arrays and sends them to IsEven) then I can think of a way to do it with E or less calls to IsEven (assuming the numbers are integers of course).
JavaScript to demonstrate
// sort the array by only the first bit of the number
A.sort(function(x,y) { return (x & 1) - (y & 1); });
// all of the evens will be at the beginning
for(var i=0; i < E && i < A.length; i++) {
if(IsEven(A[i]))
Print(A[i]);
else
break;
}
Not exactly a solution, but just few thoughts.
It is easy to see that if a solution exists for array length n that takes less than n tests, then for any array length m > n it is easy to see that there is always a solution with less than m tests. So, if you have a solution for n = 2 or 3 or 4, then the problem is solved.
You can split the array into pairs of numbers and for each pair: if the sum is odd, then exactly one of them is even, otherwise if one of the numbers is even, then both of them are even. This way for each pair it takes either one or two tests. Best case:n/2 tests, worse case:n tests, if even and odd numbers are chosen with equal probability, then: 3n/4 tests.
My hunch is there is no solution with less than n tests. Not sure how to prove it.
UPDATE: The second solution can be extended in the following way.
Check if the sum of two numbers is even. If odd, then exactly one of them is even. Otherwise label the set as "homogeneous set of size 2". Take two "homogenous set"s of same size n. Pick one number from each set and check if their sum is even. If it is even, combine these two sets to a "homogeneous set of size 2n". Otherwise, it implies that one of those sets purely consists of even numbers and the other one purely odd numbers.
Best case:n/2 tests. Average case: 3*n/2. Worst case is still n. Worst case exists only when all the numbers are even or all the numbers are odd.
If we can add and multiply array elements, then we can compute every Boolean function (up to complementation) on the low-order bits. Simulate a circuit that encodes the positions of the even numbers as a number from 0 to nC0 + nC1 + ... + nCe - 1 represented in binary and use calls to isEven to read off the bits.
Number of calls used: within 1 of the information-theoretic optimum.
See also fully homomorphic encryption.