Algorithm to determine indices i..j of array A containing all the elements of another array B - algorithm

I came across this question on an interview questions thread. Here is the question:
Given two integer arrays A [1..n] and
B[1..m], find the smallest window
in A that contains all elements of
B. In other words, find a pair < i , j >
such that A[i..j] contains B[1..m].
If A doesn't contain all the elements of
B, then i,j can be returned as -1.
The integers in A need not be in the same order as they are in B. If there is more than one smallest window (different windows, but of the same size), it's enough to return one of them.
Example: A[1,2,5,11,2,6,8,24,101,17,8] and B[5,2,11,8,17]. The algorithm should return i = 2 (index of 5 in A) and j = 9 (index of 17 in A).
Now I can think of two variations.
Let's suppose that B has duplicates.
Variation 1: This variation doesn't consider the number of times each element occurs in B. It just checks for all the unique elements that occur in B and finds the smallest corresponding window in A that satisfies the above problem. For example, if A[1,2,4,5,7] and B[2,2,5], this variation doesn't bother about there being two 2's in B and just checks A for the unique integers in B, namely 2 and 5, and hence returns i=1, j=3.
Variation 2: This variation accounts for duplicates in B. If there are two 2's in B, then it expects to see at least two 2's in A as well. If not, it returns -1,-1.
When you answer, please do let me know which variation you are answering. Pseudocode should do. Please mention space and time complexity if it is tricky to calculate it. Mention if your solution assumes array indices to start at 1 or 0 too.
Thanks in advance.

Complexity
Time: O((m+n)log m)
Space: O(m)
The following is provably optimal up to a logarithmic factor. (I believe the log factor cannot be got rid of, and so it's optimal.)
Variant 1 is just a special case of variant 2 with all the multiplicities being 1, after removing duplicates from B. So it's enough to handle the latter variant; if you want variant 1, just remove duplicates in O(m log m) time. In the following, let m denote the number of distinct elements in B. We assume m < n, because otherwise we can just return -1, in constant time.
For each index i in A, we will find the smallest index s[i] such that A[i..s[i]] contains B[1..m], with the right multiplicities. The crucial observation is that s[i] is non-decreasing, and this is what allows us to do it in amortised linear time.
Start with i=j=1. We will keep a tuple (c[1], c[2], ... c[m]) of the number of times each element of B occurs, in the current window A[i..j]. We will also keep a set S of indices (a subset of 1..m) for which the count is "right" (i.e., k for which c[k]=1 in variant 1, or c[k] = <the right number> in variant 2).
So, for i=1, starting with j=1, increment each c[A[j]] (if A[j] was an element of B), check if c[A[j]] is now "right", and add or remove j from S accordingly. Stop when S has size m. You've now found s[1], in at most O(n log m) time. (There are O(n) j's, and each set operation took O(log m) time.)
Now for computing successive s[i]s, do the following. Increment i, decrement c[A[i]], update S accordingly, and, if necessary, increment j until S has size m again. That gives you s[i] for each i. At the end, report the (i,s[i]) for which s[i]-i was smallest.
Note that although it seems that you might be performing up to O(n) steps (incrementing j) for each i, the second pointer j only moves to the right: so the total number of times you can increment j is at most n. (This is amortised analysis.) Each time you increment j, you might perform a set operation that takes O(log m) time, so the total time is O(n log m). The space required was for keeping the tuple of counts, the set of elements of B, the set S, and some constant number of other variables, so O(m) in all.
There is an obvious O(m+n) lower bound, because you need to examine all the elements. So the only question is whether we can prove the log factor is necessary; I believe it is.
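A minimal Python sketch of this two-pointer scheme, for variant 2 with 0-based indices; it uses hash-map counters in place of the set S, so each per-element operation is expected O(1) rather than O(log m):

from collections import Counter

def smallest_window(A, B):
    need = Counter(B)                  # required multiplicity of each distinct element of B
    missing = len(need)                # distinct elements whose count is not yet "right"
    have = Counter()
    best = (-1, -1)
    j = 0
    for i in range(len(A)):
        # advance j until A[i..j-1] contains B with the right multiplicities
        while missing and j < len(A):
            x = A[j]
            if x in need:
                have[x] += 1
                if have[x] == need[x]:
                    missing -= 1
            j += 1
        if missing:
            break                      # no window starting at i (or later) exists
        if best == (-1, -1) or (j - 1 - i) < (best[1] - best[0]):
            best = (i, j - 1)
        # drop A[i] before moving the left end to i + 1
        if A[i] in need:
            if have[A[i]] == need[A[i]]:
                missing += 1
            have[A[i]] -= 1
    return best

A = [1, 2, 5, 11, 2, 6, 8, 24, 101, 17, 8]
B = [5, 2, 11, 8, 17]
print(smallest_window(A, B))           # (2, 9) with 0-based indices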

Here is the solution I thought of (but it's not very neat).
I am going to illustrate it using the example in the question.
Let A[1,2,5,11,2,6,8,24,101,17,8] and B[5,2,11,8,17]
Sort B. (So B = [2,5,8,11,17]). This step takes O(m log m).
Allocate an array C of the same size as A. Iterate through the elements of A, binary search for each one in the sorted B; if it is found, enter its "index in sorted B + 1" in C. If it's not found, enter -1. After this step,
A = [1 , 2, 5, 11, 2, 6, 8, 24, 101, 17, 8] (no changes, quoting for ease).
C = [-1, 1, 2, 4 , 1, -1, 3, -1, -1, 5, 3]
Time: O(n log m), Space: O(n).
Find the smallest window in C that has all the numbers from 1 to m. For finding the window, I can think of two general directions:
a. A bit-oriented approach wherein I set the bit corresponding to each position and finally check by some kind of ANDing.
b. Create another array D of size m, go through C and when I encounter p in C, increment D[p]. Use this for finding the window.
Please leave comments regarding the general approach as such, as well as for 3a and 3b.
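A small Python sketch of steps 1-2 (variant 1, deduplicating B; 0-based array positions) that reproduces the C array above:

from bisect import bisect_left

def map_to_ranks(A, B):
    sortedB = sorted(set(B))                    # variant 1: duplicates in B removed
    C = []
    for a in A:
        k = bisect_left(sortedB, a)
        C.append(k + 1 if k < len(sortedB) and sortedB[k] == a else -1)
    return C

A = [1, 2, 5, 11, 2, 6, 8, 24, 101, 17, 8]
B = [5, 2, 11, 8, 17]
print(map_to_ranks(A, B))   # [-1, 1, 2, 4, 1, -1, 3, -1, -1, 5, 3]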

My solution:
a. Create a hash table H with m keys, one for each value in B. Each key in H maps to a dynamic array of sorted indices: the indices in A whose values equal that key. This takes O(n) time. We go through each index j in A; if key A[j] exists in H (an O(1) check), we append the index j to the list of indices that H[A[j]] maps to.
At this point we have 'binned' n elements into m bins. However, total storage is just O(n).
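As a small sketch of step (a) in Python, using the example arrays from the edit further below (1-based positions, matching the question):

from collections import defaultdict

A = [5, 1, 1, 5, 6, 1, 1, 5]
B = [5, 6]

H = defaultdict(list)           # value in B  ->  sorted list of its positions in A
wanted = set(B)
for j, a in enumerate(A, start=1):
    if a in wanted:
        H[a].append(j)
print(dict(H))                  # {5: [1, 4, 8], 6: [5]}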
b. The 2nd part of the algorithm involves maintaining a ‘left’ index and a ‘right’ index for each list in H. Let's create two arrays of size m called L and R that contain these values. Initially, each entry of L points to the first index of its list and each entry of R points to the last (see the worked example in the edit below).
We also keep track of the “best” minimum window.
We then iterate over the following actions on L and R which are inherently greedy:
i. In each iteration, we compute the minimum and maximum values in L and R.
For L, Lmax - Lmin is the window and for R, Rmax - Rmin is the window. We update the best window if one of these windows is better than the current best window. We use a min heap to keep track of the minimum element in L and a max heap to keep track of the largest element in R. These take O(m*log(m)) time to build.
ii. From a ‘greedy’ perspective, we want to take the action that will minimize the window size in each L and R. For L it intuitively makes sense to increment the minimum index, and for R, it makes sense to decrement the maximum index.
We want to increment the array position for the minimum value until it is larger than the 2nd smallest element in L, and similarly, we want to decrement the array position for the largest value in R until it is smaller than the 2nd largest element in R.
Next, we make a key observation:
If L[i] is the minimum value in L and R[i] is less than the 2nd smallest element in L, i.e., if R[i] would still be the minimum value in L if L[i] were replaced with R[i], then we are done: we now have the “best” index in list i that can contribute to the minimum window. Also, all the other elements in R cannot contribute to the best window since their L values are all larger than L[i]. Similarly, if R[j] is the maximum element in R and L[j] is greater than the 2nd largest value in R, we are also done by setting R[j] = L[j]. Any other index in list j to the left of L[j] has already been accounted for, as have all indices to the right of R[j], and all indices between L[j] and R[j] will perform worse than L[j].
Otherwise, we simply increment the array position L[i] until it is larger than the 2nd smallest element in L, and decrement the array position R[j] (where R[j] is the max in R) until it is smaller than the 2nd largest element in R. We compute the windows and update the best window if one of the L or R windows is smaller than the best window. We can do a Fibonacci search to do the increment/decrement optimally: keep incrementing L[i] using Fibonacci increments until we are larger than the 2nd smallest element in L, then perform a binary search to get the smallest L[i] that is larger than the 2nd smallest element in L; similarly for R. After the increment/decrement, we pop the largest element from the max heap for R and the minimum element from the min heap for L, and insert the new values of L[i] and R[j] into the heaps. This is an O(log(m)) operation.
Step ii. would terminate when Lmin can’t move any more to the right or Rmax can’t move any more to the left (as the R/L values are the same). Note that we can have scenarios in which L[i] = R[i] but if it is not the minimum element in L or the maximum element in R, the algorithm would still continue.
Runtime analysis:
a. Creation of the hash table takes O(n) time and O(n) space.
b. Creation of heaps: O(m*log(m)) time and O(m) space.
c. The greedy iterative algorithm is a little harder to analyze. Its runtime is really bounded by the distribution of elements. Worst case, we cover all the elements in each array in the hash table. For each element, we perform an O(log(m)) heap update.
Worst case runtime is hence O(n*log(m)) for the iterative greedy algorithm. In the best case, we discover very fast that L[i] = R[i] for the minimum element in L or the maximum element in R; the run time is O(log(m)) for the greedy algorithm!
Average case seems really hard to analyze. What is the average “convergence” of this algorithm to the minimum window. If we were to assume that the Fibonacci increments / binary search were to help, we could say we only look at m*log(n/m) elements (every list has n/m elements) in the average case. In that case, the running time of the greedy algorithm would be m*log(n/m)*log(m).
Total running time
Best case: O(n + m*log(m) + log(m)) time = O(n) assuming m << n
Average case: O(n + m*log(m) + m*log(n/m)*log(m)) time = O(n) assuming m << n.
Worst case: O(n + n*log(m) + m*log(m)) = O(n*log(m)) assuming m << n.
Space: O(n + m) (hashtable and heaps) always.
Edit: Here is a worked out example:
A[5, 1, 1, 5, 6, 1, 1, 5]
B[5, 6]
H:
{
5 => {1, 4, 8}
6 => {5}
}
Greedy Algorithm:
L => {1, 1}
R => {3, 1}
Iteration 1:
a. Lmin = 1 (since H{5}[1] < H{6}[1]), Lmax = 5. Window: 5 - 1 + 1 = 5
Increment Lmin pointer, it now becomes 2.
L => {2, 1}
Rmin = H{6}[1] = 5, Rmax = H{5}[3] = 8. Window = 8 - 5 + 1 = 4. Best window so far = 4 (less than 5 computed above).
We also note the indices in A (5, 8) for the best window.
Decrement Rmax, it now becomes 2 and the value is 4.
R => {2, 1}
b. Now, Lmin = 4 (H{5}[2]) and the index i in L is 1. Lmax = 5 (H{6}[1]) and the index in L is 2.
We can't increment Lmin since L[1] = R[1] = 2. Thus we just compute the window now.
The window = Lmax - Lmin + 1 = 2 which is the best window so far.
Thus, the best window in A = (4, 5).

#include <map>
#include <algorithm>

struct Pair {
    int i;
    int j;
};

// helpers: smallest / largest index currently stored as a value in the map
static int findSmallestVal(const std::map<int, int> &m)
{
    int best = m.begin()->second;
    for (const auto &kv : m) best = std::min(best, kv.second);
    return best;
}

static int findLargestVal(const std::map<int, int> &m)
{
    int best = m.begin()->second;
    for (const auto &kv : m) best = std::max(best, kv.second);
    return best;
}

Pair
find_smallest_subarray_window(int *A, size_t n, int *B, size_t m)
{
    Pair p;
    p.i = -1;
    p.j = -1;
    // key is array value, value is array index
    std::map<int, int> map;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < m; ++j) {
            if (A[i] == B[j]) {
                if (map.find(A[i]) == map.end()) {
                    map.insert(std::pair<int, int>(A[i], (int)i));
                    ++count;                      // one more distinct B value covered
                } else {
                    // replace the stored index only if it does not widen the window
                    int start = findSmallestVal(map);
                    int end = findLargestVal(map);
                    int oldLength = end - start;
                    int oldIndex = map[A[i]];
                    map[A[i]] = (int)i;
                    int _start = findSmallestVal(map);
                    int _end = findLargestVal(map);
                    int newLength = _end - _start;
                    if (newLength > oldLength) {
                        // revert back
                        map[A[i]] = oldIndex;
                    }
                }
            }
        }
        if (count == m) {
            break;
        }
    }
    if (count == m) {
        p.i = findSmallestVal(map);
        p.j = findLargestVal(map);
    }
    return p;
}

Related

Count of divisors of numbers till N in O(N)?

So, we can count the divisors of each number from 1 to N with an O(N log N) sieve-style algorithm:
int n;
cin >> n;
vector<int> cnt(n + 1, 0);   // cnt[x] is the count of divisors of x
for (int i = 1; i <= n; i++) {
    for (int j = i; j <= n; j += i) {
        cnt[j]++;
    }
}
Is there way to reduce it to O(N)?
Thanks in advance.
Here is a simple optimization on #גלעד ברקן's solution. Rather than use sets, use arrays. This is about 10x as fast as the set version.
n = 100
answer = [None for i in range(0, n+1)]
answer[1] = 1
small_factors = [1]
p = 1
while (p < n):
    p = p + 1
    if answer[p] is None:
        print("\n\nPrime: " + str(p))
        limit = n / p
        new_small_factors = []
        for i in small_factors:
            j = i
            while j <= limit:
                new_small_factors.append(j)
                answer[j * p] = answer[j] + answer[i]
                j = j * p
        small_factors = new_small_factors
print("\n\nAnswer: " + str([(k, d) for k, d in enumerate(answer)]))
It is worth noting that this is also a O(n) algorithm for enumerating primes. However with the use of a wheel generated from all of the primes below size log(n)/2 it can create a prime list in time O(n/log(log(n))).
How about this? Start with the prime 2 and keep a list of tuples, (k, d_k), where d_k is the number of divisors of k, starting with (1,1):
for each prime, p (ascending and lower than or equal to n / 2):
    for each tuple (k, d_k) in the list:
        if k * p > n:
            remove the tuple from the list
            continue
        power = 1
        while p * k <= n:
            add the tuple to the list if k * p^power <= n / p
            k = k * p
            output (k, (power + 1) * d_k)
            power = power + 1
the next number the output has skipped is the next prime
(since clearly all numbers up to the next prime are
either smaller primes or composites of smaller primes)
The method above also generates the primes, relying on O(n) memory to keep finding the next prime. Having a more efficient, independent stream of primes could allow us to avoid appending any tuples (k, d_k) to the list, where k * next_prime > n, as well as free up all memory holding output greater than n / next_prime.
Python code
Consider the total of those counts, sum(d(i) for i = 1..N), where d(i) denotes the number of divisors of i. That sum is O(N log N), so any O(N) solution would have to bypass individual counting.
This suggests that any improvement would need to depend on prior results (dynamic programming). We already know that d(i) is the product of (each prime's exponent plus one). For instance, 12 = 2^2 * 3^1. The exponents are 2 and 1, respectively; (2+1)*(1+1) = 6, and 12 has 6 divisors: 1, 2, 3, 4, 6, 12.
This "reduces" the question to whether you can leverage the prior knowledge to get an O(1) way to compute the number of divisors directly, without having to count them individually.
Think about the given case ... divisor counts so far include:
1 1
2 2
3 2
4 3
6 4
Is there an O(1) way to get d(12) = 6 from these figures?
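Not an answer to the O(1) question, but here is the (exponent + 1) product rule from above made concrete with plain trial division, for comparison:

def divisor_count(n):
    d = 1
    p = 2
    while p * p <= n:
        e = 0
        while n % p == 0:   # strip out the full power of p
            n //= p
            e += 1
        d *= e + 1          # each prime contributes (exponent + 1)
        p += 1
    if n > 1:               # at most one prime factor larger than sqrt(n) remains
        d *= 2
    return d

print(divisor_count(12))    # 6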
Here is an algorithm that is theoretically better than O(n log(n)) but may be worse for reasonable n. I believe that its running time is O(n lg*(n)), where lg* is the iterated logarithm (https://en.wikipedia.org/wiki/Iterated_logarithm).
First of all you can find all primes up to n in time O(n) using the Sieve of Atkin. See https://en.wikipedia.org/wiki/Sieve_of_Atkin for details.
Now the idea is that we will build up our list of counts only inserting each count once. We'll go through the prime factors one by one, and insert values for everything with that as the maximum prime number. However in order to do that we need a data structure with the following properties:
We can store a value (specifically the count) at each value.
We can walk the list of inserted values forwards and backwards in O(1).
We can find the last inserted number below i "efficiently".
Insertion should be "efficient".
(Quotes are the parts that are hard to estimate.)
The first is trivial, each slot in our data structure needs a spot for the value. The second can be done with a doubly linked list. The third can be done with a clever variation on a skip-list. The fourth falls out from the first 3.
We can do this with an array of nodes (which do not start out initialized) with the following fields that look like a doubly linked list:
value The answer we are looking for.
prev The last previous value that we have an answer for.
next The next value that we have an answer for.
Now if i is in the list and j is the next value, the skip-list trick is that we will also fill in prev for the first even number after i, the first divisible by 4, the first divisible by 8, and so on until we reach j. So if i = 81 and j = 96 we would fill in prev for 82, 84, 88 and then 96.
Now suppose that we want to insert a value v at k between an existing i and j. How do we do it? I'll present pseudocode starting with only k known then fill it out for i = 81, j = 96 and k = 90.
k.value := v
for temp in searching down from k for increasing factors of 2:
    if temp has a value:
        our_prev := temp
        break
    else if temp has a prev:
        our_prev := temp.prev
        break
our_next := our_prev.next
our_prev.next := k
k.next := our_next
our_next.prev := k
for temp in searching up from k for increasing factors of 2:
    if j <= temp:
        break
    temp.prev := k
k.prev := our_prev
In our particular example we were willing to search downwards from 90 to 90, 88, 80, 64, 0. But we actually get told that prev is 81 when we get to 88. We would be willing to search up from 90 to 92, 96, 128, 256, ...; however, we just have to set 92.prev and 96.prev and we are done.
Now this is a complicated bit of code, but its performance is O(log(k-i) + log(j-k) + 1). Which means that it starts off as O(log(n)) but gets better as more values get filled in.
So how do we initialize this data structure? Well we initialize an array of uninitialized values then set 1.value := 0, 1.next := n+1, and 2.prev := 4.prev := 8.prev := 16.prev := ... := 1. And then we start processing our primes.
When we reach prime p we start by searching for the previous inserted value below n/p. Walking backwards from there we keep inserting values for x*p, x*p^2, ... until we hit our limit. (The reason for backwards is that we do not want to try to insert, say, 18 once for 3 and once for 9. Going backwards prevents that.)
Now what is our running time? Finding the primes is O(n). Finding the initial inserts is also easily O(n/log(n)) operations of time O(log(n)) for another O(n). Now what about the inserts of all of the values? That is trivially O(n log(n)) but can we do better?
Well first all of the inserts to density 1/log(n) filled in can be done in time O(n/log(n)) * O(log(n)) = O(n). And then all of the inserts to density 1/log(log(n)) can likewise be done in time O(n/log(log(n))) * O(log(log(n))) = O(n). And so on with increasing numbers of logs. The number of such factors that we get is O(lg*(n)) for the O(n lg*(n)) estimate that I gave.
I haven't shown that this estimate is as good as you can do, but I think that it is.
So, not O(n), but pretty darned close.

Maximum of all possible subarrays of an array

How do I find/store maximum/minimum of all possible non-empty sub-arrays of an array of length n?
I generated the segment tree of the array and then, for each possible sub-array, did a query into the segment tree, but that's not efficient. How do I do it in O(n)?
P.S. n <= 10^7
E.g. arr[] = {1, 2, 3}; // the array need not be sorted
sub-array min max
{1} 1 1
{2} 2 2
{3} 3 3
{1,2} 1 2
{2,3} 2 3
{1,2,3} 1 3
I don't think it is possible to store all those values in O(n). But it is pretty easy to create, in O(n), a structure that makes it possible to answer, in O(1), the query "how many subarrays are there where A[i] is the maximum element".
Naïve version:
Think about the naïve strategy: to know how many such subarrays there are for some A[i], you could employ a simple O(n) algorithm that counts how many elements to the left and to the right of A[i] are less than it. Let's say:
A = [... 10 1 1 1 5 1 1 10 ...]
This 5 has 3 elements to the left and 2 to the right that are less than it. From this we know there are 4*3=12 subarrays for which that very 5 is the maximum: 4*3 because we can extend 0..3 positions to the left and 0..2 to the right.
Optimized version:
This naïve version of the check would take O(n) operations for each element, so O(n^2) after all. Wouldn't it be nice if we could compute all these lengths in O(n) in a single pass?
Luckily there is a simple algorithm for that. Just use a stack. Traverse the array normally (from left to right). Put every element's index on the stack, but before putting it, remove all the indices whose values are less than the current value. The index left on top of the stack is then the nearest larger element to the left.
To find the same values at the right, just traverse the array backwards.
Here's a sample Python proof-of-concept that shows this algorithm in action. I implemented also the naïve version so we can cross-check the result from the optimized version:
from random import choice
from collections import defaultdict, deque
def make_bounds(A, fallback, arange, op):
stack = deque()
bound = [fallback] * len(A)
for i in arange:
while stack and op(A[stack[-1]], A[i]):
stack.pop()
if stack:
bound[i] = stack[-1]
stack.append(i)
return bound
def optimized_version(A):
T = zip(make_bounds(A, -1, xrange(len(A)), lambda x, y: x<=y),
make_bounds(A, len(A), reversed(xrange(len(A))), lambda x, y: x<y))
answer = defaultdict(lambda: 0)
for i, x in enumerate(A):
left, right = T[i]
answer[x] += (i-left) * (right-i)
return dict(answer)
def naive_version(A):
answer = defaultdict(lambda: 0)
for i, x in enumerate(A):
left = next((j for j in range(i-1, -1, -1) if A[j]>A[i]), -1)
right = next((j for j in range(i+1, len(A)) if A[j]>=A[i]), len(A))
answer[x] += (i-left) * (right-i)
return dict(answer)
A = [choice(xrange(32)) for i in xrange(8)]
MA1 = naive_version(A)
MA2 = optimized_version(A)
print 'Array: ', A
print 'Naive: ', MA1
print 'Optimized:', MA2
print 'OK: ', MA1 == MA2
I don't think it is possible to do it directly in O(n) time: you need to iterate over all the elements of the subarrays, and you have n of them. Unless the subarrays are sorted.
You could, on the other hand, when initialising the subarrays, build heaps instead of normal arrays: specifically min-heaps when you want to find the minimum and max-heaps when you want to find the maximum.
Building a heap is a linear time operation, and retrieving the maximum and minimum respectively for a max heap and min heap is a constant time operation, since those elements are found at the first place of the heap.
Heaps can be easily implemented just using a normal array.
Check this article on Wikipedia about binary heaps: https://en.wikipedia.org/wiki/Binary_heap.
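As a small illustration in Python (heapq only provides a min-heap, so a max-heap is usually simulated by negating the values):

import heapq

sub = [2, 3, 8, 7]                 # one sub-array
min_heap = list(sub)
heapq.heapify(min_heap)            # O(len(sub)) to build
max_heap = [-x for x in sub]
heapq.heapify(max_heap)            # O(len(sub)) to build
print(min_heap[0], -max_heap[0])   # minimum and maximum in O(1): 2 8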
I do not understand what exactly you mean by maximum of sub-arrays, so I will assume you are asking for one of the following
The subarray of maximum/minimum length or some other criteria (in which case the problem will reduce to finding max element in a 1 dimensional array)
The maximum elements of all your sub-arrays either in the context of one sub-array or in the context of the entire super-array
Problem 1 can be solved by simply iterating your super-array and storing a reference to the largest element, or by building a heap as nbro said. Problem 2 also has a similar solution. However, a linear scan through n arrays of length m is not going to be linear, so you will have to keep your class invariants such that the maximum/minimum is known after every operation, maybe with the help of some data structure like a heap.
Assuming you mean contiguous sub-arrays, create the array of partial sums where Y_0 = 0 and Y_i = X_1 + X_2 + ... + X_i, so from 1,4,2,3 create 0, 1, 1+4=5, 1+4+2=7, 1+4+2+3=10. You can create this from left to right in linear time, and the sum of any contiguous subarray is one partial sum subtracted from another, so 4+2+3 = 1+4+2+3 - 1 = 9.
Then scan through the partial sums from left to right, keeping track of the smallest value seen so far (including the initial zero). At each point subtract this from the current value and keep track of the highest value produced in this way. This should give you the value of the contiguous sub-array with largest sum, and you can keep index information, too, to find where this sub-array starts and ends.
To find the minimum, either change the above slightly or just reverse the sign of all the numbers and do exactly the same thing again: min(a, b) = -max(-a, -b)
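A short Python sketch of this scan over the partial sums; it returns the maximum subarray sum (keeping the indices of the best start and end is a straightforward extension):

def max_subarray_sum(xs):
    prefix = 0            # current partial sum
    min_prefix = 0        # smallest partial sum seen so far (includes the initial zero)
    best = xs[0]          # best sum found so far
    for x in xs:
        prefix += x
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

print(max_subarray_sum([1, 4, 2, 3]))   # 10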
I think the question you are asking is to find the maximum sum of a subarray.
Below is code that can do that in O(n) time.
#include <vector>
#include <algorithm>
using namespace std;

int maxSumSubArr(vector<int> a)
{
    // if every element is negative, the best subarray is the single largest element
    int maxsum = *max_element(a.begin(), a.end());
    if (maxsum < 0) return maxsum;
    int sum = 0;
    for (size_t i = 0; i < a.size(); i++)
    {
        sum += a[i];
        if (sum > maxsum) maxsum = sum;
        if (sum < 0) sum = 0;
    }
    return maxsum;
}
Note: this code is not tested; please comment if you find any issues.

Generate a random integer from 0 to N-1 which is not in the list

You are given N and an int K[].
The task at hand is to generate a uniformly random number between 0 and N-1 which doesn't exist in K.
N is strictly an integer >= 0.
And K.length is < N-1. And 0 <= K[i] <= N-1. Also assume K is sorted and each element of K is unique.
You are given a function uniformRand(int M) which generates a uniform random number in the range 0 to M-1, and assume this function's complexity is O(1).
Example:
N = 7
K = {0, 1, 5}
the function should return any random number { 2, 3, 4, 6 } with equal
probability.
I could get an O(N) solution for this: first generate a random number between 0 and N - K.length, then map the generated random number to a number not in K. The second step takes the complexity to O(N). Can it be done better, maybe in O(log N)?
You can use the fact that all the numbers in K[] are between 0 and N-1 and they are distinct.
For your example case, you generate a random number from 0 to 3. Say you get a random number r. Now you conduct binary search on the array K[].
Initialize i = K.length/2.
Find K[i] - i. This will give you the number of numbers missing from the array in the range 0 to K[i].
For example K[2] = 5. So 3 elements are missing from K[0] to K[2] (2,3,4)
Hence you can decide whether you have to conduct the remaining search in the first part of array K or the next part. This is because you know r.
This search will give you a complexity of log(K.length)
EDIT: For example,
N = 7
K = {0, 1, 4} // modified the array to clarify the algorithm steps.
the function should return any random number { 2, 3, 5, 6 } with equal probability.
Random number generated between 0 and N-K.length = random{0-3}. Say we get 3. Hence we require the 4th missing number in array K.
Conduct binary search on array K[].
Initial i = K.length/2 = 1.
Now we see K[1] - 1 = 0. Hence no number is missing upto i = 1. Hence we search on the latter part of the array.
Now i = 2. K[2] - 2 = 4 - 2 = 2. Hence there are 2 missing numbers up to index i = 2. But we need the 4th missing element. So we again have to search in the latter part of the array.
Now we reach an empty array. What should we do now? If we reach an empty array between say K[j] & K[j+1] then it simply means that all elements between K[j] and K[j+1] are missing from the array K.
Hence all elements above K[2] are missing from the array, namely 5 and 6. We need the 4th element out of which we have already discarded 2 elements. Hence we will choose the second element which is 6.
Binary search.
The basic algorithm:
(not quite the same as the other answer - the number is only generated at the end)
Start in the middle of K.
By looking at the current value and its index, we can determine the number of pickable numbers (numbers not in K) to the left.
Similarly, by including N, we can determine the number of pickable numbers to the right.
Now randomly go either left or right, weighted based on the count of pickable numbers on each side.
Repeat in the chosen subarray until the subarray is empty.
Then generate a random number in the range consisting of the numbers before and after the subarray in the array.
The running time would be O(log |K|), and, since |K| < N-1, O(log N).
The exact mathematics for number counts and weights can be derived from the example below.
Extension with K containing a bigger range:
Now let's say (for enrichment purposes) K can also contain values N or larger.
Then, instead of starting with the entire K, we start with a subarray up to position min(N, |K|), and start in the middle of that.
It's easy to see that the N-th position in K (if one exists) will be >= N, so this chosen range includes any possible number we can generate.
From here, we need to do a binary search for N (which would give us a point where all values to the left are < N, even if N could not be found) (the above algorithm doesn't deal with K containing values greater than N).
Then we just run the algorithm as above with the subarray ending at the last value < N.
The running time would be O(log N), or, more specifically, O(log min(N, |K|)).
Example:
N = 10
K = {0, 1, 4, 5, 8}
So we start in the middle - 4.
Given that we're at index 2, we know there are 2 elements to the left, and the value is 4, so there are 4 - 2 = 2 pickable values to the left.
Similarly, there are 10 - (4+1) - 2 = 3 pickable values to the right.
So now we go left with probability 2/(2+3) and right with probability 3/(2+3).
Let's say we went right, and our next middle value is 5.
We are at the first position in this subarray, and the previous value is 4, so we have 5 - (4+1) = 0 pickable values to the left.
And there are 10 - (5+1) - 1 = 3 pickable values to the right.
We can't go left (0 probability). If we go right, our next middle value would be 8.
There would be 2 pickable values to the left, and 1 to the right.
If we go left, we'd have an empty subarray.
So then we'd generate a number between 5 and 8, which would be 6 or 7 with equal probability.
This can be solved by basically solving this:
Find the rth smallest number not in the given array, K, subject to
conditions in the question.
For that consider the implicit array D, defined by
D[i] = K[i] - i for 0 <= i < L, where L is length of K
We also set D[-1] = 0 and D[L] = N
We also define K[-1] = 0.
Note, we don't actually need to construct D. Also note that D is sorted (and all elements non-negative), as the numbers in K[] are unique and increasing.
Now we make the following claim:
CLAIM: To find the rth smallest number not in K[], we need to find the rightmost occurrence of r' in D (which occurs at a position we call j), where r' is the largest number in D which is < r. Such an r' exists, because D[-1] = 0. Once we find such an r' (and j), the number we are looking for is r - r' + K[j].
Proof: Basically, the definition of r' and j tells us that there are exactly r' numbers missing from 0 to K[j], and at least r numbers missing from 0 to K[j+1]. Thus all the numbers from K[j]+1 to K[j+1]-1 are missing (and these missing numbers are at least r - r' in count), and the number we seek is among them, given by K[j] + r - r'.
Algorithm:
In order to find (r',j) all we need to do is a (modified) binary search for r in D, where we keep moving to the left even if we find r in the array.
This is an O(log K) algorithm.
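A small Python sketch of this binary search; here r is a 0-based rank (so r = 0 asks for the smallest missing number), and the search is done over the implicit D[i] = K[i] - i rather than materializing D:

import random

def rand_not_in(N, K):
    # K is sorted, distinct, with values in [0, N-1]
    r = random.randrange(N - len(K))      # 0-based rank of the missing number to return
    lo, hi = 0, len(K)
    while lo < hi:                        # find the first index i with K[i] - i > r
        mid = (lo + hi) // 2
        if K[mid] - mid > r:
            hi = mid
        else:
            lo = mid + 1
    return r + lo                         # exactly lo elements of K lie below the answer

print(rand_not_in(7, [0, 1, 5]))          # one of 2, 3, 4, 6, each with equal probability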
If you are running this many times, it probably pays to speed up your generation operation: O(log N) time just isn't acceptable.
Make an empty array G. Starting at zero, count upwards while progressing through the values of K. If a value isn't in K add it to G. If it is in K don't add it and progress your K pointer. (This relies on K being sorted.)
Now you have an array G which has only acceptable numbers.
Use your random number generator to choose a value from G.
This requires O(N) preparatory work and each generation happens in O(1) time. After N look-ups the amortized time of all operations is O(1).
A Python mock-up:
import random

class PRNG:
    def __init__(self, K, N):
        self.G = []
        kptr = 0
        for i in range(N):
            if kptr < len(K) and K[kptr] == i:
                kptr += 1
            else:
                self.G.append(i)

    def getRand(self):
        rn = random.randint(0, len(self.G)-1)
        return self.G[rn]

prng = PRNG([0, 1, 5], 7)
for i in range(20):
    print prng.getRand()

Minimum sum that cant be obtained from a set

Given a set S of positive integers whose elements need not be distinct, I need to find the minimal non-negative sum that can't be obtained from any subset of the given set.
Example : if S = {1, 1, 3, 7}, we can get 0 as (S' = {}), 1 as (S' = {1}), 2 as (S' = {1, 1}), 3 as (S' = {3}), 4 as (S' = {1, 3}), 5 as (S' = {1, 1, 3}), but we can't get 6.
Now we are given one array A, consisting of N positive integers. There are M queries; each consists of two integers Li and Ri describing the i'th query: we need to find this sum that can't be obtained from the array elements {A[Li], A[Li+1], ..., A[Ri-1], A[Ri]}.
I know how to find it by a brute-force approach in O(2^n), but given 1 ≤ N, M ≤ 100,000 this can't be done.
So is there any efficient approach to do it?
Concept
Suppose we had an array of bool representing which numbers so far haven't been found (by way of summing).
For each number n we encounter in the ordered (increasing values) subset of S, we do the following:
For each existing True value at position i in numbers, we set numbers[i + n] to True
We set numbers[n] to True
With this sort of a sieve, we would mark all the found numbers as True, and iterating through the array when the algorithm finishes would find us the minimum unobtainable sum.
Refinement
Obviously, we can't have a solution like this because the array would have to be infinite in order to work for all sets of numbers.
The concept could be improved by making a few observations. With an input of 1, 1, 3, the array becomes (in sequence):
1
1, 2
1, 2, 3, 4, 5
(numbers represent true values)
An important observation can be made:
(3) Each next number is added to all the numbers that had already been found. This implies that if there were no gaps before a number, there will be no gaps after that number has been processed.
For the next input of 7 we can assert that:
(4) Since the input set is ordered, there will be no number less than 7
(5) If there is no number less than 7, then 6 cannot be obtained
We can come to a conclusion that:
(6) the first gap represents the minimum unobtainable number.
Algorithm
Because of (3) and (6), we don't actually need the numbers array; we only need a single value, max, to represent the maximum number found so far.
This way, if the next number n is greater than max + 1, then a gap would have been made, and max + 1 is the minimum unobtainable number.
Otherwise, max becomes max + n. If we've run through the entire S, the result is max + 1.
Actual code (C#, easily converted to C):
static int Calculate(int[] S)
{
    int max = 0;
    for (int i = 0; i < S.Length; i++)
    {
        if (S[i] <= max + 1)
            max = max + S[i];
        else
            return max + 1;
    }
    return max + 1;
}
Should run pretty fast, since it's obviously linear time (O(n)). Since the input to the function should be sorted, with quicksort this becomes O(n log n). I've managed to get results for M = N = 100000 on 8 cores in just under 5 minutes.
With the numbers' upper limit of 10^9, a radix sort could be used to approximate O(n) time for the sorting; however, this would still be way over 2 seconds because of the sheer number of sorts required.
But we can use the statistical improbability of a 1 appearing to eliminate subsets before sorting. At the start, check if 1 exists in S; if not, then every query's result is 1, because it cannot be obtained.
Statistically, if we draw 10^5 numbers at random from a range of 10^9, we have a 99.9% chance of not getting a single 1.
Before each sort, check if that subset contains 1; if not, then its result is 1.
With this modification, the code runs in 2 milliseconds on my machine. Here's that code: http://pastebin.com/rF6VddTx
This is a variation of the subset-sum problem, which is NP-Complete, but there is a pseudo-polynomial Dynamic Programming solution you can adopt here, based on the recursive formula:
f(S, i) = f(S - arr[i], i-1) OR f(S, i-1)
f(S, i) = false   if S < 0
f(S, i) = false   if i < 0 (and S > 0)
f(0, i) = true
The recursive formula is basically an exhaustive search, each sum can be achieved if you can get it with element i OR without element i.
The dynamic programming is achieved by building a SUM+1 x n+1 table (where SUM is the sum of all elements, and n is the number of elements), and building it bottom-up.
Something like:
table <- (SUM+1) x (n+1) table
// init: sum 0 is achievable with any prefix of elements; a positive sum is not achievable with 0 elements
for each j from 0 to n:
    table[0][j] = true
for each i from 1 to SUM:
    table[i][0] = false
// fill the table:
for each i from 1 to SUM:
    for each j from 1 to n:
        if i < arr[j]:
            table[i][j] = table[i][j-1]
        else:
            table[i][j] = table[i-arr[j]][j-1] OR table[i][j-1]
Once you have the table, you need the smallest i such that for all j: table[i][j] = false
Complexity of solution is O(n*SUM), where SUM is the sum of all elements, but note that the algorithm can actually be trimmed after the required number was found, without the need to go on for the next rows, which are un-needed for the solution.
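For illustration, a compressed Python sketch of the same DP that keeps only a single row of reachable sums (iterating the sums downwards so each element is used at most once):

def min_unreachable_sum(arr):
    total = sum(arr)
    reachable = [False] * (total + 2)     # index s says whether sum s is obtainable
    reachable[0] = True                   # the empty subset gives 0
    for x in arr:
        for s in range(total, x - 1, -1): # go downwards so x is not reused within this pass
            if reachable[s - x]:
                reachable[s] = True
    return next(s for s in range(total + 2) if not reachable[s])

print(min_unreachable_sum([1, 1, 3, 7]))  # 6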

Longest subarray whose elements form a continuous sequence

Given an unsorted array of positive integers, find the length of the longest subarray whose elements when sorted are continuous. Can you think of an O(n) solution?
Example:
{10, 5, 3, 1, 4, 2, 8, 7}, answer is 5.
{4, 5, 1, 5, 7, 6, 8, 4, 1}, answer is 5.
For the first example, the subarray {5, 3, 1, 4, 2} when sorted can form a continuous sequence 1,2,3,4,5, which are the longest.
For the second example, the subarray {5, 7, 6, 8, 4} is the result subarray.
I can think of a method which, for each subarray, checks whether (maximum - minimum + 1) equals the length of that subarray; if true, then it is a continuous subarray. Take the longest of all. But it is O(n^2) and cannot deal with duplicates.
Can someone give a better method?
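For reference, a Python sketch of the O(n^2) check described in the question; tracking the values already seen in the window also lets it reject windows that contain duplicates:

def longest_continuous_subarray(arr):
    best = 1 if arr else 0
    for i in range(len(arr)):
        lo = hi = arr[i]
        seen = {arr[i]}
        for j in range(i + 1, len(arr)):
            if arr[j] in seen:            # a duplicate can never be part of a continuous run
                break
            seen.add(arr[j])
            lo, hi = min(lo, arr[j]), max(hi, arr[j])
            if hi - lo == j - i:          # (max - min + 1) == window length
                best = max(best, j - i + 1)
    return best

print(longest_continuous_subarray([10, 5, 3, 1, 4, 2, 8, 7]))      # 5
print(longest_continuous_subarray([4, 5, 1, 5, 7, 6, 8, 4, 1]))    # 5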
An algorithm to solve the original problem in O(n) without duplicates. Maybe it helps someone develop an O(n) solution that deals with duplicates.
Input: [a1, a2, a3, ...]
Map original array as pair where 1st element is a value, and 2nd is index of array.
Array: [[a1, i1], [a2, i2], [a3, i3], ...]
Sort this array of pairs with some O(n) algorithm (e.g Counting Sort) for integer sorting by value.
We get some another array:
Array: [[a3, i3], [a2, i2], [a1, i1], ...]
where a3, a2, a1, ... are in sorted order.
Run loop through sorted array of pairs
In linear time we can detect consecutive groups of numbers a3, a2, a1. Consecutive group definition is next value = prev value + 1.
During that scan keep current group size (n), minimum value of index (min), and current sum of indices (actualSum).
On each step inside consecutive group we can estimate sum of indices, because they create arithmetic progression with first element min, step 1, and size of group seen so far n.
This sum estimate can be done in O(1) time using formula for arithmetic progression:
estimate sum = (a1 + an) * n / 2;
estimate sum = (min + min + (n - 1)) * n / 2;
estimate sum = min * n + n * (n - 1) / 2;
If on some loop step inside consecutive group estimate sum equals to actual sum, then seen so far consecutive group satisfy the conditions. Save n as current maximum result, or choose maximum between current maximum and n.
If at some element we stop seeing a consecutive group, then reset all values and do the same.
Code example: https://gist.github.com/mishadoff/5371821
See the array S in its mathematical set definition:
S = U_{j=0..k} I_j
where the I_j are disjoint integer segments. You can design a specific interval tree (based on a Red-Black tree or a self-balancing tree that you like :) ) to store the array in this mathematical definition. The node and tree structures should look like these:
struct node {
    int d, u;
    int count;
    struct node *n_left, *n_right;
};
Here, d is the lesser bound of the integer segment and u, the upper bound. count is added to take care of possible duplicates in the array : when trying to insert an already existing element in the tree, instead of doing nothing, we will increment the count value of the node in which it is found.
struct root {
    struct node *root;
};
The tree will only store disjoint nodes; thus, the insertion is a bit more complex than a classical Red-Black tree insertion. When inserting intervals, you must scan for potential overlaps with already existing intervals. In your case, since you will only insert singletons, this should not add too much overhead.
Given three nodes P, L and R, L being the left child of P and R the right child of P. Then, you must enforce L.u < P.d and P.u < R.d (and for each node, d <= u, of course).
When inserting an integer segment [x,y], you must find "overlapping" segments, that is to say, intervals [d,u] that satisfy both of the following inequalities:
y >= d - 1
AND
x <= u + 1
If the inserted interval is a singleton x, then you can only find up to 2 overlapping interval nodes N1 and N2 such that N1.d == x + 1 and N2.u == x - 1. Then you have to merge the two intervals and update count, which leaves you with N3 such that N3.d = N2.d, N3.u = N1.u and N3.count = N1.count + N2.count + 1. Since the delta between N1.d and N2.u is the minimal delta for two segments to be disjoint, then you must have one of the following :
N1 is the right child of N2
N2 is the left child of N1
So the insertion will still be in O(log(n)) in the worst case.
From here, I can't figure out how to handle the order in the initial sequence but here is a result that might be interesting : if the input array defines a perfect integer segment, then the tree only has one node.
UPD2: The following solution is for a variant where it is not required that the subarray be contiguous. I misunderstood the problem statement. Not deleting this, as somebody may have an idea based on mine that will work for the actual problem.
Here's what I've come up with:
Create an instance of a dictionary (which is implemented as hash table, giving O(1) in normal situations). Keys are integers, values are hash sets of integers (also O(1)) – var D = new Dictionary<int, HashSet<int>>.
Iterate through the array A and for each integer n with index i do:
Check whether keys n-1 and n+1 are contained in D.
if neither key exists, do D.Add(n, new HashSet<int>)
if only one of the keys exists, e.g. n-1, do D.Add(n, D[n-1])
if both keys exist, do D[n-1].UnionWith(D[n+1]); D[n+1] = D[n] = D[n-1];
D[n].Add(n)
Now go through each key in D and find the hash set with the greatest length (finding length is O(1)). The greatest length will be the answer.
To my understanding, the worst case complexity will be O(n*log(n)), only because of the UnionWith operation. I don't know how to calculate the average complexity, but it should be close to O(n). Please correct me if I am wrong.
UPD: To speak code, here's a test implementation in C# that gives the correct result in both of the OP's examples:
var A = new int[] {4, 5, 1, 5, 7, 6, 8, 4, 1};
var D = new Dictionary<int, HashSet<int>>();
foreach (int n in A)
{
    if (D.ContainsKey(n-1) && D.ContainsKey(n+1))
    {
        D[n-1].UnionWith(D[n+1]);
        D[n+1] = D[n] = D[n-1];
    }
    else if (D.ContainsKey(n-1))
    {
        D[n] = D[n-1];
    }
    else if (D.ContainsKey(n+1))
    {
        D[n] = D[n+1];
    }
    else if (!D.ContainsKey(n))
    {
        D.Add(n, new HashSet<int>());
    }
    D[n].Add(n);
}
int result = int.MinValue;
foreach (HashSet<int> H in D.Values)
{
    if (H.Count > result)
    {
        result = H.Count;
    }
}
Console.WriteLine(result);
This will require two passes over the data. First create a hash map, mapping ints to bools. I updated my algorithm to not use std::map from the STL, which keeps its keys in sorted order internally. This algorithm uses hashing, and can be easily updated for any maximum or minimum combination, even potentially all possible values an integer can obtain.
#include <iostream>
using namespace std;

// MINIMUM must cover the smallest possible input value (the sample below contains negatives).
const int MINIMUM = -100;
const int MAXIMUM = 100;
const unsigned int ARRAY_SIZE = MAXIMUM - MINIMUM + 1;

int main() {
    bool* hashOfIntegers = new bool[ARRAY_SIZE];
    //const int someArrayOfIntegers[] = {10, 9, 8, 6, 5, 3, 1, 4, 2, 8, 7};
    //const int someArrayOfIntegers[] = {10, 6, 5, 3, 1, 4, 2, 8, 7};
    const int someArrayOfIntegers[] = {-2, -3, 8, 6, 12, 14, 4, 0, 16, 18, 20};
    const int SIZE_OF_ARRAY = 11;
    //Initialize hashOfIntegers values to false, probably unnecessary but good practice.
    for(unsigned int i = 0; i < ARRAY_SIZE; i++) {
        hashOfIntegers[i] = false;
    }
    //Change appropriate values to true.
    for(int i = 0; i < SIZE_OF_ARRAY; i++) {
        //We subtract the MINIMUM value to normalize the MINIMUM value to a zero index for negative numbers.
        hashOfIntegers[someArrayOfIntegers[i] - MINIMUM] = true;
    }
    int sequence = 0;
    int maxSequence = 0;
    //Find the maximum sequence in the values
    for(unsigned int i = 0; i < ARRAY_SIZE; i++) {
        if(hashOfIntegers[i]) sequence++;
        else sequence = 0;
        if(sequence > maxSequence) maxSequence = sequence;
    }
    cout << "MAX SEQUENCE: " << maxSequence << endl;
    delete[] hashOfIntegers;
    return 0;
}
The basic idea is to use the hash map as a bucket sort, so that you only have to do two passes over the data. This algorithm is O(2n), which in turn is O(n)
Don't get your hopes up, this is only a partial answer.
I'm quite confident that the problem is not solvable in O(n). Unfortunately, I can't prove it.
If there is a way to solve it in less than O(n^2), I'd suspect that the solution is based on the following strategy:
Decide in O(n) (or maybe O(n log n)) whether there exists a continuous subarray as you describe it with at least i elements. Let's call this predicate E(i).
Use bisection to find the maximum i for which E(i) holds.
The total running time of this algorithm would then be O(n log n) (or O(n log^2 n)).
This is the only way I could come up with to reduce the problem to another problem that at least has the potential of being simpler than the original formulation. However, I couldn't find a way to compute E(i) in less than O(n^2), so I may be completely off...
Here's another way to think of your problem: suppose you have an array composed only of 1s and 0s and you want to find the longest consecutive run of 1s. This can be done in linear time by run-length encoding the 1s (ignoring the 0s). In order to transform your original problem into this new run-length encoding problem, you compute a new array b[i] = (a[i] < a[i+1]). This doesn't have to be done explicitly; you can just do it implicitly to achieve an algorithm with constant memory requirement and linear complexity.
Here are 3 acceptable solutions:
The first is O(n log(n)) in time and O(n) space, the second is O(n) in time and O(n) in space, and the third is O(n) in time and O(1) in space.
1. Build a binary search tree, then traverse it in order. Keep two pointers, one for the start of the max subset and one for the end, and keep the max_size value while iterating the tree. It is O(n*log(n)) time and space complexity.
2. You can always sort the number set using counting sort in linear time and run through the array, which means O(n) time and space complexity.
3. Assuming there isn't overflow (or you have a big-integer data type), and assuming the array is a mathematical set (no duplicate values), you can do it in O(1) of memory: calculate the sum of the array and the product of the array, and figure out what numbers you have in it given the min and max of the original set. In total it is O(n) time complexity.
