Having a sequence of n <= 10^6 integers, all not exceeding m <= 3*10^6, I'd like to count how many coprime pairs are in it. Two numbers are coprime if their greatest common divisor is 1.
It can be done trivially in O(n^2 log n), but this is obviously way too slow, as the limits suggest something closer to O(n log n). One thing that can be done quickly is factoring all the numbers, and also throwing out repeated occurrences of the same prime in each, but that doesn't lead to any significant improvement. I also thought of counting the opposite - pairs that have a common divisor. It could be done in groups - first counting all the pairs whose smallest common prime divisor is 2, then 3, then 5, and so on - but it seems to me like another dead end.
I've come up with a slightly faster alternative based on your answer. On my work PC my C++ implementation (below) takes about 350 ms to solve any problem instance; on my old laptop, it takes just over 1 s. This algorithm avoids all division and modulo operations, and uses only O(m) space.
As with your algorithm, the basic idea is to apply the Inclusion-Exclusion Principle by enumerating, exactly once, every number 2 <= i <= m that contains no repeated prime factors, and for each such i, counting how many numbers in the input are divisible by i and either adding or subtracting this from the total. The key difference is that we can do the counting part "stupidly", simply by testing whether each possible multiple of i appears in the input, and this still takes just O(m log m) time.
How many times does the innermost line c += v[j].freq; in countCoprimes() repeat? The body of the outer loop is executed once for each number 2 <= i <= m that contains no repeated prime factors; this iteration count is trivially upper-bounded by m. The inner loop advances i steps at a time through the range [2..m], so the number of operations it performs during a single outer loop iteration is upper-bounded by m / i. Therefore the total number of iterations of the innermost line is upper-bounded by the sum from i=2 to m of m/i. The m factor can be moved outside the sum to get an upper bound of
m * sum{i=2..m}(1/i)
That sum is a partial sum of the harmonic series, and it is upper-bounded by log(m), so the total number of innermost loop iterations is O(m log m).
extendedEratosthenes() is designed to reduce constant factors by avoiding all divisions and keeping to O(m) memory usage. All countCoprimes() actually needs to know for a number 2 <= i <= m is (a) whether it has repeated prime factors, and if it doesn't, (b) whether it has an even or odd number of prime factors.

To calculate (b), we can use the fact that the Sieve of Eratosthenes effectively "hits" any given i with its distinct prime factors in increasing order, so we can just flip a bit (the parity field in struct entry) to keep track of whether i has an even or odd number of factors. Each number starts with a prod field equal to 1; to record (a), we simply "knock out" any number that contains the square of a prime as a factor by setting its prod field to 0. This field serves a dual purpose: if v[i].prod == 0, it indicates that i was discovered to have repeated factors; otherwise it contains the product of the (necessarily distinct) prime factors discovered so far.

The (fairly minor) utility of this is that it allows us to stop the main sieve loop at the square root of m, instead of going all the way up to m: by then, for any given i that has no repeated factors, either v[i].prod == i, in which case we have found all the factors of i, or v[i].prod < i, in which case i must have exactly one factor greater than sqrt(m) that we have not yet accounted for. We can find all such remaining "large factors" with a second, non-nested loop.
#include <iostream>
#include <vector>

using namespace std;

struct entry {
    int freq;    // Frequency that this number occurs in the input list
    int parity;  // 0 for even number of factors, 1 for odd number
    int prod;    // Product of distinct prime factors
};

const int m = 3000000;  // Maximum input value
int n = 0;              // Will be number of input values
vector<entry> v;

void extendedEratosthenes() {
    int i;
    for (i = 2; i * i <= m; ++i) {
        if (v[i].prod == 1) {  // No smaller prime divides i, so i is prime
            for (int j = i, k = i; j <= m; j += i) {
                if (--k) {
                    v[j].parity ^= 1;
                    v[j].prod *= i;
                } else {
                    // j has a repeated factor of i: knock it out.
                    v[j].prod = 0;
                    k = i;
                }
            }
        }
    }
    // Fix up numbers with a prime factor above their square root.
    for (; i <= m; ++i) {
        if (v[i].prod && v[i].prod != i) {
            v[i].parity ^= 1;
        }
    }
}

void readInput() {
    int i;
    while (cin >> i) {
        ++v[i].freq;
        ++n;
    }
}

void countCoprimes() {
    long long total = static_cast<long long>(n) * (n - 1) / 2;
    for (int i = 2; i <= m; ++i) {
        if (v[i].prod) {
            // i must have no repeated factors.
            int c = 0;
            for (int j = i; j <= m; j += i) {
                c += v[j].freq;
            }
            // parity == 1 (odd number of prime factors): subtract C(c, 2);
            // parity == 0 (even): add it back.
            total -= (v[i].parity * 2 - 1) * static_cast<long long>(c) * (c - 1) / 2;
        }
    }
    cerr << "Total number of coprime pairs: " << total << "\n";
}

int main(int argc, char **argv) {
    cerr << "Initialising array...\n";
    entry initialElem = { 0, 0, 1 };
    v.assign(m + 1, initialElem);
    cerr << "Performing extended Sieve of Eratosthenes...\n";
    extendedEratosthenes();
    cerr << "Reading input...\n";
    readInput();
    cerr << "Counting coprimes...\n";
    countCoprimes();
    return 0;
}
Further exploiting the ideas I mentioned in my question, I actually managed to come up with a solution myself. As some of you may be interested in it, I will describe it briefly. It works in O(m log m + n); I've already implemented it in C++ and tested it - it solves the biggest cases (10^6 integers) in less than 5 seconds.
We have n integers, all not greater than m. We start by running the Sieve of Eratosthenes, mapping each integer up to m to its smallest prime factor, which lets us factor any number not greater than m in O(log m) time. Then, for each given number A[i], as long as there is some prime p that divides A[i] to a power greater than one, we divide A[i] by it, because when asking whether two numbers are coprime we can ignore the exponents. That leaves all A[i] being products of distinct primes.
Now, let us assume that we were able to construct, in reasonable time, a table T such that T[i] is the number of entries A[j] such that i divides A[j]. This is somewhat similar to the approach @Brainless took in his second answer. Constructing table T quickly was the technique I spoke about in the comments below my question.
From now on, we work by the Inclusion-Exclusion Principle. Having T, for each i we calculate P[i] - the number of pairs (j,k) such that A[j] and A[k] are both divisible by i. Then we sum all the P[i], taking a minus sign before those P[i] for which i has an even number of prime divisors (only i with all prime divisors distinct matter; for every other i, P[i] contributes 0). This sum counts each non-coprime pair exactly once, so subtracting it from the total number of pairs, n(n-1)/2, gives the answer. To see why each non-coprime pair is counted once, take a pair A[j] and A[k] that share exactly d common prime divisors. The pair is counted C(d,1) times, then discounted C(d,2) times, counted C(d,3) times, discounted C(d,4) times, and so on (C(d,i) denotes the binomial coefficient). A little manipulation shows that the considered pair is counted 1 - (1-1)^d = 1 times, which concludes the proof.
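To make this final inclusion-exclusion step concrete, here is a small Python sketch (hedged: is_squarefree and omega, the number of distinct prime divisors, are hypothetical helpers that fall out of the sieve; T is the table described above):

import math  # not strictly needed; kept for clarity of intent

def coprime_pairs(n, m, T, is_squarefree, omega):
    # Count pairs sharing at least one prime via inclusion-exclusion,
    # then subtract that from the total number of pairs.
    non_coprime = 0
    for i in range(2, m + 1):
        if is_squarefree[i]:
            p_i = T[i] * (T[i] - 1) // 2          # P[i]
            # + for an odd number of prime divisors, - for an even number
            non_coprime += p_i if omega[i] % 2 else -p_i
    return n * (n - 1) // 2 - non_coprime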
The steps made so far require O(m log log m) for the sieve and O(m) for computing the result. The last thing to do is to construct the array T. We could, for every A[i], just increment T[j] for all j dividing A[i]. As A[i] can have O(sqrt(A[i])) divisors (and in practice even fewer), we could construct T in O(n sqrt m). But we can do better than that!
Take a two-dimensional array W. At each moment the following invariant holds: if, for each non-zero W[i][j], we incremented T[d] by W[i][j] for every d that divides i and shares the exact exponents i has in its j smallest prime divisors, then T would be constructed properly. As this may seem a little confusing, let's see it in action.

At the start, to make the invariant true, for each A[i] we just increment W[A[i]][0]. Note that a number not exceeding m can have at most O(log m) prime divisors, so the overall size of W is O(m log m). Now, the information stored in W[i][j] can be "pushed forward" in the following way: let p be the (j+1)-th smallest prime divisor of i, assuming it has one. Then any divisor counted by W[i][j] either takes p with the same exponent as in i, or with a lower one. The first case goes to W[i][j+1] - we add another prime that has to be "fully taken" by a divisor. The second case goes to W[i/p][j], as a divisor of i that doesn't take p at its highest exponent must also divide i/p. And that's it!

We consider all i in descending order, then j in ascending order, and "push forward" the information from W[i][j]. If i has exactly j prime divisors, then the information cannot be pushed any further, but we don't need it to be: W[i][j] then simply says "increment index i in array T by W[i][j]". So once all the information has been pushed to the "last row" of each W[i], we pass through those rows and finish constructing T. As each cell W[i][j] is visited once, this algorithm takes O(m log m) time, plus O(n) at the beginning. That concludes the construction. Here's some C++ code from the actual implementation:
// FORD(i,a,b): for i from a down to b; FOR(j,a,b): for j from a up to b;
// SIZE(x) is (int)x.size().
FORD(i, SIZE(W)-1, 2)              // i in descending order
{
    int v = i, p;
    FOR(j, 0, SIZE(W[i])-2)        // exclude the last row
    {
        p = S[v];                  // the (j+1)-th smallest prime divisor of i;
                                   // S[v] is the smallest prime divisor of v
        while (v % p == 0) v /= p;
        W[i][j+1] += W[i][j];      // divisor takes p at its full exponent
        W[i/p][j] += W[i][j];      // divisor takes p at a lower exponent
    }
    T[i] = W[i].back();
}
At the end, I'd say that I think the array T can be constructed faster and more simply than what I've shown. If anyone has a neat idea about how it could be done, I would appreciate all feedback.
Here's an idea based on the formula for the complete sequence 1..n, found on http://oeis.org/A018805:
a(n) = 2*( Sum phi(j), j=1..n ) - 1, where phi is Euler's totient function
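For what it's worth, the formula is easy to check with a small totient sieve (a quick sketch; note that A018805 counts ordered pairs (x, y) with 1 <= x, y <= n and gcd(x, y) = 1):

def phi_sieve(n):
    phi = list(range(n + 1))
    for p in range(2, n + 1):
        if phi[p] == p:  # p is prime: no smaller prime has touched it
            for k in range(p, n + 1, p):
                phi[k] -= phi[k] // p
    return phi

def a018805(n):
    return 2 * sum(phi_sieve(n)[1:]) - 1

print(a018805(4))  # 11: all 16 ordered pairs except (2,2), (2,4), (4,2), (4,4), (3,3)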
Iterate over the sequence, S. For each term, S_i:
for each of the prime factors, p, of S_i:
if a hash for p does not exist:
create a hash with index p that points to a set of all indexes of S except i,
and a counter set to 1, representing how many terms of S are divisible by p so far
else:
delete i in the existing set of indexes and increment the counter
Sort the hashes for S_i's prime factors by their counters in descending order. Starting with
the largest counter (which means the smallest set), make a list of indexes up to i that are also
members of the next smallest set, until the sets are exhausted. Add the remaining number of
indexes in the list to the cumulative total.
Example:
sum phi' [4,7,10,15,21]
S_0: 4
prime-hash [2:1-4], counters [2:1]
0 indexes up to i in the set for prime 2
total 0
S_1: 7
prime hash [2:1-4; 7:0,2-4], counters [2:1, 7:1]
1 index up to i in the set for prime 7
total 1
S_2: 10
prime hash [2:1,3-4; 5:0-1,3-4; 7:0,2-4], counters [2:2, 5:1, 7:1]
1 index up to i in the set for prime 2, which is also a member
of the set for prime 5
total 2
S_3: 15
prime hash [2:1,3-4; 5:0-1,4; 7:0,2-4; 3:0-2,4], counters [2:2, 5:2, 7:1, 3:1]
2 indexes up to i in the set for prime 5, which are also members
of the set for prime 3
total 4
S_4: 21
prime hash [2:1,3-4; 5:0-1,4; 7:0,2-3; 3:0-2], counters [2:2, 5:2, 7:2, 3:2]
2 indexes up to i in the set for prime 7, which are also members
of the set for prime 3
total 6
6 coprime pairs:
(4,7),(4,15),(4,21),(7,10),(7,15),(10,21)
I would suggest:
1) Use Eratosthene to get a list of sorted prime numbers under 10^6.
2) For each number n in the list, get its prime factors. Associate with it another number f(n) in the following way: let's say that the prime factors of n are 3, 7 and 17. Then the binary representation of f(n) is:
`0 1 0 1 0 0 1`
The first digit (0 here) is associated with the prime number 2, the second (1 here) with the prime number 3, etc.
Therefore 2 numbers n and m are coprime iff f(n) & f(m) = 0.
3) It's easy to see that there is an N such that for each n: f(n) <= (2^N) - 1. This means that the biggest number f(n) is smaller than or equal to a number whose binary representation is:
`1 1 1 1 1 1 1 1 1 1 1 1 1 1 1`
Here N is the number of 1s in the above sequence. Get this N and sort the list of numbers f(n). Let's call this list L.
If you want to optimize: in this list, instead of sorting duplicates, store a pair containing f(n) and the number of times f(n) is duplicated.
4) Iterate from 1 to N in this way: initialize i = 1 0 0 0 0 (in binary), and at each iteration move the digit 1 one place to the right, with all other digits kept at 0 (implement it using bit shifts).
At each iteration, iterate over L to get the number d(i) of elements l in L such that i & l != 0 (be careful if you use the above optimization). In other words, for each i, get the number of elements in L which are not coprime with i, and call this number d(i). Add up the total
D = d(1) + d(2) + ... + d(N)
5) This number D is the number of pairs which are not coprime in the original list. The number of coprime pairs is :
M*(M-1)/2 - D
where M is the number of elements in the original list. The complexity of this method is O(n log(n)).
Good luck!
My previous answer was wrong, apologies. I propose here a modification:
Once you have the prime divisors of each number in the list, associate with each prime number p the number l(p) of numbers in the list that have p as a divisor. For example, consider the prime number 5: if the numbers in the list divisible by 5 are 15, 100 and 255, then l(5) = 3.
To achieve this in O(n log n), iterate over the list, and for each number in the list, iterate over its prime factors; for each prime factor p, increment its l(p).
Then the number of pairs which are not coprime and can be divided by p is
l(p)*(l(p) - 1) / 2
Sum this number over all primes p, and you will get the number of pairs in the list which are not coprime (note that l(p) can be 0). Let's say this sum is D; then the answer is
M*(M-1)/2 - D
where M is the length of the list. Good luck!
The pseudocode is below. How do I calculate the time complexity of this program?
Algorithm MinValue(A, n):
    Input: An integer array A of size n     // 1
    Output: The smallest value in A
    minValue <- A[0]                        // 1
    for k = 1 to n-1 do                     // n
        if (minValue > A[k]) then           // n-1
            minValue <- A[k]                // 1
    return minValue                         // 1
so, it's 1+1+n+n-1+1+1 = 2n+3, is it correct?
This is a simpler program:
Algorithm MaxInt(a, b):
    Input: Two integers a and b             // 1
    Output: The larger of the two integers
    if a > b then                           // 1
        return a                            // 1
    else
        return b                            // 1
total operations = 4, is it correct?
Could anyone tell me the correct answer? Thanks
You were close.
In the first program, the number of times minValue <- A[k] is executed can be n-1 in the worst case (if the numbers are sorted in descending order).
Therefore the total number of operations is bounded by 1 + (n-1) + (n-1) + (n-1) + 1 = 3n - 1 = O(n).
For the second program, either return a or return b is executed. Therefore the number of operations is 2 (the condition + the chosen return statement).
Complexity of a conditional statement is maximum of complexities of the branches, plus complexity of the condition expression.
Is there any method to do this?
I mean, we cannot even work with an "is in the set" array over {0, 1, ..., N-1} (because that alone is at least O(N) memory).
M can be equal to N. N can be greater than 2^64. The result should be uniformly random, and ideally every possible sequence should be obtainable (though this is not strictly required).
Also, full-range PRNGs (and friends) aren't suitable, because they would give the same sequence each time.
Time complexity doesn't matter.
If you don't care what order the random selection comes out in, then it can be done in constant memory. The selection comes out in order.
The answer hinges on estimating the probability that the smallest value in a random selection of M distinct values from the set {0, ..., N-1} is i, for each possible i. Call this value p(i, M, N). With more mathematics than I have the patience to type into an interface which doesn't support LaTeX, you can derive some pretty good estimates for the p function; here, I'll just show the simple, non-time-efficient approach.
Let's just focus on p(0, M, N), which is the probability that a random selection of M out of N objects will include the first object. Then we can iterate through the objects (that is, the numbers 0...N-1) one at a time; deciding for each one whether it is included or not by flipping a weighted coin. We just need to compute the coin's weights for each flip.
By definition, there are C(N, M) possible M-selections of a set of N objects. Of these, C(N-1, M) do not include the first element. (That's the count of M-selections of N-1 objects, which is all the M-selections of the set missing one element.) Similarly, C(N-1, M-1) selections do include the first element (that is, all the (M-1)-selections of the (N-1)-set, with the first element added to each selection).

These two values add up to C(N, M); that's Pascal's rule, the well-known recurrence for computing C.

So p(0, M, N) is just C(N-1, M-1) / C(N, M). Since C(N, M) = N!/(M!*(N-M)!), we can simplify that fraction to M/N. As expected, if M == N, that works out to 1 (M of N objects must include every object).
So now we know what the probability that the first object will be in the selection. We can then reduce the size of the set, and either reduce the remaining selection size or not, depending on whether the coin flip determined that we did or did not include the first object. So here's the final algorithm, in pseudo-code, based on the existence of the weighted random boolean function:
w(x, y) => true with probability x / y; otherwise false.
I'll leave the implementation of w for the reader, since it's trivial.
So:
Generate a random M-selection from the set 0...N-1
Parameters: M, N

Set i = 0
while M > 0:
    if w(M, N):
        output i
        M = M - 1
    N = N - 1
    i = i + 1
It might not be immediately obvious that that works, but note that:
the output i statement must be executed exactly M times, since it is coupled with a decrement of M, and the while loop executes until M is 0
The closer M gets to N, the higher the probability that M will be decremented. If we ever get to the point where M == N, then both will be decremented in lockstep until they both reach 0.
i is incremented exactly when N is decremented, so it must always be in the range 0...N-1. In fact, it's redundant; we could output N-1 instead of outputting i, which would change the algorithm to produce sets in decreasing order instead of increasing order. I didn't do that because I think the above is easier to understand.
The time complexity of that algorithm is O(N+M) which must be O(N). If N is large, that's not great, but the problem statement said that time complexity doesn't matter, so I'll leave it there.
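For concreteness, here is a minimal Python sketch of the pseudocode above; this is essentially Knuth's selection-sampling technique, with random.randrange(n) < m playing the role of w(M, N):

import random

def random_selection(m, n):
    """Yield a sorted random m-subset of {0, ..., n-1}. Requires m <= n."""
    i = 0
    while m > 0:
        if random.randrange(n) < m:   # true with probability m / n
            yield i
            m -= 1
        n -= 1
        i += 1

print(list(random_selection(5, 100)))  # e.g. [3, 17, 42, 70, 99]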
PRNGs that don't map their state space to a lower number of bits for output should work fine. Examples include Linear Congruential Generators and Tausworthe generators. They will give the same sequence if you use the same seed to start them, but that's easy to change.
Brute force:
If time complexity doesn't matter, this is a solution under the invariant 0 < M <= N. nextRandom(N) is a function which returns a random integer in [0..N):
init() {
    for (int idx = 0; idx < N; idx++) {
        a[idx] = -1;
    }
    for (int idx = 0; idx < M; idx++) {
        getNext();
    }
}

int getNext() {
    // Shift the window of remembered values left by one...
    for (int idx = 1; idx < M; idx++) {
        a[idx - 1] = a[idx];
    }
    // ...then draw candidates until one differs from all remembered values.
    while (true) {
        r = nextRandom(N);
        idx = 0;
        while (idx < M && a[idx] != r) idx++;
        if (idx == M) {
            a[idx - 1] = r;
            return r;
        }
    }
}
O(M) solution: it is recursive for simplicity. It additionally assumes a nextRandom() with no arguments, which returns a random real number in [0..1):
rnd(0, 0, N, M); // to get next M distinct random numbers
int rnd(int idx, int n1, int n2, int m) {
    if (n1 >= n2 || m <= 0) return idx;
    int r = nextRandom(n2 - n1) + n1;
    int m1 = (int) ((m - 1.0) * (r - n1) / (n2 - n1) + nextRandom()); // gives [0..m-1]
    int m2 = m - m1 - 1;
    idx = rnd(idx, n1, r - 1, m1);
    print r;
    return rnd(idx + 1, r + 1, n2, m2);
}
The idea is to select a random r in [0..N) in the first step, which splits the range into two sub-ranges of N1 and N2 elements each (N1 + N2 == N - 1). We then repeat the same step for [0..r), which has N1 elements, and [r+1..N) (N2 elements), choosing M1 and M2 (M1 + M2 == M - 1) so that M1/M2 == N1/N2. M1 and M2 must be integers, but the proportion can give real results, so we round the values probabilistically (1.2 will give 1 with p = 0.8 and 2 with p = 0.2, etc.).
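The probabilistic rounding mentioned at the end might look like this (a sketch; prob_round is a hypothetical helper name):

import random

def prob_round(x):
    """Round non-negative x up or down so that the expected result equals x.
    e.g. 1.2 -> 1 with probability 0.8, and 2 with probability 0.2."""
    base = int(x)
    return base + (random.random() < x - base)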
Let A[1 .. n] be an array of n distinct numbers. If i < j and A[i] > A[j], then the pair (i, j) is called an inversion of A. (See Problem 2-4 for more on inversions.) Suppose that each element of A is chosen randomly, independently, and uniformly from the range 1 through n. Use indicator random variables to compute the expected number of inversions.
The problem is from exercise 5.2-5 in Introduction to Algorithms by Cormen. Here is my recursive solution:
Suppose x(i) is the number of inversions in a[1..i] and E(i) is the expected value of x(i); then E(i+1) can be computed as follows:

Imagine we have i+1 positions in which to place all the numbers. If we place i+1 in the first position, then x(i+1) = i + x(i); if we place i+1 in the second position, then x(i+1) = i-1 + x(i), and so on. Therefore E(i+1) = (1/(i+1)) * sum_{k=0..i} k + E(i) = i/2 + E(i).

Because we know that E(2) = 0.5, recursively we get: E(n) = (n-1 + n-2 + ... + 2)/2 + 0.5 = n(n-1)/4.

Although the deduction above seems right, I am still not completely sure of it. So I share it here.
If there is something wrong, please correct me.
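A quick Monte Carlo check of the claimed E(n) = n(n-1)/4 (a sketch; for n = 8 the average should come out near 8*7/4 = 14):

import random
from itertools import combinations

def count_inversions(a):
    return sum(a[i] > a[j] for i, j in combinations(range(len(a)), 2))

n, trials = 8, 20000
perms = (random.sample(range(1, n + 1), n) for _ in range(trials))
avg = sum(count_inversions(p) for p in perms) / trials
print(avg, n * (n - 1) / 4)  # the two numbers should be close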
All the solutions seem to be correct, but the problem says that we should use indicator random variables. So here is my solution using the same:
Let Eij be the event that i < j and A[i] > A[j].

Let Xij = I{Eij} = 1 if (i, j) is an inversion of A, and 0 if (i, j) is not an inversion of A.

Let X = Σ(i=1 to n) Σ(j=1 to n) Xij = number of inversions of A.

E[X] = E[Σ(i=1 to n) Σ(j=1 to n) Xij]
     = Σ(i=1 to n) Σ(j=1 to n) E[Xij]
     = Σ(i=1 to n) Σ(j=1 to n) P(Eij)
     = Σ(i=1 to n) Σ(j=i+1 to n) P(Eij)   (as we must have i < j)
     = Σ(i=1 to n) Σ(j=i+1 to n) (1/2)    (of the two orderings of any pair of
                                           distinct values, exactly one is an
                                           inversion, so P(Eij) = 1/2)
     = Σ(i=1 to n) (n - i)/2
     = n(n - 1)/4
Another solution is even simpler, IMO, although it does not use "indicator random variables".
Since all of the numbers are distinct, every pair of elements is either an inversion (i < j with A[i] > A[j]) or a non-inversion (i < j with A[i] < A[j]). Put another way, every pair of numbers is either in order or out of order.
So for any given permutation, the total number of inversions plus non-inversions is just the total number of pairs, or n*(n-1)/2.
By symmetry of "less than" and "greater than", the expected number of inversions equals the expected number of non-inversions.
Since the expectation of their sum is n*(n-1)/2 (constant for all permutations), and they are equal, they are each half of that or n*(n-1)/4.
[Update 1]
Apparently my "symmetry of 'less than' and 'greater than'" statement requires some elaboration.
For any array of numbers A in the range 1 through n, define ~A as the array you get when you subtract each number from n+1. For example, if A is [2,3,1], then ~A is [2,1,3].
Now, observe that for any pair of numbers in A that are in order, the corresponding elements of ~A are out of order. (Easy to show because negating two numbers exchanges their ordering.) This mapping explicitly shows the symmetry (duality) between less-than and greater-than in this context.
So, for any A, the number of inversions equals the number of non-inversions in ~A. But for every possible A, there corresponds exactly one ~A; when the numbers are chosen uniformly, both A and ~A are equally likely. Therefore the expected number of inversions in A equals the expected number of inversions in ~A, because these expectations are being calculated over the exact same space.
Therefore the expected number of inversions in A equals the expected number of non-inversions. The sum of these expectations is the expectation of the sum, which is the constant n*(n-1)/2, or the total number of pairs.
[Update 2]
A simpler symmetry: For any array A of n elements, define ~A as the same elements but in reverse order. Associate the element at position i in A with the element at position n+1-i in ~A. (That is, associate each element with itself in the reversed array.)
Now any inversion in A is associated with a non-inversion in ~A, just as with the construction in Update 1 above. So the same argument applies: The number of inversions in A equals the number of inversions in ~A; both A and ~A are equally likely sequences; etc.
The point of the intuition here is that the "less than" and "greater than" operators are just mirror images of each other, which you can see either by negating the arguments (as in Update 1) or by swapping them (as in Update 2). So the expected number of inversions and non-inversions is the same, since you cannot tell whether you are looking at any particular array through a mirror or not.
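Both constructions are easy to sanity-check exhaustively for small n (a sketch):

from itertools import combinations, permutations

def inversions(a):
    return sum(a[i] > a[j] for i, j in combinations(range(len(a)), 2))

n = 4
pairs = n * (n - 1) // 2
for a in permutations(range(1, n + 1)):
    neg = [n + 1 - x for x in a]  # ~A from Update 1
    rev = a[::-1]                 # ~A from Update 2
    assert inversions(a) == pairs - inversions(neg) == pairs - inversions(rev)
print("duality verified for all permutations of", n, "elements")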
Even simpler (similar to Aman's answer above, but perhaps clearer) ...
Let Xij be a random variable with Xij=1 if A[i] > A[j] and Xij=0 otherwise.
Let X=sum(Xij) over i, j where i < j
Number of pairs (ij)*: n(n-1)/2
Probability that Xij = 1, Pr(Xij=1): 1/2

By linearity of expectation**:

E(X) = E(sum(Xij))
     = sum(E(Xij))
     = sum(Pr(Xij=1))
     = n(n-1)/2 * 1/2
     = n(n-1)/4
* I think of this as the size of the upper triangle of a square matrix.
** All sums here are over i, j, where i < j.
I think it's right, but I think the proper way to prove it is to use conditional expectations:
for all X and Y we have : E[X] =E [E [X|Y]]
then in your case :
E(i+1) = E[x(i+1)] = E[E[x(i+1) | x(i)]] = E[SUM(k)/(1+i) + x(i)] = i/2 + E[x(i)] = i/2 + E(i)
About the second statement: if

E(n) = n*(n-1)/4

then E(n+1) = (n+1)*n/4 = (n-1)*n/4 + 2*n/4 = (n-1)*n/4 + n/2 = E(n) + n/2.

So n*(n-1)/4 satisfies the recursion relation for all n >= 2, and it holds for n = 2; therefore E(n) = n*(n-1)/4.
Hope I understood your problem and it helps
Using indicator random variables:
Let X = random variable which is equal to the number of inversions.
Let Xij = 1 if A[i] and A[j] form an inversion pair, and Xij = 0 otherwise.
Number of inversion pairs = Sum over 1 <= i < j <= n of (Xij)
Now P[Xij = 1] = P[A[i] > A[j]] = (n choose 2) / (2! * (n choose 2)) = 1/2

E[X] = E[sum over all pairs i < j of Xij]
     = sum over all pairs i < j of E[Xij]
     = (n(n-1)/2) * (1/2)
     = n(n-1)/4
An interesting interview question that a colleague of mine uses:
Suppose that you are given a very long, unsorted list of unsigned 64-bit integers. How would you find the smallest non-negative integer that does not occur in the list?
FOLLOW-UP: Now that the obvious solution by sorting has been proposed, can you do it faster than O(n log n)?
FOLLOW-UP: Your algorithm has to run on a computer with, say, 1GB of memory
CLARIFICATION: The list is in RAM, though it might consume a large amount of it. You are given the size of the list, say N, in advance.
If the data structure can be mutated in place and supports random access, then you can do it in O(N) time and O(1) additional space. Just go through the array sequentially, and for every index, write the value at that index to the index specified by the value, recursively placing any value at that location to its place and throwing away values > N. Then go through the array again, looking for the first spot where the value doesn't match the index - that's the smallest value not in the array. This results in at most 3N comparisons and only uses a few values' worth of temporary space.
def smallest_missing(array, N):
    # Pass 1, move every value to the position of its value
    for cursor in range(N):
        target = array[cursor]
        while target < N and target != array[target]:
            new_target = array[target]
            array[target] = target
            target = new_target
    # Pass 2, find first location where the index doesn't match the value
    for cursor in range(N):
        if array[cursor] != cursor:
            return cursor
    return N
Here's a simple O(N) solution that uses O(N) space. I'm assuming that we are restricting the input list to non-negative numbers and that we want to find the first non-negative number that is not in the list.
1) Find the length of the list; let's say it is N.
2) Allocate an array of N booleans, initialized to all false.
3) For each number X in the list, if X is less than N, set the X'th element of the array to true.
4) Scan the array starting from index 0, looking for the first element that is false. If you find the first false at index I, then I is the answer. Otherwise (i.e. when all elements are true) the answer is N.
In practice, the "array of N booleans" would probably be encoded as a "bitmap" or "bitset" represented as a byte or int array. This typically uses less space (depending on the programming language) and allows the scan for the first false to be done more quickly.
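Here is a minimal Python sketch of steps 1 through 4 (assuming, as stated, non-negative inputs):

def smallest_missing(numbers):
    n = len(numbers)                       # step 1
    seen = [False] * n                     # step 2
    for x in numbers:                      # step 3: ignore values >= n
        if x < n:
            seen[x] = True
    for i, present in enumerate(seen):     # step 4: first false index
        if not present:
            return i
    return n  # the list was a permutation of 0 .. n-1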
This is how / why the algorithm works.
Suppose that the N numbers in the list are not distinct, or that one or more of them is greater than or equal to N. This means that there must be at least one number in the range 0 .. N - 1 that is not in the list. So the problem of finding the smallest missing number therefore reduces to the problem of finding the smallest missing number less than N. This means that we don't need to keep track of numbers that are greater than or equal to N ... because they won't be the answer.
The alternative to the previous paragraph is that the list is a permutation of the numbers from 0 .. N - 1. In this case, step 3 sets all elements of the array to true, and step 4 tells us that the first "missing" number is N.
The computational complexity of the algorithm is O(N) with a relatively small constant of proportionality. It makes two linear passes through the list, or just one pass if the list length is known up front. There is no need to hold the entire list in memory, so the algorithm's asymptotic memory usage is just what is needed to represent the array of booleans; i.e. O(N) bits.
(By contrast, algorithms that rely on in-memory sorting or partitioning assume that you can represent the entire list in memory. In the form the question was asked, this would require O(N) 64-bit words.)
@Jorn comments that steps 1 through 3 are a variation on counting sort. In a sense he is right, but the differences are significant:
A counting sort requires an array of (at least) Xmax - Xmin counters, where Xmax is the largest number in the list and Xmin is the smallest. Each counter has to be able to represent N states; i.e., assuming a binary representation, it has to have (at least) ceiling(log2(N)) bits.
To determine the array size, a counting sort needs to make an initial pass through the list to determine Xmax and Xmin.
The minimum worst-case space requirement is therefore ceiling(log2(N)) * (Xmax - Xmin) bits.
By contrast, the algorithm presented above simply requires N bits in the worst and best cases.
However, this analysis leads to the intuition that if the algorithm made an initial pass through the list looking for a zero (and counting the list elements if required), it would give a quicker answer using no space at all if it found the zero. It is definitely worth doing this if there is a high probability of finding at least one zero in the list. And this extra pass doesn't change the overall complexity.
EDIT: I've changed the description of the algorithm to use "array of booleans" since people apparently found my original description using bits and bitmaps to be confusing.
Since the OP has now specified that the original list is held in RAM and that the computer has only, say, 1GB of memory, I'm going to go out on a limb and predict that the answer is zero.
1GB of RAM means the list can have at most 134,217,728 numbers in it. But there are 2^64 = 18,446,744,073,709,551,616 possible numbers. So the probability that zero is in the list is 1 in 137,438,953,472.
In contrast, my odds of being struck by lightning this year are 1 in 700,000. And my odds of getting hit by a meteorite are about 1 in 10 trillion. So I'm about ten times more likely to be written up in a scientific journal due to my untimely death by a celestial object than the answer not being zero.
As pointed out in other answers you can do a sort, and then simply scan up until you find a gap.
You can improve the algorithmic complexity to O(N) and keep O(N) space by using a modified QuickSort where you eliminate partitions which are not potential candidates for containing the gap.
On the first partition phase, remove duplicates.
Once the partitioning is complete, look at the number of items in the lower partition. Is this value equal to the value used for creating the partition?
    If so, then it implies that the gap is in the higher partition: continue with the quicksort, ignoring the lower partition.
    Otherwise the gap is in the lower partition: continue with the quicksort, ignoring the higher partition.
This saves a large number of computations.
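Here is a rough Python sketch of the pruning idea (hedged: it is not in-place like a true quicksort variant, and it deduplicates up front; each round discards the half of the value range that is provably gap-free, so the candidates scanned total O(N)):

def smallest_missing(nums):
    lo, hi = 0, len(nums)                    # the answer always lies in [lo, hi]
    candidates = list({x for x in nums if x < hi})   # dedupe, drop values >= N
    while lo < hi:
        mid = (lo + hi) // 2
        lower = [x for x in candidates if x <= mid]
        if len(lower) == mid - lo + 1:
            lo = mid + 1                     # [lo, mid] is full: gap is above
            candidates = [x for x in candidates if x > mid]
        else:
            hi = mid                         # [lo, mid] has a gap
            candidates = lower
    return lo

print(smallest_missing([0, 1, 3]))  # 2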
To illustrate one of the pitfalls of O(N) thinking, here is an O(N) algorithm that uses O(1) space.
for i in [0..2^64):
    if i not in list: return i
print "no 64-bit integers are missing"
Since the numbers are all 64 bits long, we can use radix sort on them, which is O(n). Sort 'em, then scan 'em until you find what you're looking for.
if the smallest number is zero, scan forward until you find a gap. If the smallest number is not zero, the answer is zero.
For a space-efficient method, when all values are distinct, you can do it in O(k) space and O(k * log(N) * N) time. It's space efficient, there's no data moving, and all operations are elementary (adding, subtracting).
set U = N; L=0
First partition the number space in k regions. Like this:
L -> L + (1/k)*(U-L), L + (1/k)*(U-L) -> L + (2/k)*(U-L), ..., L + ((k-1)/k)*(U-L) -> U
Find how many numbers (count{i}) are in each region. (N*k steps)
Find the first region (h) that isn't full. That means count{h} < upper_limit{h}. (k steps)
if h - count{h-1} = 1 you've got your answer
set U = count{h}; L = count{h-1}
goto 2
This can be improved using hashing (thanks to Nic for this idea):
same
First partition the number space in k regions. Like this:
L + (i/k)*(U-L) -> L + ((i+1)/k)*(U-L)
inc count{j} using j = (number - L)/k (if L < number < U)
find first region (h) that doesn't have k elements in it
if count{h} = 1 h is your answer
set U = maximum value in region h L = minimum value in region h
This will run in O(log(N)*N).
I'd just sort them then run through the sequence until I find a gap (including the gap at the start between zero and the first number).
In terms of an algorithm, something like this would do it:
def smallest_not_in_list(list):
    sort(list)
    if list[0] != 0:
        return 0
    for i = 1 to list.last:
        if list[i] != list[i-1] + 1:
            return list[i-1] + 1
    if list[list.last] == 2^64 - 1:
        assert ("No gaps")
    return list[list.last] + 1
Of course, if you have a lot more memory than CPU grunt, you could create a bitmask of all possible 64-bit values and just set the bits for every number in the list. Then look for the first 0-bit in that bitmask. That turns it into an O(n) operation in terms of time but pretty damned expensive in terms of memory requirements :-)
I doubt you could improve on O(n) since I can't see a way of doing it that doesn't involve looking at each number at least once.
The algorithm for that one would be along the lines of:
def smallest_not_in_list(list):
    bitmask = mask_make(2^64)    // might take a while :-)
    mask_clear_all (bitmask)
    for i = 1 to list.last:
        mask_set (bitmask, list[i])
    for i = 0 to 2^64 - 1:
        if mask_is_clear (bitmask, i):
            return i
    assert ("No gaps")
Sort the list, look at the first and second elements, and start going up until there is a gap.
We could use a hash table to hold the numbers. Once all numbers are inserted, run a counter from 0 until we find the lowest number missing. A reasonably good hash will hash and store in constant time, and retrieve in constant time.
for every i in X                      // one scan, n inserts
    hashtable.put(i, i)               // O(1) per insert

low = 0
while (hashtable.get(low) <> null)    // at most n+1 probes
    low++

print low
The worst case is when there are n elements in the array and they are {0, 1, ..., n-1}; in that case, the answer is obtained at n, still keeping it O(n).
You can do it in O(n) time and O(1) additional space, although the hidden factor is quite large. This isn't a practical way to solve the problem, but it might be interesting nonetheless.
For every unsigned 64-bit integer (in ascending order) iterate over the list until you find the target integer or you reach the end of the list. If you reach the end of the list, the target integer is the smallest integer not in the list. If you reach the end of the 64-bit integers, every 64-bit integer is in the list.
Here it is as a Python function:
def smallest_missing_uint64(source_list):
    the_answer = None
    target = 0
    while target < 2**64:
        target_found = False
        for item in source_list:
            if item == target:
                target_found = True
        if not target_found and the_answer is None:
            the_answer = target
        target += 1
    return the_answer
This function is deliberately inefficient to keep it O(n). Note especially that the function keeps checking target integers even after the answer has been found. If the function returned as soon as the answer was found, the number of times the outer loop ran would be bounded by the size of the answer, which is bounded by n. That change would make the run time O(n^2), even though it would be a lot faster.
Thanks to egon, swilden, and Stephen C for my inspiration. First, we know the bounds of the goal value because it cannot be greater than the size of the list. Also, a 1GB list could contain at most 134217728 (128 * 2^20) 64-bit integers.
Hashing part
I propose using hashing to dramatically reduce our search space. First, square root the size of the list. For a 1GB list, that's N = 11,586. Set up an integer array of size N. Iterate through the list, and take the square root* of each number you find as your hash. In your hash table, increment the counter for that hash. Next, iterate through your hash table. The first bucket you find that is not equal to its maximum size defines your new search space.
Bitmap part
Now set up a regular bit map equal to the size of your new search space, and again iterate through the source list, filling out the bitmap as you find each number in your search space. When you're done, the first unset bit in your bitmap will give you your answer.
This will be completed in O(n) time and O(sqrt(n)) space.
(*You could use something like bit shifting to do this a lot more efficiently, and just vary the number and size of buckets accordingly.)
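Here is a rough Python sketch of the two passes (hedged: it buckets by integer division rather than the square-root hash, and it assumes the list values are distinct, so that a bucket with fewer than k hits really does contain a hole):

import math

def smallest_missing(nums):
    n = len(nums)
    k = math.isqrt(n) + 1                   # bucket width and roughly bucket count
    counts = [0] * (k + 1)
    for x in nums:                          # pass 1: count hits per bucket
        if 0 <= x <= n:                     # the answer is at most n
            counts[x // k] += 1
    b = next(i for i, c in enumerate(counts) if c < k)  # first non-full bucket
    seen = [False] * k                      # pass 2: bitmap only that bucket
    for x in nums:
        if b * k <= x < (b + 1) * k:
            seen[x - b * k] = True
    return b * k + seen.index(False)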
Well if there is only one missing number in a list of numbers, the easiest way to find the missing number is to sum the series and subtract each value in the list. The final value is the missing number.
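In code, that special case (exactly one value missing from an otherwise complete run 1..n) is a one-liner; a sketch:

def missing_one(nums, n):
    """nums contains each of 1..n exactly once, except one missing value."""
    return n * (n + 1) // 2 - sum(nums)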
int i = 0;
while (i < Array.Length)
{
    if (Array[i] == i + 1)
    {
        i++;
    }
    if (i < Array.Length)
    {
        if (Array[i] <= Array.Length)
        {   // Swap
            int temp = Array[i];
            int AnoTemp = Array[temp - 1];
            Array[temp - 1] = temp;
            Array[i] = AnoTemp;
        }
        else
            i++;
    }
}

for (int j = 0; j < Array.Length; j++)
{
    if (Array[j] > Array.Length)
    {
        Console.WriteLine(j + 1);
        j = Array.Length;
    }
    else if (j == Array.Length - 1)
        Console.WriteLine("Not Found !!");
}
Here's my answer written in Java:
Basic Idea:
1- Loop through the array, throwing away duplicates, zeros, and negative numbers, while summing up the rest, tracking the maximum positive number, and keeping the unique positive numbers in a Map.
2- Compute the complete sum as max * (max + 1) / 2.
3- Find the difference between the sums calculated in steps 1 and 2.
4- Loop again from 1 to the minimum of [sum difference, max], and return the first number that is not in the map populated in step 1.
public static int solution(int[] A) {
    if (A == null || A.length == 0) {
        throw new IllegalArgumentException();
    }
    int sum = 0;
    Map<Integer, Boolean> uniqueNumbers = new HashMap<Integer, Boolean>();
    int max = A[0];
    for (int i = 0; i < A.length; i++) {
        if (A[i] < 0) {
            continue;
        }
        if (uniqueNumbers.get(A[i]) != null) {
            continue;
        }
        if (A[i] > max) {
            max = A[i];
        }
        uniqueNumbers.put(A[i], true);
        sum += A[i];
    }
    int completeSum = (max * (max + 1)) / 2;
    for (int j = 1; j <= Math.min((completeSum - sum), max); j++) {
        if (uniqueNumbers.get(j) == null) { // O(1)
            return j;
        }
    }
    // All negative case
    if (uniqueNumbers.isEmpty()) {
        return 1;
    }
    // No gap found below max, so the smallest missing positive is max + 1.
    return max + 1;
}
As Stephen C smartly pointed out, the answer must be a number no larger than the length of the array. I would then find the answer by binary search. This optimizes the worst case (so the interviewer can't catch you in a 'what if' pathological scenario). In an interview, do point out you are doing this to optimize for the worst case.
The way to use binary search is to subtract the number you are looking for from each element of the array, and check for negative results.
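One way to flesh this out (a sketch, and only one reading of the suggestion: it assumes the list elements are distinct, and counts elements below a candidate instead of literally subtracting): all of 0..v-1 are present exactly when the count of elements smaller than v equals v, and that predicate is monotone, so we can binary search for the largest v satisfying it.

def smallest_missing(nums):
    lo, hi = 0, len(nums)        # invariant: all of 0..lo-1 are present
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(1 for x in nums if x < mid) == mid:
            lo = mid             # 0..mid-1 are all present: answer is >= mid
        else:
            hi = mid - 1         # something below mid is missing
    return lo

This is O(n log n) time and O(1) extra space, which fits the "don't get caught by a pathological case" goal.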
I like the "guess zero" approach. If the numbers were random, zero is highly probable. If the "examiner" set a non-random list, then add one and guess again:
LowNum = 0
i = 0
do forever {
    if i == N then leave   /* Processed entire array */
    if array[i] == LowNum {
        LowNum++
        i = 0
    }
    else {
        i++
    }
}
display LowNum
The worst case is n*N with n = N, but in practice n is highly likely to be a small number (e.g. 1).
I am not sure if I got the question. But for the list 1,2,3,5,6, where the missing number is 4, the missing number can be found in O(n) by:
(n+2)(n+1)/2-(n+1)n/2
EDIT: sorry, I guess I was thinking too fast last night. Anyway, the second part should actually be replaced by sum(list), which is where the O(n) comes from. The formula reveals the idea behind it: for n sequential integers, the sum should be (n+1)*n/2. If there is a missing number, the sum would be equal to the sum of (n+1) sequential integers minus the missing number.
Thanks for pointing out that I was keeping some middle steps in my head.
Well done Ants Aasma! I thought about the answer for about 15 minutes and independently came up with an answer in a similar vein of thinking to yours:
#define SWAP(x,y) { numerictype_t tmp = x; x = y; y = tmp; }

int minNonNegativeNotInArr (numerictype_t * a, size_t n) {
    int m = n;
    for (int i = 0; i < m;) {
        if (a[i] >= m || a[i] < i || a[i] == a[a[i]]) {
            m--;
            SWAP (a[i], a[m]);
            continue;
        }
        if (a[i] > i) {
            SWAP (a[i], a[a[i]]);
            continue;
        }
        i++;
    }
    return m;
}
m represents "the current maximum possible output given what I know about the first i inputs and assuming nothing else about the values until the entry at m-1".
This value of m will be returned only if (a[i], ..., a[m-1]) is a permutation of the values (i, ..., m-1). Thus if a[i] >= m, or a[i] < i, or a[i] == a[a[i]], we know that m is the wrong output and must be at least one element lower. So, decrementing m and swapping a[i] with a[m], we can recurse.
If this is not true but a[i] > i then knowing that a[i] != a[a[i]] we know that swapping a[i] with a[a[i]] will increase the number of elements in their own place.
Otherwise a[i] must be equal to i in which case we can increment i knowing that all the values of up to and including this index are equal to their index.
The proof that this cannot enter an infinite loop is left as an exercise to the reader. :)
The Dafny fragment from Ants' answer shows why the in-place algorithm may fail. The requires pre-condition describes that the values of each item must not go beyond the bounds of the array.
method AntsAasma(A: array<int>) returns (M: int)
    requires A != null && forall N :: 0 <= N < A.Length ==> 0 <= A[N] < A.Length;
    modifies A;
{
    // Pass 1, move every value to the position of its value
    var N := A.Length;
    var cursor := 0;
    while (cursor < N)
    {
        var target := A[cursor];
        while (0 <= target < N && target != A[target])
        {
            var new_target := A[target];
            A[target] := target;
            target := new_target;
        }
        cursor := cursor + 1;
    }
    // Pass 2, find first location where the index doesn't match the value
    cursor := 0;
    while (cursor < N)
    {
        if (A[cursor] != cursor)
        {
            return cursor;
        }
        cursor := cursor + 1;
    }
    return N;
}
Paste the code into the validator with and without the forall ... clause to see the verification errors. The second error is a result of the verifier not being able to establish a termination condition for the Pass 1 loop. Proving this is left to someone who understands the tool better.
Here's an answer in Java that does not modify the input and uses O(N) time and N bits plus a small constant overhead of memory (where N is the size of the list):
int smallestMissingValue(List<Integer> values) {
    BitSet bitset = new BitSet(values.size() + 1);
    for (int i : values) {
        if (i >= 0 && i <= values.size()) {
            bitset.set(i);
        }
    }
    return bitset.nextClearBit(0);
}
def solution(A):
    A = [x for x in A if x >= 0]
    if len(A) == 0:
        return 1
    maxi = max(A)
    if maxi <= len(A):
        maxi = len(A)
    target = ['X' for x in range(maxi + 1)]
    for number in A:
        target[number] = number
    count = 1
    while count < maxi + 1:
        if target[count] == 'X':
            return count
        count += 1
    return target[count - 1] + 1
Got 100% for the above solution.
1) Filter negatives and zero
2) Sort / distinct
3) Visit the array

Complexity: O(N) or O(N * log(N))

Using Java 8:
public int solution(int[] A) {
    int result = 1;
    boolean found = false;
    A = Arrays.stream(A).filter(x -> x > 0).sorted().distinct().toArray();
    //System.out.println(Arrays.toString(A));
    for (int i = 0; i < A.length; i++) {
        result = i + 1;
        if (result != A[i]) {
            found = true;
            break;
        }
    }
    if (!found && result == A.length) {
        // result is larger than the max element in the array
        result++;
    }
    return result;
}
An unordered_set can be used to store all the positive numbers; then we can iterate from 1 to the size of the unordered_set and return the first number that does not occur.
int firstMissingPositive(vector<int>& nums) {
    unordered_set<int> fre;
    // Store each positive number in a hash set.
    for (int i = 0; i < nums.size(); i += 1) {
        if (nums[i] > 0)
            fre.insert(nums[i]);
    }
    int i = 1;
    // Iterate from 1 to the size of the set, checking
    // for the occurrence of 'i'.
    for (auto it = fre.begin(); it != fre.end(); ++it) {
        if (fre.find(i) == fre.end())
            return i;
        i += 1;
    }
    return i;
}
Solution in basic JavaScript:
var a = [1, 3, 6, 4, 1, 2];
function findSmallest(a) {
    var m = 0;
    for (i = 1; i <= a.length; i++) {
        j = 0; m = 1;
        while (j < a.length) {
            if (i === a[j]) {
                m++;
            }
            j++;
        }
        if (m === 1) {
            return i;
        }
    }
    return a.length + 1; // all of 1..a.length occur, so the answer is the next integer
}
console.log(findSmallest(a))
Hope this helps for someone.
With Python it is not the most efficient, but it is correct:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import datetime
# write your code in Python 3.6
def solution(A):
    MIN = 0
    MAX = 1000000
    possible_results = range(MIN, MAX)
    for i in possible_results:
        next_value = i + 1
        if next_value not in A:
            return next_value
    return 1
test_case_0 = [2, 2, 2]
test_case_1 = [1, 3, 44, 55, 6, 0, 3, 8]
test_case_2 = [-1, -22]
test_case_3 = [x for x in range(-10000, 10000)]
test_case_4 = [x for x in range(0, 100)] + [x for x in range(102, 200)]
test_case_5 = [4, 5, 6]
print("---")
a = datetime.datetime.now()
print(solution(test_case_0))
print(solution(test_case_1))
print(solution(test_case_2))
print(solution(test_case_3))
print(solution(test_case_4))
print(solution(test_case_5))
def solution(A):
    A.sort()
    j = 1
    for i, elem in enumerate(A):
        if j < elem:
            break
        elif j == elem:
            j += 1
            continue
        else:
            continue
    return j
This can help:

0- A is [5, 3, 2, 7];
1- Define B with Length = A.Length; (O(1))
2- Initialize every cell of B with 1; (O(n))
3- For each item in A: if (item < B.Length) then B[item] = -1; (O(n))
4- The answer is the smallest index in B such that B[index] != -1. (O(n))