I want to develop a way to represent all combinations of b bits with k bits set (equal to 1). It needs to work so that, given an index, I can quickly get the corresponding binary sequence, and the other way around too. For instance, the traditional approach I thought of would be to generate the numbers in order, like:
For b=4 and k=2:
0 - 0011
1 - 0101
2 - 0110
3 - 1001
4 - 1010
5 - 1100
If I am given the sequence '1010', I want to be able to quickly produce the number 4 as a response, and if I am given the number 4, I want to be able to quickly produce the sequence '1010'. However, I can't figure out a way to do either of these things without generating all the sequences that come before (or after).
It is not necessary to generate the sequences in that order; you could do 0-1001, 1-0110, 2-0011 and so on, but there must be no repetition between 0 and (b choose k) - 1, and all sequences have to be represented.
How would you approach this? Is there a better algorithm than the one I'm using?
pkpnd's suggestion is on the right track, essentially process one digit at a time and if it's a 1, count the number of options that exist below it via standard combinatorics.
nCr() can be replaced by a table precomputation requiring O(n^2) storage/time. There may be another property you can exploit to reduce the number of nCr's you need to store by leveraging the absorption property along with the standard recursive formula.
Even with 1000's of bits, that table shouldn't be intractably large. Storing the answer also shouldn't be too bad, as 2^1000 is ~300 digits. If you meant hundreds of thousands, then that would be a different question. :)
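For instance, here is a minimal sketch of that precomputation via the standard recursive formula (Pascal's rule); the helper name build_ncr_table is mine, not from the answer:

def build_ncr_table(n_max):
    # Pascal's rule: C(n, r) = C(n-1, r-1) + C(n-1, r); O(n^2) time and storage
    C = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    for n in range(n_max + 1):
        C[n][0] = 1
        for r in range(1, n + 1):
            C[n][r] = C[n - 1][r - 1] + C[n - 1][r]
    return C

The ranking code itself: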
import math

def nCr(n, r):
    return math.factorial(n) // math.factorial(r) // math.factorial(n - r)

def get_index(value):
    b = len(value)                       # bits remaining to the right
    k = sum(c == '1' for c in value)     # ones remaining to place
    count = 0
    for digit in value:
        b -= 1
        if digit == '1':
            if b >= k:
                count += nCr(b, k)       # sequences with a '0' here come first
            k -= 1
    return count

print(get_index('0011'))  # 0
print(get_index('0101'))  # 1
print(get_index('0110'))  # 2
print(get_index('1001'))  # 3
print(get_index('1010'))  # 4
print(get_index('1100'))  # 5
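For the other direction, here is a sketch of the inverse (unranking), not part of the original code but following the same combinatorics: walk the positions from the left and decide each bit by comparing the index against the count of sequences that have a '0' at the current position.

def get_value(index, b, k):
    # inverse of get_index (a sketch): build the sequence bit by bit
    out = []
    for pos in range(b - 1, -1, -1):   # pos = positions remaining after this one
        if k > pos:                    # not enough room left: the rest must be 1s
            out.append('1')
            k -= 1
        elif k and index >= nCr(pos, k):
            index -= nCr(pos, k)       # skip all sequences with '0' here
            out.append('1')
            k -= 1
        else:
            out.append('0')
    return ''.join(out)

print(get_value(4, 4, 2))  # '1010'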
Nice question, btw.
I am trying to answer the question below: you have an array of integers such that each integer is present an odd number of times, except 3 of them (which are present an even number of times). Find the three numbers.
So far I came up with the brute force method:
public static void main(String[] args) {
    int[] number = { 1, 6, 4, 1, 4, 5, 8, 8, 4, 6, 8, 8, 9, 7, 9, 5, 9 };
    FindEvenOccurance findEven = new FindEvenOccurance();
    findEven.getEvenDuplicates(number);
}

// Brute force
private void getEvenDuplicates(int[] number) {
    Map<Integer, Integer> map = new HashMap<Integer, Integer>();
    for (int i : number) {
        if (map.containsKey(i)) {
            // a XOR a XOR a ... (odd times)  = a
            // a XOR a       ... (even times) = 0
            int value = map.get(i) ^ i;
            map.put(i, value);
        } else {
            map.put(i, i);
        }
    }
    for (Entry<Integer, Integer> entry : map.entrySet()) {
        if (entry.getValue() == 0) {
            System.out.println(entry.getKey());
        }
    }
}
It works fine, but it is not efficient.
The output:
1
5
6
8
But the question specifies that we need to do this in O(1) space and O(N) time complexity. For my solution, the time complexity is O(N), but the space is also O(N). Can someone suggest a better way of doing this with O(1) space?
Thanks.
I spent some time solving this problem, and it seems that I found a solution. In any case, I believe the community will help me check the ideas listed below.
First of all, I claim that we can solve this problem when the number of non-paired integers is equal to 1 or 2. In the case of 1 non-paired integer we just need to find the XOR of all array elements, and it'll be the answer. The case of 2 non-paired integers is more complicated, but it was already discussed earlier; for example, you can find it here.
Now let's try to solve problem when the number of non-paired integers is equal to 3.
At the beginning we also calculate XOR of all elements. Let's denote it as X.
Consider the i-th bit in X. I assume that it's equal to 0. If it's equal to 1 the next procedure is practically the same, we just change 0 to 1 and vice versa.
So, if the i-th bit in X is equal to 0, we have two possible situations: either all three non-paired integers have 0 in the i-th bit, or one non-paired integer has 0 in the i-th bit and the two others have 1 in the i-th bit. This statement is based on simple properties of the XOR operation. So we have one or three non-paired integers with 0 in the i-th bit.
Now let's divide all elements into two groups: the first group for integers with 0 in the i-th bit position, the second for integers with 1 in the i-th bit position. The first group contains one or three non-paired integers with '0' in the i-th bit.
How can we determine the exact number of non-paired integers in the first group? We just calculate the XOR of all elements in the second group. If it is equal to zero, then all three non-paired integers are in the first group and we need to try another i. Otherwise, exactly one non-paired integer is in the first group and the two others are in the second, and we can solve the problem separately for the two groups using the methods from the beginning of this answer.
The key observation is that there is an i such that one non-paired integer's i-th bit differs from the i-th bits of the two other non-paired integers; in this case the non-paired integers end up in both groups. This is based on the fact that if there were no such i, the non-paired integers would agree in every bit position and would therefore be equal to each other, which is impossible according to the problem statement.
This solution can be implemented without any additional memory. The total complexity is linear, with a constant factor depending on the number of bits in the array elements.
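To make the base cases concrete, here is a small Python sketch (my illustration, not from the answer) of the two-non-paired-integers case it builds on:

def find_two_odd(nums):
    # XOR of everything leaves x ^ y, where x and y are the two
    # non-paired values; any set bit of it separates x from y.
    xor_all = 0
    for v in nums:
        xor_all ^= v
    low_bit = xor_all & -xor_all       # lowest set bit of x ^ y
    x = y = 0
    for v in nums:
        if v & low_bit:
            x ^= v
        else:
            y ^= v
    return x, y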
Unfortunately it is not possible to achieve such a solution with O(1) space and O(n) complexity if we use a strict sense of space, i.e. O(1) space is bound by the max space used in the input array.
In a weak sense of space, where one arbitrarily large integer still fits into O(1), you can just encode your counter into the bits of this one integer. Start with all bits set to 1. Toggle the n-th bit when you encounter the number n in the input array. All bits remaining 1 at the end represent the 3 numbers that were encountered an even number of times (along with every value that never occurred at all, so you still have to restrict the check to values present in the input).
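A Python sketch of this idea, with two adjustments loudly noted: it starts from all bits zero instead of all ones (equivalent up to inversion), and it checks only values that actually occur, so absent values are not reported as well. It assumes non-negative inputs.

def find_even_occurring(nums):
    bits = 0                           # one arbitrarily large int as the bit field
    for n in nums:
        bits ^= 1 << n                 # toggle bit n on every occurrence
    # a bit that ended at 0 for a value that does occur means an even count
    return sorted({n for n in nums if not (bits >> n) & 1})

print(find_even_occurring([1, 6, 4, 1, 4, 5, 8, 8, 4, 6, 8, 8, 9, 7, 9, 5, 9]))
# [1, 5, 6, 8]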
There are two ways to look at your problem.
The first way, as a mathematical problem with an infinite set of integers, it seems unsolvable.
The second way, as a computing problem with a finite set of integers, you've already solved it (congratulations!). Why? Because storage space is bounded by MAX_INT, independently of N.
NB: an obvious space optimization would be to store each value only once, erasing the entry when its count becomes even; you'd gain half the space.
About the other answers by @Lashane and @SGM1: they also solve the "computing" problem, but are arguably less efficient than yours in most real-world scenarios. Why? Because they pre-allocate a 512 MB array, instead of allocating proportionally to the number of distinct values in the array. As the array is likely to use far fewer than MAX_INT distinct values, you're likely to use much less than 512 MB, even if you store 32 bits for each value instead of 1. And that's with 32-bit integers; with more bits the pre-allocated array grows exponentially, OTOH your solution only depends on the actual values in the array, so it is unaffected by the number of bits of the system (i.e. the max int value).
See also this and this for better (less space) algorithms.
Consider, for example, that the allowed numbers are 4 bits wide, which means the allowed range is 0 to 2^4 - 1, a constant 16 values. For every possible value we run over the whole array and XOR the occurrences of that value; if the result of the XOR is zero, we add the current value to the overall result. This solution is O(16N), which is O(N), and it uses only one extra variable to evaluate the XOR of the current value, so it is O(1) in terms of space complexity.
We can extend this method to our original problem, but it will have a very big constant factor in the run time, on the order of 2 raised to the number of bits allowed in the original input.
We can enhance this approach by first running over all elements to find the most significant bit over all input data; suppose it is the 10th bit, then our run time becomes O(2^10 · N), which is also O(N).
Another enhancement can be found in the image below, but still with the same worst-case complexity as discussed before.
Finally, I believe there may exist a better solution for this problem, but I decided to share my thoughts.
Edit:
the algorithm in the image may not be clear, here is some explanation to the algorithm.
It starts with the idea of dividing the elements according to their bits, in other words using the bits as a filter. At each stage we XOR the elements of a group; if the XOR result is zero, it is worth checking this group one by one, as it is guaranteed to contain at least one of the desired outputs. Also, if two consecutive filters result in the same group size, we stop refining that filter. It will be clearer with the example below.
input: 1,6,4,1,4,5,8,8,4,6,8,8,9,7,9,5,9
we start by dividing the elements according to the Least significant bit.
1st bit zero : 6,4,4,8,8,4,6,8,8
6 xor 4 xor 4 xor 8 xor 8 xor 4 xor 6 xor 8 xor 8 = 4
so we will continue dividing this group according to the 2nd bit.
1st bit zero and 2nd bit zero : 4,4,4,8,8,8,8
4 xor 4 xor 4 xor 8 xor 8 xor 8 xor 8 = 4.
so we will continue dividing this group according to the 3rd bit.
1st bit zero and 2nd bit zero and 3rd bit zero : 8,8,8,8
8 xor 8 xor 8 xor 8 = 0
so we will go through every element under this filter as the result of xor is zero and we will add 8 to our result so far.
1st bit zero and 2nd bit zero and 3rd bit one : 4,4,4
4 xor 4 xor 4 = 4
1st bit zero and 2nd bit zero and 3rd bit one and 4th bit zero : 4,4,4
4 xor 4 xor 4 = 4.
so we will stop here, as this filter contains a group of the same size as the previous filter
now we will go back to the filter of 1st and 2nd bit
1st bit zero and 2nd bit one : 6,6
6 xor 6 = 0.
so we will go through every element under this filter as the result of xor is zero and we will add 6 to our result so far.
now we will go back to the filter of 1st bit
1st bit one : 1,1,5,9,7,9,5,9
now we will continue under this filter with the same procedure as before.
For the complete example, see the image above.
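Here is a rough Python sketch of the filtering idea (my illustration; it uses explicit lists for clarity, so as written it is not O(1) space, and it assumes non-negative inputs):

def even_occurring(nums, bit=0):
    # recursively split by bit `bit`; once a group is homogeneous,
    # its XOR is 0 iff its value occurs an even number of times
    if not nums:
        return []
    if len(set(nums)) == 1:
        x = 0
        for v in nums:
            x ^= v
        return [nums[0]] if x == 0 else []
    zeros = [v for v in nums if not (v >> bit) & 1]
    ones = [v for v in nums if (v >> bit) & 1]
    return even_occurring(zeros, bit + 1) + even_occurring(ones, bit + 1)

print(even_occurring([1, 6, 4, 1, 4, 5, 8, 8, 4, 6, 8, 8, 9, 7, 9, 5, 9]))
# [8, 6, 1, 5] (in filter order)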
Your outline of the problem and the example do not match. You say you're looking for 3 integers in your question, but the example shows 4.
I'm not sure this is possible without additional constraints. It seems to me that the worst-case space complexity will always be at least O(N-6) => O(N) without a sorted list and with the full set of integers.
If we started with a sorted array, then yes, easy, but this constraint is not specified. Sorting the array ourselves would be too expensive in time or space.
My stab at an answer, using Lashane's proposal in a slightly different way:
char[] negBits = new char[268435456]; // 2^28 = 2^31 (count of negative ints) / 8 (bits used per char)
char[] posBits = new char[268435456]; // ditto, except positive
int[] number = { 1, 6, 4, 1, 4, 5, 8, 8, 4, 6, 8, 8, 9, 7, 9, 5, 9 };
for (int num : number) {
    if (num < 0) {
        num = -(num + 1); // Integer.MIN_VALUE would be excluded without this + 1
        negBits[num >> 3] ^= (1 << (num & 0x7)); // grab the right char, toggle the bit for this value
    } else {
        posBits[num >> 3] ^= (1 << (num & 0x7));
    }
}
// Now the hard part: find which values are left after all the toggling.
// (A set bit means the value occurred an odd number of times; to report the
// even-occurring values instead, start from all bits set, as in Lashane's
// answer, and restrict the scan to values present in the input.)
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    if ((negBits[i >> 3] & (1 << (i & 0x7))) != 0) {
        System.out.print(" " + (-i - 1));
    }
    if ((posBits[i >> 3] & (1 << (i & 0x7))) != 0) {
        System.out.print(" " + i);
    }
}
As per the discussion in the comments, the points below are worth noting for this answer:
It assumes Java with 32-bit ints.
Java arrays have an inherent length limit of Integer.MAX_VALUE.
Let's assume we consider binary numbers of length 2n, where n might be around 1000. We are looking for the kth number (k is limited by 10^9) which has the following properties:
The amount of 1's is equal to the amount of 0's, which can be written as: #(1) = #(0).
Every prefix of this number has to contain at least as many 0's as 1's. It might be easier to understand after negating the sentence: there is no prefix which would contain more 1's than 0's.
And basically that's it.
So to make it clear let's do some example:
n=2, k=2
we have to take the binary numbers of length 2n:
0000
0001
0010
0011
0100
0101
0110
0111
1000
and so on...
And now we have to find the 2nd number which fulfills those two requirements. We see that 0011 is the first one, and 0101 is the second one.
If we change to k=3, then the answer doesn't exist: there are other numbers with equal amounts of opposite bits, but 0110, say, has the prefix 011 (more 1's than 0's), so it doesn't fulfill the second constraint, and the same goes for all numbers which have 1 as the most significant bit.
So what have I done so far to find an algorithm?
Well, my first idea was to generate all possible bit settings and check whether each has those two properties, but generating them all would take O(2^(2n)), which is not an option for n=1000.
Additionally, I realized there is no need to check the numbers smaller than 0011 for n=2, 000111 for n=3, and so on: those whose upper half of most significant bits remains "untouched" have no possibility of fulfilling the #(1) = #(0) condition. Using that I can cut the work in half, but it doesn't help much: instead of 2 × forever I have a forever-running algorithm. It's still O(2^n) complexity, which is way too big.
Any idea for algorithm?
Conclusion
This text has been created as a result of my thoughts after reading Andy Jones's post.
First of all, I won't post the code I used, since it's point 6 in the document from Andy's post, Kasa 2009. All you have to do is treat nr there as what I described as k. The Unranking Dyck words algorithm helps us find the answer much faster. However, it has one bottleneck.
while (k >= C(n-i,j))
Considering that n <= 1000, the Catalan numbers can be quite huge, even C(999,999). We could use big-number arithmetic, but on the other hand I came up with a little trick to get around that and use standard integers.
We don't need to know how big a Catalan number actually is, as long as it's bigger than k. So we will now create the Catalan numbers, caching partial sums, in an n x n table.
... ...
5 | 42 ...
4 | 14 42 ...
3 | 5 14 28 ...
2 | 2 5 9 14 ...
1 | 1 2 3 4 5 ...
0 | 1 1 1 1 1 1 ...
---------------------------------- ...
0 1 2 3 4 5 ...
Generating it is quite trivial:
C(x,0) = 1
C(x,y) = C(x,y-1) + C(x-1,y) where y > 0 && y < x
C(x,y) = C(x,y-1) where x == y
So, as we can see, only this:
C(x,y) = C(x,y-1) + C(x-1,y) where y > 0 && y < x
can cause overflow.
Let's stop at this point and provide a definition.
k-flow - not a real integer overflow, but rather the information that the value of C(x,y) is bigger than k.
My idea is to check, after each application of the above formula, whether C(x,y) is greater than k or whether either summand is -1. If so, we store -1 instead, which acts as a marker that k-flow has happened. It's quite obvious that if a k-flowed number is summed with any positive number it stays k-flowed; in particular, the sum of two k-flowed numbers is k-flowed.
The last thing we have to prove is that no real overflow can occur. A real overflow could only happen if we sum a + b where neither of them is k-flowed but their sum really overflows.
That is impossible, since the maximum value can be bounded by a + b <= 2k <= 2·10^9 <= 2,147,483,647, where the last value in this inequality is the maximum value of a signed int. I also assume that int has 32 bits, as in my case.
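A Python sketch of the capped table, using -1 as the k-flow marker and the recurrence quoted above (my illustration):

KFLOW = -1  # marker: the true value of C(x,y) exceeds k

def capped_catalan_table(n, k):
    C = [[0] * (n + 1) for _ in range(n + 1)]
    for x in range(n + 1):
        for y in range(x + 1):
            if y == 0:
                C[x][y] = 1
            else:
                a = C[x][y - 1]
                b = C[x - 1][y] if y < x else 0   # the x == y case drops this term
                if a == KFLOW or b == KFLOW or a + b > k:
                    C[x][y] = KFLOW               # k-flow: the exact value no longer matters
                else:
                    C[x][y] = a + b
    return C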
The numbers you are describing correspond to Dyck words. Pt 2 of Kasa 2009 gives a simple algorithm for enumerating them in lexicographic order. Its references should be helpful if you want to do any further reading.
As an aside (and be warned I'm half asleep as I write this, so it might be wrong), the wikipedia article notes that the number of Dyck words of length 2n is the nth Catalan number, C(n). You might want to find the smallest n such that C(n) is larger than the k you're looking for, and then enumerate Dyck words starting from X^n Y^n.
I'm sorry for misunderstanding this problem last time, so I have edited my answer; now I can promise it is correct, and you can test the code first. The complexity is O(n^2). The detailed answer follows.
First, we can transform the problem into an equivalent one:
We are looking for the kth largest number (k is limited by 10^9) which has the following properties:
The amount of 1's is equal to the amount of 0's, i.e. #(1) = #(0).
Every prefix of this number has to contain at least as many 1's as 0's, which means: there is no prefix which would contain more 0's than 1's.
Let's give an example to explain it: let n=3 and k=4. The amount of satisfying numbers is 5, and the picture below explains what we determine in the previous problem and in the new problem:
| 000111 ------> 111000 ^
| 001011 ------> 110100 |
| 001101 ------> 110010 |
| previous 4th number 010011 ------> 101100 new 4th largest number |
v 010101 ------> 101010 |
so after we solve the new problem, we just need to bitwise-NOT the result.
Now the main problem is how to solve the new one. First, let A be the array, so A[m] (1<=m<=2n) can only be 1 or 0, and let DP[v][q] be the amount of numbers which satisfy condition 2 and have #(1)=q within {A[2n-v+1]~A[2n]} (the last v positions); then DP[2n][n] is the amount of satisfying numbers.
A[1] can only be 1 or 0. If A[1]=1, the amount of such numbers is DP[2n-1][n-1]; if A[1]=0, the amount is DP[2n-1][n]. Now we want to find the kth largest number: if k<=DP[2n-1][n-1], the kth largest number must have A[1]=1, and we can then decide A[2] using DP[2n-2][n-2]; if k>DP[2n-1][n-1], the kth largest number must have A[1]=0, we set k=k-DP[2n-1][n-1], and decide A[2] using DP[2n-2][n-1]. With the same reasoning we can decide A[j] one by one until there is no number left to compare. Now let's walk through an example (n=3, k=4).
(We use dynamic programming to determine the DP matrix; the DP equation is DP[v][q]=DP[v-1][q-1]+DP[v-1][q].)
Note: we need the numbers in the leftmost column to be comparable,
so we add a column on the left of the DP matrix; it is not included in the DP matrix,
and in that column every number is 1.
The numbers enclosed in brackets are initialized by hand;
the initialization simply follows the meaning of the DP matrix.
DP matrix = (1) (0) (0) (0) 4<=DP[5][2]=5 --> A[1]=1
(1) (1) (0) (0) 4>DP[4][1]=3 --> A[2]=0, k=4-3=1
(1) (2) (0) (0) 1<=DP[3][1]=3 --> A[3]=1
(1) (3) 2 (0) 1<=1 --> A[4]=1
(1) (4) 5 (0) no number to compare, A[5]~A[6]=0
(1) (5) 9 5 so the number is 101100
If you have not understood clearly, you can use the code below to understand.
Note: DP[2n][n] increases very fast, so the code only works when n<=19; in the problem n can be up to 1000, so you should use big-number arithmetic, and the code can be optimized with bit operations. The code is just a reference.
/*--------------------------------------------------
Environment: X86 Ubuntu GCC
Author: Cong Yu
Blog: aimager.com
Mail: funcemail@gmail.com
Build_Date: Mon Dec 16 21:52:49 CST 2013
Function:
--------------------------------------------------*/
#include <stdio.h>

int DP[2000][1000];
// kth is the result
int kth[1000];

void Oper(int n, int k){
    int i, j;
    // temp is the count to compare k against
    // jishu is the index of the digit currently being decided
    int temp, jishu = 0;
    // initialize
    for (i = 1; i <= 2*n; i++)
        DP[i-1][0] = i-1;
    for (j = 2; j <= n; j++)
        for (i = 1; i <= 2*j-1; i++)
            DP[i-1][j-1] = 0;
    for (i = 1; i <= 2*n; i++)
        kth[i-1] = 0;
    // fill the DP matrix with dynamic programming
    for (j = 2; j <= n; j++)
        for (i = 2*j; i <= 2*n; i++)
            DP[i-1][j-1] = DP[i-2][j-2] + DP[i-2][j-1];
    // the main thought
    if (k > DP[2*n-1][n-1])
        printf("nothing\n");
    else {
        i = 2*n;
        j = n;
        for (; j >= 1; i--, jishu++){
            if (j == 1)
                temp = 1;
            else
                temp = DP[i-2][j-2];
            if (k <= temp){
                kth[jishu] = 1;
                j--;
            }
            else {
                kth[jishu] = 0;
                if (j == 1)
                    k -= 1;
                else
                    k -= DP[i-2][j-2];
            }
        }
        for (i = 1; i <= 2*n; i++){
            kth[i-1] = 1 - kth[i-1];
            printf("%d", kth[i-1]);
        }
        printf("\n");
    }
}

int main(){
    int n, k;
    scanf("%d", &n);
    scanf("%d", &k);
    Oper(n, k);
    return 0;
}
If there is any number in the range [0 .. 2^64] which cannot be generated by any XOR composition of one or more numbers from a given set, is there an efficient method which prints at least one of the unreachable numbers, or terminates with the information that there are no unreachable numbers?
Does this problem have a name? Is it similar to another problem or do you have any idea, how to solve it?
Each number can be treated as a vector in the vector space (Z/2)^64 over Z/2. You basically want to know if the vectors given span the whole space, and if not, to produce one not spanned (except that the span always includes the zero vector – you'll have to special case this if you really want one or more). This can be accomplished via Gaussian elimination.
Over this particular vector space, Gaussian elimination is pretty simple. Start with an empty set for the basis. Do the following until there are no more numbers. (1) Throw away all of the numbers that are zero. (2) Scan the lowest bits set of the remaining numbers (lowest bit for x is x & ~(x - 1)) and choose one with the lowest order bit set. (3) Put it in the basis. (4) Update all of the other numbers with that same bit set by XORing it with the new basis element. No remaining number has this bit or any lower order bit set, so we terminate after 64 iterations.
At the end, if there are 64 elements, then the subspace is everything. Otherwise, we went fewer than 64 iterations and skipped a bit: the number with only this bit on is not spanned.
To special-case zero: zero is an option if and only if we never throw away a number (i.e., the input vectors are independent).
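A small Python sketch of this elimination (my illustration; width is the bit width, 64 in the question):

def unreachable_value(nums, width=64):
    basis = []
    nums = [x for x in nums if x]                # (1) throw away zeros
    while nums:
        pivot = min(nums, key=lambda x: x & -x)  # (2) lowest "lowest set bit"
        basis.append(pivot)                      # (3) put it in the basis
        low = pivot & -pivot
        # (4) clear that bit from everything with it set; drop resulting zeros
        nums = [x ^ pivot if x & low else x for x in nums]
        nums = [x for x in nums if x]
    if len(basis) == width:
        return None                              # the span is everything
    covered = 0
    for b in basis:
        covered |= b & -b                        # collect the pivot bits
    for i in range(width):
        if not (covered >> i) & 1:
            return 1 << i                        # a skipped bit: this value is not spanned

print(unreachable_value([0b0110, 0b0011, 0b1001, 0b1010], width=4))  # 8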
Example over 4-bit numbers
Start with 0110, 0011, 1001, 1010. Choose 0011 because it has the ones bit set. Basis is now {0011}. Other vectors are {0110, 1010, 1010}; note that the first 1010 = 1001 XOR 0011.
Choose 0110 because it has the twos bit set. Basis is now {0011, 0110}. Other vectors are {1100, 1100}.
Choose 1100. Basis is now {0011, 0110, 1100}. Other vectors are {0000}.
Throw away 0000. We're done. We skipped the high order bit, so 1000 is not in the span.
As rap music points out, you can think of the problem as finding a basis in a vector space. However, it is not necessary to actually solve it completely, just to find whether it is possible or not, and if not, to give an example value (that is, a binary vector) that cannot be described in terms of the supplied set.
This can be done in O(n^2) in terms of the size of the input set. This should be compared to Gaussian elimination, which is O(n^3): http://en.wikipedia.org/wiki/Gaussian_elimination.
64 bits are no problem at all. With the example Python code below, 1000 bits and a set of 1000 random values from 0 to 2^1000-1 take about a second.
Instead of performing Gaussian elimination it's enough to find out whether we can rewrite the matrix of all bits in triangular form, such as (for the 4-bit version):
original      triangular
1110  14      1110  14
1011  11       111   7
 111   7        11   3
  11   3         1   1
   1   1         0   0
The solution works like this: first, all original values with the same most significant bit are placed together in a list of lists. For our example:
[[14,11],[7],[3],[1],[]]
The last empty entry represents that there were no zeros in the original list. Now, take a value from the first entry and replace that entry with a list containing only that number:
[[14],[7],[3],[1],[]]
and then store the XOR of the kept number with each removed entry at the right place in the vector. For our case we have 14^11 = 5, so:
[[14],[7,5],[3],[1],[]]
The trick is that we do not need to scan and update all other values, just the values with the same most significant bit.
Now process the item 7,5 in the same way. Keep 7, add 7^5 = 2 to the list:
[[14],[7],[3,2],[1],[]]
Now 3,2 leaves [3] and adds 1 :
[[14],[7],[3],[1,1],[]]
And 1,1 leaves [1] and adds 0 to the last entry allowing values with no set bit:
[[14],[7],[3],[1],[0]]
If, in the end, the vector contains at least one number at each entry (as in our example), the basis is complete and any number fits.
Here's the complete code:
# return leading bit index, or -1 for 0.
# example 1 -> 0
# example 9 -> 3
def leadbit(v):
    # there are other ways, yes...
    return len(bin(v))-3 if v else -1

def examinebits(baselist, nbitbuckets):
    # index 1 is the least significant bit.
    # index 0 represents the value 0
    bitbuckets = [[] for x in range(nbitbuckets+1)]
    for j in baselist:
        bitbuckets[leadbit(j)+1].append(j)
    for i in reversed(range(len(bitbuckets))):
        if bitbuckets[i]:
            # leave just the first value of all in bucket i
            bitbuckets[i], newb = [bitbuckets[i][0]], bitbuckets[i][1:]
            # distribute the subleading values into their buckets
            for ni in newb:
                q = bitbuckets[i][0] ^ ni
                lb = leadbit(q) + 1
                if lb:
                    bitbuckets[lb].append(q)
                else:
                    bitbuckets[0] = [0]
        else:
            v = 2**(i-1) if i else 0
            print "bit missing: %d. Impossible value: %s == %d" % (i-1, bin(v), v)
            return (bitbuckets, [i])
    return (bitbuckets, [])
Example use: (8 bit)
import random

nbits = 8
basesize = 8
topval = int(2**nbits)
# random set of values to try:
basel = [random.randint(0, topval-1) for dummy in range(basesize)]
bl, ii = examinebits(basel, nbits)
bl,ii=examinebits(basel,nbits)
bl is now the triangular list of values, up to the point where it was not possible (in this case). The missing bit (if any) is found via ii: the missing bit index is ii[0]-1, and the impossible value is 2**(ii[0]-1).
For the following tried set of values: [242, 242, 199, 197, 177, 177, 133, 36] the triangular version is:
base value: 10110001 177
base value: 1110110 118
base value: 100100 36
base value: 10000 16
first missing bit: 3 val: 8
( the below values were not completely processed )
base value: 10 2
base value: 1 1
base value: 0 0
The above list was printed like this:
for i in range(len(bl)):
    bb = bl[len(bl)-i-1]
    if ii and len(bl)-ii[0] == i:
        print "example missing bit:", (ii[0]-1), "val:", 2**(ii[0]-1)
        print "( the below values were not completely processed )"
    if len(bb):
        b = bb[0]
        print ("base value: %"+str(nbits)+"s") % (bin(b)[2:]), b
I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4 × 10^9 × 4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4 × 10^9 distinct integers.
Case 1: we have 1 GB = 10^9 × 8 bits = 8 billion bits of memory.
Solution:
If we use one bit to represent one distinct integer, that is enough; we don't need to sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[1 << 29]; // 2^32 bits / 8 bits per byte (0xffffffff/radix would overflow)

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();                  // assumes non-negative values
        bitfield[n / radix] |= (1 << (n % radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) System.out.println(i * radix + j);
        }
    }
}
Case 2: 10 MB memory = 10 × 10^6 × 8 bits = 80 million bits.
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, and the counters need 2^16 × 4 × 8 = 2 million bits: we build 65536 buckets, and each bucket needs 4 bytes to hold all possibilities, because the worst case is that all 4 billion integers fall into the same bucket.
1. Build the counter of each bucket through a first pass over the file.
2. Scan the buckets and find the first one with fewer than 65536 hits.
3. Build new buckets, for the numbers whose high 16-bit prefix is the one found in step 2, through a second pass over the file.
4. Scan the buckets built in step 3 and find one which doesn't have a hit.
The code is very similar to the one above (see the sketch below).
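A minimal Python sketch of the two passes (my illustration; it assumes non-negative 32-bit values supplied by a re-iterable read_ints()):

def find_missing(read_ints):
    # pass 1: count how many values fall under each 16-bit prefix
    counts = [0] * 65536
    for n in read_ints():
        counts[n >> 16] += 1
    prefix = next(p for p, c in enumerate(counts) if c < 65536)
    # pass 2: bitmap of the low 16 bits seen under the incomplete prefix
    seen = [False] * 65536
    for n in read_ints():
        if n >> 16 == prefix:
            seen[n & 0xFFFF] = True
    return (prefix << 16) | seen.index(False)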
Conclusion:
We decrease memory by increasing the number of passes over the file.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
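A sketch of that candidate-elimination idea in Python (my illustration; read_ints() stands for a single pass over the file):

import random

def probably_missing(read_ints, trials=1000):
    candidates = {random.randrange(2**32) for _ in range(trials)}
    for n in read_ints():
        candidates.discard(n)      # eliminate any candidate that actually occurs
    # per the estimate above, it is all but certain that something survives
    return candidates.pop() if candidates else None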
A detailed discussion of this problem can be found in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range were from 0 to 31, we would divide this into the ranges 0-7, 8-15, 16-23, 24-31 and handle each range in one of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely, given the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10^9 / 2^32). So if you create a bit array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose the numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM, so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one number missing from the full sum (1/2)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice, this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count how many numbers are in the first half (0 to 2^31 - 1) and the second half (2^31 to 2^32 - 1). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say there were two fewer numbers in (2^31, 2^32 - 1); then your next search would count the numbers in the ranges (2^31, 3·2^30 - 1) and (3·2^30, 2^32 - 1).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with 4-byte (32-bit) integers). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each covering 65536 numbers. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2·2^16 - 1 = 131071), bin 2 is (2·2^16 = 131072 to 3·2^16 - 1 = 196607). In Python you'd have something like:
import numpy as np

nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list and count how many ints fall into each of the 2^16 bins, finding an incomplete bin that doesn't have all 65536 numbers. Then read through the 4 billion integer list again, but this time only take notice when integers are in that range, flipping a bit when you find them.
del nums_in_bin  # allow gc to free old 256 kB array
from bitarray import bitarray

my_bit_array = bitarray(65536)  # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print bin_num*65536 + i
        break
Why make it so complicated? You are asked for an integer not present in the file.
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
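A Python sketch of this bisection (my illustration; read_ints() re-reads the file on each call):

def missing_by_bisection(read_ints):
    lo, hi = 0, 4294967295
    while True:
        mid = (lo + hi) // 2
        equal = below = above = 0
        for n in read_ints():
            if n == mid:
                equal += 1
            elif lo <= n < mid:
                below += 1
            elif mid < n <= hi:
                above += 1
        if equal == 0:
            return mid             # the midpoint never occurred
        if below <= above:         # recurse into the emptier half
            hi = mid - 1
        else:
            lo = mid + 1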
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer not to occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on average, but even in this worst case we have a 90% chance of finding a non-occurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator and store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove that list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even in the worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10 -> 99%).
Memory consumption: a few dozen bytes. Complexity: O(n). Overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than in comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only
1 - 4000000000/2³² ≈ 7%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom filter, which can very efficiently determine absolutely that a value is not part of a large set (but can only determine with high probability that it is a member of the set).
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for positive numbers and one for negative numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file and generate a number with (file length + 1) random digits (or just "11111..."). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: you will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since each bit is set an even number of times, the exclusive-or of all the numbers will yield 0. If a single number is missing, the exclusive-or of the remaining numbers will yield a value that, when exclusive-ored with the missing number, results in 0; therefore, the missing number and the resulting exclusive-ored value are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set for n, x, and the number of times each bit is set for n+1, y. The value of y will be y = x * 2, because there are x elements with the (n+1)-th bit set to 0, and x elements with the (n+1)-th bit set to 1. And since 2x is always even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32 bits, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass over the file.
long supplied = 0;
long result = 0;

while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}

return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;

while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}

reset_file_pointer();

while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}

return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing value. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;

while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}

reset_file_pointer();

while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}

// We need to increment n by one so that we account for the missing
// int value
n++;

while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}

return result + offset;
Basically, it re-bases the range around 0. Then it counts the number of unsorted values as it computes the exclusive-or. Then it adds 1 to that count to take care of the missing value (counting the missing one). Then it keeps xoring in the value n, incremented by 1 each time, until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();

while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}

// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
    ulong bitcount = filesize * 8; // number of bits in file
    for (ulong i = 0; i < bitcount; i++)
    {
        Console.Write(9);
    }
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 × 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so it uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^32 numbers. However, it's easily extended to (e.g.) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case the run time is no longer O(n) but instead O(n log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell, answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a number not in the file. Rationale: given ~4G distinct numbers XOR'd together, and ca. 300M not in the file, the number of set bits in each bit position has an equal chance of being odd or even. Thus, all 2^32 numbers are equally likely to arise as the XOR result, of which 93% are already in the file. Note that if the numbers in the file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion-bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but by construction they don't.
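A small-scale Python sketch of the construction (my illustration, using bit strings in place of a 4-billion-line file; it assumes at least as many bit positions as input numbers):

def diagonal_missing(bitstrings):
    width = len(bitstrings)
    padded = [s.zfill(width) for s in bitstrings]
    # flip the i-th bit of the i-th number: the output differs from every input
    return ''.join('1' if padded[i][i] == '0' else '0' for i in range(width))

print(diagonal_missing(['0011', '0101', '1010', '1111']))  # '1000'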
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) be a function which returns true if the file does not contain a certain integer, and false otherwise (by comparing that integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
Also, there is no need to build the tree completely. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might in fact work.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.

foreach (var number in file) {
    bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}

for (var i = 0; i < 4294967296; i++) {
    if (bitArray[i] == 0) {
        return i - 2147483648;
    }
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)

foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}
Bit Counts
Keep track of the bit counts, and use the bits with the lowest counts to generate a value. Again, this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
    long Start, End; // Inclusive.
}

Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try to remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2^(128·10^18) + 1 (which is (2^8)^(16·10^18) + 1) - cannot that be a universal answer for today? It represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size.
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);

# Print four '2's for every byte of the file; that's more decimal
# digits than any number the file itself could hold.
for (c = 0; c < sz; c++) {
    for (b = 0; b < 4; b++) {
        print "2";
    }
}
As Ryan said it basically, sort the file and then go over the integers and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4×10^9) ≈ 4,611,686,018 (roughly 1 in 4.6 billion). You'd be right most of the time!
(Joking... kind of.)