Random Numbers based on the ANU Quantum Random Numbers Server - ruby

I have been asked to use the ANU Quantum Random Numbers Service to create random numbers and use Random.rand only as a fallback.
module QRandom
def next
RestClient.get('http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=1'){ |response, request, result, &block|
case response.code
when 200
_json=JSON.parse(response)
if _json["success"]==true && _json["data"]
_json["data"].first || Random.rand(65535)
else
Random.rand(65535) #fallback
end
else
puts response #log problem
Random.rand(65535) #fallback
end
}
end
end
Their API service gives me a number between 0-65535. In order to create a random for a bigger set, like a random number between 0-99999, I have to do the following:
(QRandom.next.to_f*(99999.to_f/65535)).round
This strikes me as the wrong way of doing, since if I were to use a service (quantum or not) that creates numbers from 0-3 and transpose them into space of 0-9999 I have a choice of 4 numbers that I always get. How can I use the service that produces numbers between 0-65535 to create random numbers for a larger number set?

Since 65535 is 1111111111111111 in binary, you can just think of the random number server as a source of random bits. The fact that it gives the bits to you in chunks of 16 is not important, since you can make multiple requests and you can also ignore certain bits from the response.
So after performing that abstraction, what we have now is a service that gives you a random bit (0 or 1) whenever you want it.
Figure out how many bits of randomness you need. Since you want a number between 0 and 99999, you just need to find a binary number that is all ones and is greater than or equal to 99999. Decimal 99999 is equal to binary 11000011010011111, which is 17 bits long, so you will need 17 bits of randomness.
Now get 17 bits of randomness from the service and assemble them into a binary number. The number will be between 0 and 2**17-1 (131071), and it will be evenly distributed. If the random number happens to be greater than 99999, then throw away the bits you have and try again. (The probability of needing to retry should be less than 50%.)
Eventually you will get a number between 0 and 99999, and this algorithm should give you a totally uniform distribution.

How about asking for more numbers? Using the length parameter of that API you can just ask for extra numbers and sum them so you get bigger numbers like you want.
http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=2
You can use inject for the sum and the modulo operation to make sure the number is not bigger than you want.
json["data"].inject(:+) % MAX_NUMBER
I made some other changes to your code like using SecureRandom instead of the regular Random. You can find the code here:
https://gist.github.com/matugm/bee45bfe637f0abf8f29#file-qrandom-rb

Think of the individual numbers you are getting as 16 bits of randomness. To make larger random numbers, you just need more bits. The tricky bit is figuring out how many bits is enough. For example, if you wanted to generate numbers from an absolutely fair distribution from 0 to 65000, then it should be pretty obvious that 16 bits are not enough; even though you have the range covered, some numbers will have twice the probability of being selected than others.
There are a couple of ways around this problem. Using Ruby's Bignum (technically that happens behind the scenes, it works well in Ruby because you won't overflow your Integer type) it is possible to use a method that simply collects more bits until the result of a division could never be ambiguous - i.e. the difference when adding more significant bits to the division you are doing could never change the result.
This what it might look like, using your QRandom.next method to fetch bits in batches of 16:
def QRandom.rand max
max = max.to_i # This approach requires integers
power = 1
sum = 0
loop do
sum = 2**16 * sum + QRandom.next
power *= 2**16
lower_bound = sum * max / power
break lower_bound if lower_bound == ( (sum + 1) * max ) / power
end
end
Because it costs you quite a bit to fetch random bits from your chosen source, you may benefit from taking this to the most efficient form possible, which is similar in principle to Arithmetic Coding and squeezes out the maximum possible entropy from your source whilst generating unbiased numbers in 0...max. You would need to implement a method QRandom.next_bits( num ) that returned an integer constructed from a bitstream buffer originating with your 16-bit numbers:
def QRandom.rand max
max = max.to_i # This approach requires integers
# I prefer this: start_bits = Math.log2( max ).floor
# But this also works (and avoids suggestions the algo uses FP):
start_bits = max.to_s(2).length
sum = QRandom.next_bits( start_bits )
power = 2 ** start_bits
# No need for fractional bits if max is power of 2
return sum if power == max
# Draw 1 bit at a time to resolve fractional powers of 2
loop do
lower_bound = (sum * max) / power
break lower_bound if lower_bound == ((sum + 1) * max)/ power
sum = 2 * sum + QRandom.next_bits(1) # 0 or 1
power *= 2
end
end
This is the most efficient use of bits from your source possible. It is always as efficient or better than re-try schemes. The expected number of bits used per call to QRandom.rand( max ) is 1 + Math.log2( max ) - i.e. on average this allows you to draw just over the fractional number of bits needed to represent your range.

Related

Early termination of fractional exponent calculation?

I need to write a function that takes the sixth root of something (equivalently, raises something to the 1/6 power), and checks if the answer is an integer. I want this function to be as fast and as optimized as possible, and since this function needs to run a lot, I'm thinking it might be best to not have to calculate the whole root.
How would I write a function (language agnostic, although Python/C/C++ preferred) that returns False (or 0 or something equivalent) before having to compute the entirety of the sixth root? For instance, if I was taking the 6th root of 65, then my function should, upon realizing that that the result is not an int, stop calculating and return False, instead of first computing that the 6th of 65 is 2.00517474515, then checking if 2.00517474515 is an int, and finally returning False.
Of course, I'm asking this question under the impression that it is faster to do the early termination thing than the complete computation, using something like
print(isinstance(num**(1/6), int))
Any help or ideas would be greatly appreciated. I would also be interested in answers that are generalizable to lots of fractional powers, not just x^(1/6).
Here are some ideas of things you can try that might help eliminate non-sixth-powers quickly. For actual sixth powers, you'll still end up eventually needing to compute the sixth root.
Check small cases
If the numbers you're given have a reasonable probability of being small (less than 12 digits, say), you could build a table of small cases and check against that. There are only 100 sixth powers smaller than 10**12. If your inputs will always be larger, then there's little value in this test, but it's still a very cheap test to make.
Eliminate small primes
Any small prime factor must appear with an exponent that's a multiple of 6. To avoid too many trial divisions, you can bundle up some of the small factors.
For example, 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19 * 23 = 223092870, which is small enough to fit in single 30-bit limb in Python, so a single modulo operation with that modulus should be fast.
So given a test number n, compute g = gcd(n, 223092870), and if the result is not 1, check that n is exactly divisible by g ** 6. If not, n is not a sixth power, and you're done. If n is exactly divisible by g**6, repeat with n // g**6.
Check the value modulo 124488 (for example)
If you carried out the previous step, then at this point you have a value that's not divisible by any prime smaller than 25. Now you can do a modulus test with a carefully chosen modulus: for example, any sixth power that's relatively prime to 124488 = 8 * 9 * 7 * 13 * 19 is congruent to one of the six values [1, 15625, 19657, 28729, 48385, 111385] modulo 124488. There are larger moduli that could be used, at the expense of having to check more possible residues.
Check whether it's a square
Any sixth power must be a square. Since Python (at least, Python >= 3.8) has a built-in integer square root function that's reasonably fast, it's efficient to check whether the value is a square before going for computing a full sixth root. (And if it is a square and you've already computed the square root, now you only need to extract a cube root rather than a sixth root.)
Use floating-point arithmetic
If the input is not too large, say 90 digits or smaller, and it's a sixth power then floating-point arithmetic has a reasonable chance of finding the sixth root exactly. However, Python makes no guarantees about the accuracy of a power operation, so it's worth making some additional checks to make sure that the result is within the expected range. For larger inputs, there's less chance of floating-point arithmetic getting the right result. The sixth root of (2**53 + 1)**6 is not exactly representable as a Python float (making the reasonable assumption that Python's float type matches the IEEE 754 binary64 format), and once n gets past 308 digits or so it's too large to fit into a float anyway.
Use integer arithmetic
Once you've exhausted all the cheap tricks, you're left with little choice but to compute the floor of the sixth root, then compare the sixth power of that with the original number.
Here's some Python code that puts together all of the tricks listed above. You should do your own timings targeting your particular use-case, and choose which tricks are worth keeping and which should be adjusted or thrown out. The order of the tricks will also be significant.
from math import gcd, isqrt
# Sixth powers smaller than 10**12.
SMALL_SIXTH_POWERS = {n**6 for n in range(100)}
def is_sixth_power(n):
"""
Determine whether a positive integer n is a sixth power.
Returns True if n is a sixth power, and False otherwise.
"""
# Sanity check (redundant with the small cases check)
if n <= 0:
return n == 0
# Check small cases
if n < 10**12:
return n in SMALL_SIXTH_POWERS
# Try a floating-point check if there's a realistic chance of it working
if n < 10**90:
s = round(n ** (1/6.))
if n == s**6:
return True
elif (s - 1) ** 6 < n < (s + 1)**6:
return False
# No conclusive result; fall through to the next test.
# Eliminate small primes
while True:
g = gcd(n, 223092870)
if g == 1:
break
n, r = divmod(n, g**6)
if r:
return False
# Check modulo small primes (requires that
# n is relatively prime to 124488)
if n % 124488 not in {1, 15625, 19657, 28729, 48385, 111385}:
return False
# Find the square root using math.isqrt, throw out non-squares
s = isqrt(n)
if s**2 != n:
return False
# Compute the floor of the cube root of s
# (which is the same as the floor of the sixth root of n).
# Code stolen from https://stackoverflow.com/a/35276426/270986
a = 1 << (s.bit_length() - 1) // 3 + 1
while True:
d = s//a**2
if a <= d:
return a**3 == s
a = (2*a + d)//3

Generate a unique number out of the combination of 'n' different numbers?

To clarify, as input I have 'n' (n1, n2, n3,...) numbers (integers) such as each number is unique within this set.
I would like to generate a number out of this set (lets call the generated number big 'N') that is also unique, and that allows me to verify that a number 'n1' belongs to the set 'n' just by using 'N'.
is that possible?
Edit:
Thanks for the answers guys, I am looking into them atm. For those requesting an example, here is a simple one:
imagine i have those paths (bi-directional graph) with a random unique value (let's call it identifier):
P1 (N1): A----1----B----2----C----3----D
P2 (N2): A----4----E----5----D
So I want to get the full path (unique path, not all paths) from A knowing N1 and this path as a result should be P1.
Mind you that 1,2,...are just unique numbers in this graph, not weights or distances, I just use them for my heuristic.
If you are dealing with small numbers, no problem. You are doing the same thing with digits every time you compose a number: a digit is a number from 0 to 9 and a full number is a combination of them that:
is itself a number
is unique for given digits
allows you to easily verify if a digit is inside
The gotcha is that the numbers must have an upper limit, like 10 is for digits. Let's say 1000 here for simplicity, the similar composed number could be:
n1*1000^k + n2*1000^(k-1) + n3*1000^(k-2) ... + nk*1000^(0)
So if you have numbers 33, 44 and 27 you will get:
33*1000000 + 44*1000 + 27, and that is number N: 33044027
Of course you can do the same with bigger limits, and binary like 256,1024 or 65535, but it grows big fast.
A better idea, if possible is to convert it into a string (a string is still a number!) with some separator (a number in base 11, that is 10 normal digits + 1 separator digit). This is more flexible as there are no upper limits. Imagine to use digits 0-9 + a separator digit 'a'. You can obtain number 33a44a27 in base 11. By translating this to base 10 or base 16 you can get an ordinary computer number (65451833 if I got it right). Then converting 65451833 to undecimal (base11) 33a44a27, and splitting by digit 'a' you can get the original numbers back to test.
EDIT: A VARIABLE BASE NUMBER?
Of course this would work better digitally in base 17 (16 digits+separator). But I suspect there are more optimal ways, for example if the numbers are unique in the path, the more numbers you add, the less are remaining, the shorter the base could shrink. Can you imagine a number in which the first digit is in base 20, the second in base 19, the third in base 18, and so on? Can this be done? Meh?
In this variating base world (in a 10 nodes graph), path n0-n1-n2-n3-n4-n5-n6-n7-n8-n9 would be
n0*10^0 + (n1*9^1)+(offset:1) + n2*8^2+(offset:18) + n3*7^3+(offset:170)+...
offset1: 10-9=1
offset2: 9*9^1-1*8^2+1=81-64+1=18
offset3: 8*8^2-1*7^3+1=343-512+1=170
If I got it right, in this fiddle: http://jsfiddle.net/Hx5Aq/ the biggest number path would be: 102411
var path="9-8-7-6-5-4-3-2-1-0"; // biggest number
o2=(Math.pow(10,1)-Math.pow(9,1)+1); // offsets so digits do not overlap
o3=(Math.pow(9,2)-Math.pow(8,2)+1);
o4=(Math.pow(8,3)-Math.pow(7,3)+1);
o5=(Math.pow(7,4)-Math.pow(6,4)+1);
o6=(Math.pow(6,5)-Math.pow(5,5)+1);
o7=(Math.pow(5,6)-Math.pow(4,6)+1);
o8=(Math.pow(4,7)-Math.pow(3,7)+1);
o9=(Math.pow(3,8)-Math.pow(2,8)+1);
o10=(Math.pow(2,9)-Math.pow(1,9)+1);
o11=(Math.pow(1,10)-Math.pow(0,10)+1);
var n=path.split("-");
var res;
res=
n[9]*Math.pow(10,0) +
n[8]*Math.pow(9,1) + o2 +
n[7]*Math.pow(8,2) + o3 +
n[6]*Math.pow(7,3) + o4 +
n[5]*Math.pow(6,4) + o5 +
n[4]*Math.pow(5,5) + o6 +
n[3]*Math.pow(4,6) + o7 +
n[2]*Math.pow(3,7) + o8 +
n[1]*Math.pow(2,8) + o9 +
n[0]*Math.pow(1,9) + o10;
alert(res);
So N<=102411 would represent any path of ten nodes? Just a trial. You have to find a way of naming them, for instance if they are 1,2,3,4,5,6... and you use 5 you will have to compact the remaining 1,2,3,4,6->5,7->6... => 1,2,3,4,5,6... (that is revertable and unique if you start from the first)
Theoretically, yes it is.
By defining p_i as the i'th prime number, you can generate N=p_(n1)*p_(n2)*..... Now, all you have to do is to check if N%p_(n) == 0 or not.
However, note that N will grow to huge numbers very fast, so I am not sure this is a very practical solution.
One very practical probabilistic solution is using bloom filters. Note that bloom filters is a set of bits, that can be translated easily to any number N.
Bloom filters have no false negatives (if you said a number is not in the set, it really isn't), but do suffer from false positives with an expected given probability (that is dependent on the size of the sets, number of functions used and number of bits used).
As a side note, to get a result that is 100% accurate, you are going to need at the very least 2^k bits (where k is the range of the elements) to represent the number N by looking at this number as a bitset, where each bit indicates existence or non-existence of a number in the set. You can show that there is no 100% accurate solution that uses less bits (peigeon hole principle). Note that for integers for example with 32 bits, it means you are going to need N with 2^32 bits, which is unpractical.

"interval is empty", Lua math.random isn't working for large numbers?

I didn't know if this is a bug in Lua itself or if I was doing something wrong. I couldn't find anything about it anywhere. I am using Lua for Windows (Lua 5.1.4):
>return math.random(0, 1000000000)
1251258
This returns a random integer between 0 and 10000000000, as expected. This seems to work for all other values. But if I add a single 0:
>return math.random(0, 10000000000)
stdin:1: bad argument #2 to 'random' (interval is empty)
Any number higher than that does the same thing.
I tried to figure out exactly how high a number has to be to cause this and found something even weirder:
>return math.random(0, 2147483647)
-75617745
If the value is 2147483647 then it gives me negative numbers. Any higher than that and it throws an error. Any lower than that and it works fine.
That's 0b1111111111111111111111111111111 in binary, 31 binary digits exactly. I am not sure what that means though.
This unexpected behavior (bug?) is due to how math.random treats the input arguments passed in Lua 5.1. From lmathlib.c:
case 2: { /* lower and upper limits */
int l = luaL_checkint(L, 1);
int u = luaL_checkint(L, 2);
luaL_argcheck(L, l<=u, 2, "interval is empty");
lua_pushnumber(L, floor(r*(u-l+1))+l); /* int between `l' and `u' */
break;
}
As you may know in C, a standard int can represent values -2,147,483,648 to 2,147,483,647. Adding +1 to 2,147,483,647, like in your use-case, will overflow and wrap around the value giving -2,147,483,648. The end result is negative since you're multiplying a positive with a negative number.
Furthermore, anything above 2,147,483,647 will fail the luaL_argcheck due to overflow wraparound.
There are a few ways to address this problem:
Upgrade to Lua 5.2. That one has since fixed this issue by treating the input arguments as lua_Number instead.
Switch to LuaJIT which does not have this integer overflow issue.
Patch the Lua 5.1 source yourself with the fix and recompile.
Modify your random range so it does not overflow.
If you need a range that is larger than what the random function supports (32 bit signed integers or 2^31 due to sign bit, because math.random is at C level), but smaller than the range of Lua "number" type (based on What is the maximum value of a number in Lua?, 2^52, or maybe even 2^53), you could try generating two random numbers: scale the first to the range desired; add the second to "fill the gap". For example, say you want a range of 0 to 2^36. The largest from math.random is 2^31. So you could do:
-- 2^36 = 2^31 * 2^5 so
scale = 2^5
baseRand = scale * math.random(0, 2^31)
-- baseRand is now between 0 and 2^36 but there are gaps of 2^5 in the set
-- of possible values; fill the gaps with second random number:
fillGap = math.random(0, 2^5)
randNum = baseRand + fillGap
This will work as long as the desired range is less than the Lua interpreter's maximum for Lua numbers, which is a configurable compile time parameter but if you use stock build it is 2^52, a very large number (although not as large as largest long integer, 2^63).
Note also that largest positive N-bit integer is 2^N-1 (not 2^N), but the above technique can be applied to any range, you could have for instance scale = 10^6 then randNum = 10^6 * math.random(0, 10^8) + math.random(0, 10^6).

Generating strongly biased random numbers for tests

I want to run tests with randomized inputs and need to generate 'sensible' random
numbers, that is, numbers that match good enough to pass the tested function's
preconditions, but hopefully wreak havoc deeper inside its code.
math.random() (I'm using Lua) produces uniformly distributed random
numbers. Scaling these up will give far more big numbers than small numbers,
and there will be very few integers.
I would like to skew the random numbers (or generate new ones using the old
function as a randomness source) in a way that strongly favors 'simple' numbers,
but will still cover the whole range, i.e., extending up to positive/negative infinity
(or ±1e309 for double). This means:
numbers up to, say, ten should be most common,
integers should be more common than fractions,
numbers ending in 0.5 should be the most common fractions,
followed by 0.25 and 0.75; then 0.125,
and so on.
A different description: Fix a base probability x such that probabilities
will sum to one and define the probability of a number n as xk
where k is the generation in which n is constructed as a surreal
number1. That assigns x to 0, x2 to -1 and +1,
x3 to -2, -1/2, +1/2 and +2, and so on. This
gives a nice description of something close to what I want (it skews a bit too
much), but is near-unusable for computing random numbers. The resulting
distribution is nowhere continuous (it's fractal!), I'm not sure how to
determine the base probability x (I think for infinite precision it would be
zero), and computing numbers based on this by iteration is awfully
slow (spending near-infinite time to construct large numbers).
Does anyone know of a simple approximation that, given a uniformly distributed
randomness source, produces random numbers very roughly distributed as
described above?
I would like to run thousands of randomized tests, quantity/speed is more
important than quality. Still, better numbers mean less inputs get rejected.
Lua has a JIT, so performance is usually not much of an issue. However, jumps based
on randomness will break every prediction, and many calls to math.random()
will be slow, too. This means a closed formula will be better than an
iterative or recursive one.
1 Wikipedia has an article on surreal numbers, with
a nice picture. A surreal number is a pair of two surreal
numbers, i.e. x := {n|m}, and its value is the number in the middle of the
pair, i.e. (for finite numbers) {n|m} = (n+m)/2 (as rational). If one side
of the pair is empty, that's interpreted as increment (or decrement, if right
is empty) by one. If both sides are empty, that's zero. Initially, there are
no numbers, so the only number one can build is 0 := { | }. In generation
two one can build numbers {0| } =: 1 and { |0} =: -1, in three we get
{1| } =: 2, {|1} =: -2, {0|1} =: 1/2 and {-1|0} =: -1/2 (plus some
more complex representations of known numbers, e.g. {-1|1} ? 0). Note that
e.g. 1/3 is never generated by finite numbers because it is an infinite
fraction – the same goes for floats, 1/3 is never represented exactly.
How's this for an algorithm?
Generate a random float in (0, 1) with a library function
Generate a random integral roundoff point according to a desired probability density function (e.g. 0 with probability 0.5, 1 with probability 0.25, 2 with probability 0.125, ...).
'Round' the float by that roundoff point (e.g. floor((float_val << roundoff)+0.5))
Generate a random integral exponent according to another PDF (e.g. 0, 1, 2, 3 with probability 0.1 each, and decreasing thereafter)
Multiply the rounded float by 2exponent.
For a surreal-like decimal expansion, you need a random binary number.
Even bits tell you whether to stop or continue, odd bits tell you whether to go right or left on the tree:
> 0... => 0.0 [50%] Stop
> 100... => -0.5 [<12.5%] Go, Left, Stop
> 110... => 0.5 [<12.5%] Go, Right, Stop
> 11100... => 0.25 [<3.125%] Go, Right, Go, Left, Stop
> 11110... => 0.75 [<3.125%] Go, Right, Go, Right, Stop
> 1110100... => 0.125
> 1110110... => 0.375
> 1111100... => 0.625
> 1111110... => 0.875
One way to quickly generate a random binary number is by looking at the decimal digits in math.random() and replace 0-4 with '1' and 5-9 with '1':
0.8430419054348022
becomes
1000001010001011
which becomes -0.5
0.5513009827118367
becomes
1100001101001011
which becomes 0.25
etc
Haven't done much lua programming, but in Javascript you can do:
Math.random().toString().substring(2).split("").map(
function(digit) { return digit >= "5" ? 1 : 0 }
);
or true binary expansion:
Math.random().toString(2).substring(2)
Not sure which is more genuinely "random" -- you'll need to test it.
You could generate surreal numbers in this way, but most of the results will be decimals in the form a/2^b, with relatively few integers. On Day 3, only 2 integers are produced (-3 and 3) vs. 6 decimals, on Day 4 it is 2 vs. 14, and on Day n it is 2 vs (2^n-2).
If you add two uniform random numbers from math.random(), you get a new distribution which has a "triangle" like distribution (linearly decreasing from the center). Adding 3 or more will get a more 'bell curve' like distribution centered around 0:
math.random() + math.random() + math.random() - 1.5
Dividing by a random number will get a truly wild number:
A/(math.random()+1e-300)
This will return an results between A and (theoretically) A*1e+300,
though my tests show that 50% of the time the results are between A and 2*A
and about 75% of the time between A and 4*A.
Putting them together, we get:
round(6*(math.random()+math.random()+math.random() - 1.5)/(math.random()+1e-300))
This has over 70% of the number returned between -9 and 9 with a few big numbers popping up rarely.
Note that the average and sum of this distribution will tend to diverge towards a large negative or positive number, because the more times you run it, the more likely it is for a small number in the denominator to cause the number to "blow up" to a large number such as 147,967 or -194,137.
See gist for sample code.
Josh
You can immediately calculate the nth born surreal number.
Example, the 1000th Surreal number is:
convert to binary:
1000 dec = 1111101000 bin
1's become pluses and 0's minuses:
1111101000
+++++-+---
The first '1' bit is 0 value, the next set of similar numbers is +1 (for 1's) or -1 (for 0's), then the value is 1/2, 1/4, 1/8, etc for each subsequent bit.
1 1 1 1 1 0 1 0 0 0
+ + + + + - + - - -
0 1 1 1 1 h h h h h
+0+1+1+1+1-1/2+1/4-1/8-1/16-1/32
= 3+17/32
= 113/32
= 3.53125
The binary length in bits of this representation is equal to the day on which that number was born.
Left and right numbers of a surreal number are the binary representation with its tail stripped back to the last 0 or 1 respectively.
Surreal numbers have an even distribution between -1 and 1 where half of the numbers created to a particular day will exist. 1/4 of the numbers exists evenly distributed between -2 to -1 and 1 to 2 and so on. The max range will be negative to positive integers matching the number of days you provide. The numbers go to infinity slowly because each day only adds one to the negative and positive ranges and days contain twice as many numbers as the last.
Edit:
A good name for this bit representation is "sinary"
Negative numbers are transpositions. ex:
100010101001101s -> negative number (always start 10...)
111101010110010s -> positive number (always start 01...)
and we notice that all bits flip accept the first one which is a transposition.
Nan is => 0s (since all other numbers start with 1), which makes it ideal for representation in bit registers in a computer since leading zeros are required (we don't make ternary computer anymore... too bad)
All Conway surreal algebra can be done on these number without needing to convert to binary or decimal.
The sinary format can be seem as a one plus a simple one's counter with a 2's complement decimal representation attached.
Here is an incomplete report on finary (similar to sinary): https://github.com/peawormsworth/tools/blob/master/finary/Fine%20binary.ipynb

Generate an integer that is not among four billion given ones

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×109×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 232 = 4*109 distinct integers.
Case 1: we have 1 GB = 1 * 109 * 8 bits = 8 billion bits memory.
Solution:
If we use one bit representing one distinct integer, it is enough. we don't need sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
Scanner in = new Scanner(new FileReader("a.txt"));
while(in.hasNextInt()){
int n = in.nextInt();
bitfield[n/radix] |= (1 << (n%radix));
}
for(int i = 0; i< bitfield.lenght; i++){
for(int j =0; j<radix; j++){
if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
}
}
}
Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits
Solution:
For all possible 16-bit prefixes, there are 216 number of
integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket.
Build the counter of each bucket through the first pass through the file.
Scan the buckets, find the first one who has less than 65536 hit.
Build new buckets whose high 16-bit prefixes are we found in step2
through second pass of the file
Scan the buckets built in step3, find the first bucket which doesnt
have a hit.
The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
A detailed discussion on this problem has been discussed in Jon Bentley "Column 1. Cracking the Oyster" Programming Pearls Addison-Wesley pp.3-10
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e memory to handle 8 numbers ) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-22 and so on and handle this range in each of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you done, iterate over the bits, find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 232), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 109 / (232) ). So if you create a bit-array of 232 bits with each bit initialized to zero (which will take up 229 bytes ~ 500 MB of RAM; remember a byte = 23 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints; and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed; with overwhelming probability (it differs by 100% by about 10-2592069) you will find a missing int). In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers; you had something like 232 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one # missing to the full sum (½)(232)(232 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method, would be to read through the array and count the number of numbers that are in the first half (0 to 231-1) and second half (231, 232). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were two less number in (231, 232) then your next search would count the numbers in the range (231, 3*230-1), (3*230, 232). Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(232) = 216 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 218 bytes = 256 kB. So bin 0 is (0 to 65535=216-1), bin 1 is (216=65536 to 2*216-1=131071), bin 2 is (2*216=131072 to 3*216-1=196607). In python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
if bin_count < 65536:
break # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list; and count how many ints fall in each of the 216 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 32 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
if N // 65536 == bin_num:
my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
if not bit:
print bin_num*65536 + i
break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probablity for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: neclectable as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 Gb.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10 bitcount - 1, which will always be greater than 2 bitcount. Technically, the number you have to beat is 2 bitcount - (4 * 109 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leafs will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion number = 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible, if two ends leafs have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Psuedocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2128*1018 + 1 ( which is (28)16*1018 + 1 ) - cannot it be a universal answer for today? This represents a number that cannot be held in 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print a '2' for every bit of the file.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan said it basically, sort the file and then go over the integers and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)

Resources