Fast generation of random numbers that appear random - algorithm

I am looking for an efficient way to generate numbers that a human would perceive as being random. Basically, I think of this as avoiding long sequences of 0 or 1 bits. I expect humans to be viewing the bit pattern, and a very low powered cpu should be able to calculate near a thousand of these per second.
There are two different concepts that I can think of to do this, but I am lost finding an efficient way of accomplishing them.
Generate a random number with a fixed number of one bits. For a 32-bit random number, this requires up to 31 random numbers, using the Knuth selection algorithm. Is there a more efficient way to generate a random number with some number of bits set? Unfortunately, 0000FFFF doesn't look very random.
Some form of "part-wise" density seems like it would look better, but I can't come up with a clear way of doing so. I imagine going through each chunk, calculating how far it is from the ideal density, and trying to adjust the bit density of the next chunk to compensate. This sounds complex.
Hopefully there's another algorithm that I haven't thought about for this. Thanks in advance for your help.
[EDIT]
I should be clearer with what I ask -
(a) Is there an efficient way to generate random numbers without "long" runs of a single bit, where "long" is a tunable parameter?
(b) Other suggestions on what would make a number appear to be less-random?

A linear feedback shift register probably does what you want.
Edit in light of an updated question: You should look at a shuffle bag, although I'm not sure how fast this could run. See also this question.
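For reference, here is a minimal Python sketch of a 16-bit Fibonacci LFSR. The tap positions (16, 14, 13, 11) and the seed 0xACE1 are the commonly quoted textbook example, not something taken from the question, and the function names are made up for illustration. Keep in mind that a plain LFSR does not cap run lengths by itself, so it would still need one of the filtering ideas from the other answers.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_bits(seed=0xACE1, count=32):
    """Return `count` output bits; the seed must be non-zero."""
    if seed & 0xFFFF == 0:
        raise ValueError("an all-zero state never changes")
    state = seed & 0xFFFF
    out = []
    for _ in range(count):
        out.append(state & 1)      # the bit about to be shifted out
        state = lfsr16_step(state)
    return out


if __name__ == "__main__":
    print("".join(str(b) for b in lfsr16_bits(count=64)))
```

Each step costs a handful of shifts and XORs, which is why this kind of generator suits a very low powered CPU.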

I don't really know what you mean by bit patterns that "look" random. Is there some algorithm for defining what that is? One way might be to formulate an array consisting of only those numbers which are random enough for your purpose, then, randomly select elements from that array and push them onto the stream. The thing you seem to be trying to do seems bizarre to me and may be doomed to failure though. What happens if you have two 32 bit numbers which taken individually would meet your criteria for apparent randomicity, but when placed side by side make a sufficiently long stream of 0's or 1's to look made up?
Finally, I couldn't resist this.

You need to decide by exactly what rules you decide if something "looks random". Then you take a random number generator that produces enough "real randomness" for your purpose, and every time it generates a number that doesn't look random enough, you throw that number away and generate a new one.
Or you directly produce a sequence of "random" bits and every time the random generator outputs the "wrong" next bit (that would make it look not-random), you just flip that bit.

Here's what I'd do. I'd use a number like 00101011100101100110100101100101 and rotate it by some random amount each time.
But are you sure that a typical pseudo random generator wouldn't do? Have you tried it? You won't get very many long strings of 0s and 1s anyhow.
If you're going to use a library random number and you're worried about too many or too few bits being set, there are cheap ways of counting bits.

Random numbers often have long sequences of 1s and 0s, so I'm not sure I fully understand why you can't use a simple linear congruential generator and shift in or out how ever many bits you need. They're blazing fast, look extremely random to the naked eye, and you can choose coefficients that will yield random integers in whatever positive range you need. If you need 32 "random looking" bits, just generate four random numbers and take the low 8 bits from each.
You don't really need to implement your own at all though, since in most languages the random library already implements one.
If you're determined that you want a particular density of 1s, though, you could always start with a number that has the required number of 1s set
int a = 0x00FF;
then use a bit twiddling hack to implement a bit-level shuffle of the bits in that number.

If you are looking to avoid long runs, how about something simple like:
#include <cmath>
#include <cstdlib>
class generator {
public:
    generator() : last_num(false), run_count(1) { }
    bool next_bit() {
        // The chance of extending the current run halves with every bit already in it.
        const bool flip = rand() > RAND_MAX / std::pow(2.0, run_count);
        // Integer-only alternative: rand() > (RAND_MAX >> run_count)
        if (flip) {
            run_count = 1;
            last_num = !last_num;
        } else {
            ++run_count;
        }
        return last_num;
    }
private:
    bool last_num;
    int run_count;
};
Runs become less likely the longer they go on. You could also use RAND_MAX / (1 + run_count) if you wanted longer runs.

Since you care most about run length, you could generate random run lengths instead of random bits, so as to give them the exact distribution you want.
The mean run length in random binary data is of course 2 (the sum of n/2^n), and the mode is 1. Here are some random bits (I swear this is a single run, I didn't pick a value to make my point):
0111111011111110110001000101111001100000000111001010101101001000
See there's a run length of 8 in there. This is not especially surprising, since run length 8 should occur roughly every 256 bits and I've generated 64 bits.
If this doesn't "look random" to you because of excessive run lengths, then generate run lengths with whatever distribution you want. In pseudocode:
loop
get a random number
output that many 1 bits
get a random number
output that many 0 bits
endloop
You'd probably want to discard some initial data from the stream, or randomise the first bit, to avoid the problem that as it stands, the first bit is always 1. The probability of the Nth bit being 1 depends on how you "get a random number", but for anything that achieves "shortish but not too short" run lengths it will soon be as close to 50% as makes no difference.
For instance "get a random number" might do this:
get a uniformly-distributed random number n from 1 to 81
if n is between 1 and 54, return 1
if n is between 55 and 72, return 2
if n is between 73 and 78, return 3
if n is between 79 and 80, return 4
return 5
The idea is that the probability of a run of length N is one third the probability of a run of length N-1, instead of one half. This will give much shorter average run lengths, and a longest run of 5, and would therefore "look more random" to you. Of course it would not "look random" to anyone used to dealing with sequences of coin tosses, because they'd think the runs were too short. You'd also be able to tell very easily with statistical tests that the value of digit N is correlated with the value of digit N-1.
This code uses at least log2(81) = 6.34 "random bits" to generate on average 121/81 = 1.49 bits of output, so it is slower than just generating uniformly-distributed bits. But it shouldn't be much more than about 7/1.49 = 4.7 times slower, and an LFSR is pretty fast to start with.
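The run-length idea above can be sketched in Python roughly as follows. The thresholds 54, 72, 78 and 80 split 1..81 into groups of 54, 18, 6, 2 and 1, so each length is one third as likely as the previous one. The function names and the use of `random.Random` as the bit source are my own assumptions for illustration; on the real target you would plug in whatever generator you have.

```python
import random


def run_length(rng):
    """Return a run length in 1..5, each one third as likely as the one before."""
    n = rng.randint(1, 81)          # uniform on 1..81 inclusive
    if n <= 54:
        return 1
    if n <= 72:
        return 2
    if n <= 78:
        return 3
    if n <= 80:
        return 4
    return 5


def short_run_bits(count, rng=None):
    """Generate `count` bits by alternating runs of 1s and 0s."""
    rng = rng or random.Random()
    bit = rng.randint(0, 1)         # randomise the first bit
    out = []
    while len(out) < count:
        out.extend([bit] * run_length(rng))
        bit ^= 1                    # the next run uses the other bit value
    return out[:count]


def longest_run(bits):
    """Length of the longest run of equal bits (0 for an empty list)."""
    best = current = 0
    previous = None
    for b in bits:
        current = current + 1 if b == previous else 1
        previous = b
        best = max(best, current)
    return best


if __name__ == "__main__":
    sample = short_run_bits(64)
    print("".join(map(str, sample)), "longest run:", longest_run(sample))
```

Because consecutive runs always alternate, no run in the output can be longer than 5.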

This is how I would examine the number:
const int max_repeated_bits = 4; /* or any other number that you prefer */
int examine_1(unsigned int x) {
for (int i=0; i<max_repeated_bits; ++i) x &= (x << 1);
return x == 0;
}
int examine(unsigned int x) {
return examine_1(x) && examine_1(~x);
}
Then just generate a number x; if examine(x) returns 0, reject it and try again. The probability of getting a 32-bit number with more than 4 equal bits in a row is about 2/3, so you would need about 3 random generator calls per number. However, if you allow longer runs, it gets better. For example, the probability of getting more than 6 bits in a row is only about 20%, so you would need only about 1.25 calls per number.
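A Python rendering of the same check plus the rejection loop might look like this. It is only a sketch: the `width` parameter and both function names are additions of mine, and `random.getrandbits` stands in for whatever generator the target CPU actually has.

```python
import random


def has_long_run(x, max_run, width=32):
    """True if x contains more than max_run equal bits in a row (within `width` bits)."""
    mask = (1 << width) - 1
    for v in (x & mask, ~x & mask):     # check runs of 1s, then runs of 0s
        for _ in range(max_run):
            v &= v << 1                 # each pass shortens every run of 1s by one
        if v:
            return True                 # some run survived max_run passes
    return False


def random_without_long_runs(max_run=4, width=32, rng=random):
    """Draw numbers until one has no run longer than max_run."""
    while True:
        x = rng.getrandbits(width)
        if not has_long_run(x, max_run, width):
            return x


if __name__ == "__main__":
    print(format(random_without_long_runs(), "032b"))
```

Note that this only bounds runs inside one word; two accepted words placed side by side can still join into a run of up to twice max_run.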

There are various variants of linear feedback shift registers, such as shrinking and self-shrinking which modify the output of one LFSR based on the output of another.
The design of these attempts to create random numbers, where the probability of getting two bits the same in a row is 0.5, of getting three in a row is 0.25, and so on.
It should be possible to chain two LFSRs to inhibit or invert the output when a sequence of similar bits occurs: the first LFSR uses a conventional primitive polynomial and feeds its output into the second. The second shift register is shorter and doesn't use a primitive polynomial. Instead it is tapped to invert the output if all its bits are the same, so no run can exceed the size of the second shift register.
Obviously this destroys the randomness of the output - if you have N bits in a row, the next bit is completely predictable. Messing around with using the output of another random source to determine whether or not to invert the output would defeat the second shift register - you wouldn't be able to detect the difference between that and just one random source.

Check out the GSL. I believe it has some functions that do just what you want. They at least are guaranteed to be random bit strings. I'm not sure if they would LOOK random, since that's more of a psychological question.

Can't believe nobody mentioned this:
If you want a longest run (period) of 2N repeats:
PeopleRandom()
{
while(1)
{
Number = randomN_bitNumber();
if(Number && Number != MaxN_BitNumber)
return Number;
}
}
this gives much better results in terms of amount of tosses than using a 32-bit, etc rand
pros:
you only toss values 2/2^N of the time.
larger N give better results.
Since the number of values that do not split the value with a 1 in the middle bit is exactly half, you can go with a larger N than you otherwise would have if you can tolerate a larger largest run less than half the time.

One simple approach would be to generate one bit at a time, with a tuning parameter to control the probability that each new bit matches the previous one. By setting the probability below 0.5, you can generate sequences that are less likely to contain long runs of repeating bits (and you can tune that likelihood). Setting p = 0 gives a repeating 1010101010101010 sequence; setting p = 1 gives a sequence of all 0s or all 1s.
Here is some C# to demonstrate:
double p = 0.3; // 0 <= p <= 1, probability of duplicating a bit
var r = new Random();
int bit = r.Next(2);
for (int i = 0; i < 100; i++)
{
if (r.NextDouble() > p)
{
bit = (bit + 1) % 2;
}
Console.Write(bit);
}
This might well be too slow for your needs, since you need to generate a random double in order to obtain each new random bit. You could, instead, generate a random byte and use each pair of bits to generate the new bit (i.e. if both are zero then keep the same bit, otherwise flip it, if you're happy with the equivalent of a fixed p = 0.25).
Furthermore, it's still possible to get long sequences of repeated bits, you've just lowered the probability of doing so.
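The cheaper variant with pairs of bits (a fixed p = 0.25) could be sketched in Python like this. The function name is made up, and `random.Random` is only a placeholder for the real bit source.

```python
import random


def sticky_bits(count, rng=None):
    """Generate bits where each new bit repeats the previous one with probability 1/4."""
    rng = rng or random.Random()
    bit = rng.getrandbits(1)
    out = []
    while len(out) < count:
        byte = rng.getrandbits(8)           # one random byte drives four output bits
        for shift in (0, 2, 4, 6):
            pair = (byte >> shift) & 0b11
            if pair != 0:                   # 01, 10, 11: flip; 00: keep the same bit
                bit ^= 1
            out.append(bit)
    return out[:count]


if __name__ == "__main__":
    print("".join(map(str, sticky_bits(64))))
```

With this scheme a run of length n has probability proportional to (1/4)^(n-1), so long runs are rare but, as noted above, not impossible.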

Related

Reversibly shuffle a set of a million numbers

I need to issue a series {1, 2, 3, 4 …} of tickets that are (at least seemingly) random numbers {10,934, 3,453,867, 122, 4,386,564 …}. When presented back, I must be able to compute their original index (e.g. 122 → 3.)
In other words, I need a seemingly random permutation p on the interval [1 … N] that has an inverse permutation p^-1. N is about 10^7.
The reasons for that are:
It is a cipher: when receiving a ticket, it should not be easy to guess the tickets that were issued before.
The tickets should be short alphanumeric strings that can be noted down.
I want to avoid recording every ticket issued.
I would use some well-known cipher (e.g., DES) in counter mode.
DES is generally considered fairly broken for normal purposes, but it seems to fit your needs reasonably well, and has a smaller block size than most newer algorithms. For you, that means it produces a smaller result (64 bits, if memory serves). Once you've converted that to readable characters (e.g., base 64) you end up with something like 10 characters or so.
To retrieve the original number, you simply decrypt with your secret key.
Results look quite random--essentially the only known way to sort them back into order would be to break DES, which can be done (has been done) but the resources to do so are quite non-trivial.
If you really do need a lot better security than that, you can use something like AES instead of DES (at the expense of producing a longer "key" value).
1. To generate a pseudo-random shuffle, you could use the Fisher-Yates algorithm:
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
See also: What distribution do you get from this broken random shuffle?
for (int i = tickets.Length - 1; i > 0; i--)
{
    int n = random(i + 1); // must be a uniform integer in [0, i]
    Swap(tickets[i], tickets[n]);
}
Take care not to use the "wrong" variant of the algorithm, which is biased.
This gives you the permutation, from which you can then build the inverse permutation.
2. The problem lies in the randomness of the shuffle.
There are 10000000! permutations, so you would need a very large seed to be able to reach all of them.
The limit is then the random generator: standard ones have about 32 bits of state, perhaps a little more, which is far short of what 10000000! requires.
You should look at something like Fortuna:
https://en.wikipedia.org/wiki/Fortuna_%28PRNG%29
You can generate such sequence using a Linear congruential generator.
X0 is the seed (or the index of the permutation if you wish). m should be equal to N+1. Select c and a to assure full period length (as described in the section 'period length' in the link above). This will give you a one-to-one mapping with size N.
To restore the index, you can crack the LCG using a small number of consecutive pseudo-random numbers from the series, which is not too hard. Of course you can keep m, a and c and save the trouble.
For more secure methods look at David Eisenstat's comment. You'll need only the secret key to restore the index. On the downside, if you'll use a standard FPE, N would have to be 2^x-1 (e.g. 2^128-1).
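To make the keyed-permutation idea concrete, here is a toy Python sketch: a small balanced Feistel network over just enough bits to cover N, with "cycle walking" (re-encrypting until the value lands below N) so the result is a permutation of [0, N). Everything here is an illustrative assumption: the SHA-256 based round function, the four rounds, and the names `ticket` and `index_of`. It is not a vetted cipher, and it uses 0-based indices rather than 1 … N.

```python
import hashlib


def _round(value, key, rnd, half_bits):
    """Toy keyed round function: hash (key, round, value) down to half_bits bits."""
    digest = hashlib.sha256(f"{key}:{rnd}:{value}".encode()).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << half_bits) - 1)


def _feistel(x, key, half_bits, rounds, decrypt=False):
    """Keyed permutation of [0, 2**(2*half_bits)); decrypt=True inverts it."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    order = range(rounds - 1, -1, -1) if decrypt else range(rounds)
    for rnd in order:
        if decrypt:
            left, right = right ^ _round(left, key, rnd, half_bits), left
        else:
            left, right = right, left ^ _round(right, key, rnd, half_bits)
    return (left << half_bits) | right


def _half_bits(n):
    # Smallest half-width whose doubled size covers every value below n.
    return max(1, ((n - 1).bit_length() + 1) // 2)


def ticket(index, n, key, rounds=4):
    """Map index in [0, n) to a scrambled ticket number in [0, n)."""
    x = _feistel(index, key, _half_bits(n), rounds)
    while x >= n:                       # cycle walking: step until back inside [0, n)
        x = _feistel(x, key, _half_bits(n), rounds)
    return x


def index_of(ticket_value, n, key, rounds=4):
    """Inverse of ticket(): recover the original index."""
    x = _feistel(ticket_value, key, _half_bits(n), rounds, decrypt=True)
    while x >= n:
        x = _feistel(x, key, _half_bits(n), rounds, decrypt=True)
    return x


if __name__ == "__main__":
    n = 10_000_000
    for i in range(5):
        t = ticket(i, n, "secret")
        print(i, "->", t, "->", index_of(t, n, "secret"))
```

Only the key has to be stored, not the issued tickets. For N around 10^7 the domain is 2^24, so on average fewer than two Feistel passes are needed per ticket.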

How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog
It is tough to get the error bar on our similarity estimate much
smaller than [7%] because of the way error bars on statistically
sampled values scale — to cut the error bar in half we would need four
times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits. This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity..
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
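The scaling can be checked with a few lines of Python. Each hash function is effectively one yes/no trial that succeeds with probability equal to the true similarity J, so the standard error of the estimate from k functions is sqrt(J(1-J)/k). The sketch below assumes the blog's 7% figure is a two-sigma bar at J = 0.5, which is my guess and not something the post states; the function name is made up.

```python
import math


def minhash_error(k, jaccard=0.5, sigmas=2.0):
    """Approximate error bar of a minhash estimate built from k hash functions."""
    return sigmas * math.sqrt(jaccard * (1.0 - jaccard) / k)


if __name__ == "__main__":
    for k in (12, 50, 100, 200, 800):
        print(f"{k:4d} hashes -> +/- {minhash_error(k):.1%}")
```

Under those assumptions 200 hashes give roughly 7%, 50 give roughly 14%, and 12 give roughly 29%, which matches the "four times as many samples to halve the error" rule.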
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and as to what value min() ends up selecting for.
The rationale for rotation, is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same, do lower bits have any effect in the comparison.. so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together, and thus an excessive tendency for them to end up selecting the same shingle rather than independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to #BillDimm for his input & persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks such as Cormen et al. have very readable information in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is the key you want to hash
a is any number you can choose between 1 and p-1 inclusive
b is any number you can choose between 0 and p-1 inclusive
p is a prime number that is greater than the maximum possible value of x
m is the maximum possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is size of machine word (typically 32)
- M is size of hash code you want in bits
- a is any odd integer that fits in to machine word
- b is any integer less than 2^(w-M)
Above works for hashing a number. To hash a string, get the hash code that you can get using built-in functions like GetHashCode and then use that value in above formula.
For example, let's say you need 200 16-bit hash code for string s, then following code can be written as implementation:
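public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;  // sizeof gives bytes; the formula needs bits (w)
    var hashCodeSize = machineWordSize / 2;  // M
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;  // w - M
    var hstart = unchecked((uint)s.GetHashCode());
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        // a = 2i + 1 is odd and b is below 2^(w-M); unsigned arithmetic wraps modulo 2^w
        var a = (uint)(i * 2 + 1);
        var b = (uint)rnd.Next(0, bmax);
        hashCodes[i] = (int)(unchecked(hstart * a + b) >> hashCodeSizeDiff);
    }
    return hashCodes;
}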
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far higher chance of collision. It can be improved by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within above said ranges.
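Putting the pieces together, a minhash signature built from the ((ax + b) mod p) family might look like this in Python. This is a sketch under assumptions of mine: `zlib.crc32` stands in for the "built-in" string hash (so results are repeatable between runs), p is the Mersenne prime 2^61 - 1, the final "mod m" step is left out because only the minimum matters here, and all function names are invented for the example.

```python
import random
import zlib

P = (1 << 61) - 1   # a prime comfortably larger than any 32-bit key


def make_hash_family(count, seed=0):
    """Pick `count` random (a, b) pairs with 1 <= a < P and 0 <= b < P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]


def minhash_signature(shingles, family):
    """For every (a, b), keep the smallest (a*x + b) mod P over all shingles."""
    keys = [zlib.crc32(s.encode("utf-8")) for s in shingles]
    return [min((a * x + b) % P for x in keys) for a, b in family]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minimum agrees between the two sets."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    family = make_hash_family(200)
    doc_a = {f"shingle{i}" for i in range(0, 100)}
    doc_b = {f"shingle{i}" for i in range(50, 150)}
    sim = estimate_jaccard(minhash_signature(doc_a, family),
                           minhash_signature(doc_b, family))
    print(f"estimated similarity: {sim:.2f} (true value is 1/3)")
```

The base hash of each shingle is computed once; the 200 "hash functions" are just 200 cheap multiply-add steps on top of it.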
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state of the art practical and theoretical bounds. It has this nice graph (below), explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k) where f is the jaccard similarity and k is the number of values kept. So if you want error ε, you need k=1/(fε^2) or if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
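A minimal Python sketch of the "one hash function, keep the k smallest values" approach follows. The choice of BLAKE2b as the single hash is my stand-in (the answer talks about a 2-independent function), and the function names are hypothetical.

```python
import hashlib


def h64(item):
    """One fixed 64-bit hash of a string (BLAKE2b used here as a stand-in)."""
    return int.from_bytes(hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest(), "big")


def bottom_k(items, k):
    """Sketch of a set: its k smallest distinct hash values, in ascending order."""
    return sorted({h64(x) for x in items})[:k]


def bottom_k_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    # The k smallest values of the merged sketches are the k smallest of the union.
    union_k = sorted(set(sketch_a) | set(sketch_b))[:k]
    in_both = set(sketch_a) & set(sketch_b)
    return sum(1 for v in union_k if v in in_both) / len(union_k)


if __name__ == "__main__":
    a = {f"shingle{i}" for i in range(0, 300)}
    b = {f"shingle{i}" for i in range(100, 400)}
    k = 200
    print(bottom_k_jaccard(bottom_k(a, k), bottom_k(b, k), k))  # true value is 0.5
```

Each shingle is hashed only once, which is the main practical saving compared with running k separate hash functions.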
It seems like another way to get N number of good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.

Given a true random number generator which outputs either a 1 or 0 per call, how do you use this to pick a number from an arbitrary range?

If I have a true random number generator (TRNG) which can give me either a 0 or a 1 each time I call it, then it is trivial to generate any number in a range whose size is a power of 2. For example, if I wanted to generate a random number between 0 and 63, I would simply poll the TRNG 6 times, for a maximum value of 111111 and a minimum value of 000000. The problem is when I want a number in a range whose size is not equal to 2^n. Say I wanted to simulate the roll of a die. I would need a range between 1 and 6, with equal weighting. Clearly, I would need three bits to store the result, but polling the TRNG 3 times would introduce two erroneous values. We could simply ignore them, but then that would give one side of the die much lower odds of being rolled.
My question is how one most effectively deals with this.
The easiest way to get a perfectly accurate result is by rejection sampling. For example, generate a random value from 1 to 8 (3 bits), rejecting and generating a new value (3 new bits) whenever you get a 7 or 8. Do this in a loop.
You can get arbitrarily close to accurate just by generating a large number of bits, doing the mod 6, and living with the bias. In cases like 32-bit values mod 6, the bias will be so small that it will be almost impossible to detect, even after simulating millions of rolls.
If you want a number in the range 0 .. R - 1, pick the least n such that R is less than or equal to 2^n. Then generate a random number r in the range 0 .. 2^n - 1 using your method. If it is greater than or equal to R, discard it and generate again. The probability that a generation fails in this manner is less than 1/2, so you will get a number in your desired range in fewer than two attempts on average. This method is balanced and does not impair the randomness of the result in any fashion.
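In Python, that rejection loop could be sketched as follows. The `next_bit` argument is a placeholder for the TRNG call; here it is any zero-argument function returning 0 or 1.

```python
import random


def uniform_below(r, next_bit):
    """Return a uniform integer in [0, r) using only single random bits."""
    n = (r - 1).bit_length()            # least n with 2**n >= r
    while True:
        value = 0
        for _ in range(n):
            value = (value << 1) | next_bit()
        if value < r:                   # accept; otherwise throw all n bits away
            return value


def roll_die(next_bit):
    """Simulate a fair six-sided die: 1..6."""
    return uniform_below(6, next_bit) + 1


if __name__ == "__main__":
    trng = lambda: random.getrandbits(1)    # stand-in for the real bit source
    print([roll_die(trng) for _ in range(10)])
```

For a die, 6 of the 8 three-bit values are accepted, so each roll needs 4 bits on average.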
As you've observed, you can repeatedly double the range of a possible random values through powers of two by concatenating bits, but if you start with an integer number of bits (like zero) then you cannot obtain any range with prime factors other than two.
There are several ways out, none of which are ideal:
1. Simply produce the first reachable range which is larger than what you need, and discard results and start again if the random value falls outside the desired range.
2. Produce a very large range, and distribute that as evenly as possible amongst your desired outputs, and overlook the small bias that you get.
3. Produce a very large range, distribute what you can evenly amongst your desired outputs, and if you hit upon one of the [proportionally] few values which fall outside of the set which distributes evenly, then discard the result and start again.
4. As with 3, but recycle the parts of the value that you did not convert into a result.
The first option isn't always a good idea. Numbers 2 and 3 are pretty common. If your random bits are cheap then 3 is normally the fastest solution with a fairly small chance of repeating often.
For the last one, suppose that you have built a random value r in [0,31], and from that you need to produce a result x in [0,5]. Values of r in [0,29] could be mapped to the required output without any bias using mod 6, while values in [30,31] would have to be dropped on the floor to avoid bias.
In the former case, you produce a valid result x, but there's some more randomness left over -- the difference between the ranges [0,5], [6,11], etc., (five possible values in this case). You can use this to start building your new r for the next random value you'll need to produce.
In the latter case, you don't get any x and are going to have to try again, but you don't have to throw away all of r. The specific value picked from the illegal range [30,31] is left-over and free to be used as a starting value for your next r (two possible values).
The random range you have from that point on needn't be a power of two. That doesn't mean it'll magically reach the range you need at the time, but it does mean you can minimise what you throw away.
The larger you make r, the more bits you may need to throw away if it overflows, but the smaller the chances of that happening. Adding one bit halves your risk but increases the cost only linearly, so it's best to use the largest r you can handle.
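Here is one way the recycling option could look in Python. Treat it as a sketch under my own assumptions: the class name, the 2^32 working range, and the `next_bit` callback are all invented for the example. The state is a value known to be uniform in [0, size), where size does not have to be a power of two.

```python
import random


class BitRecycler:
    """Draw uniform integers from single bits, keeping leftover randomness."""

    def __init__(self, next_bit, target=1 << 32):
        self.next_bit = next_bit
        self.target = target    # refill until the working range is at least this big
        self.value = 0          # always uniform in [0, self.size)
        self.size = 1

    def below(self, r):
        """Return a uniform integer in [0, r); requires r <= target."""
        if not 0 < r <= self.target:
            raise ValueError("r must be between 1 and target")
        while True:
            while self.size < self.target:          # append fresh bits
                self.value = (self.value << 1) | self.next_bit()
                self.size <<= 1
            usable = self.size - self.size % r      # largest multiple of r <= size
            if self.value < usable:
                result = self.value % r
                self.value //= r                    # quotient is still uniform: keep it
                self.size = usable // r
                return result
            # Rejected: the offset into the unusable tail is uniform too, so keep it.
            self.value -= usable
            self.size -= usable


if __name__ == "__main__":
    recycler = BitRecycler(lambda: random.getrandbits(1))
    print([recycler.below(6) + 1 for _ in range(10)])
```

With a 32-bit working range, rejections are extremely rare and each die roll consumes close to log2(6), about 2.6, bits on average instead of 4.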

How to program a function to return values on some sort of probability?

This question occurred to me while I was playing FIFA.
Presumably, they programmed a complex function which includes all the factors like shooting skills, distance, shot power etc. to calculate the probability that the shot hits the target. How would they have programmed something so that the goal happens according to that probability?
In other words, like a function X() has the probability that it return 1 89% and 0 11%. How would I program it so that it returns 1 (approximately) 89 times in 100 trials?
Generate a uniformly-distributed random number between 0 and 1, and return true if the number is less than the desired probability (0.89).
For example, in IPython:
In [13]: from random import random
In [14]: vals = [random() < 0.89 for i in range(10000)]
In [15]: sum(vals)
Out[15]: 8956
In this realisation, 8956 out of the 10000 boolean outcomes are true. If we repeat the experiment, the number will vary around 8900.
That is not how goals are determined in FIFA or other video games. They don't have a function that says, with some probability, the shot makes it or doesn't.
Rather, they simulate a ball actually being kicked into a goal.
The ball will have some speed (based on the "shot power") and some trajectory angle (based on where the player aimed, and some variability based on the character's "shot skill"). Then they allow physics - and the AI of the goalee, if there is one - to take over, and count it as a point only when the ball physically enters the goal.
There is of course still randomness involved, but there is no single variable that decides whether or not a shot will make it.
I'm not 100% sure but one way i would achieve:
Generate a random number (between 0 and 100). If the number is less than 89, return 1; otherwise return 0.
If you have a random number generator, then you would do something like:
bool return_true_89_out_of_100() {
double random_n = rand() / (RAND_MAX + 1.0); // uniform in [0, 1)
return (random_n < 0.89);
}
You can generate a crudely random number by, for example, sampling lower bits of the CPU clock or some mathematical tricks.
You're tagged language agnostic, but the answer depends on what random number function(s) are available to you. Furthermore the accuracy may depend on how close to being truly random your generator is (generally they're not that close).
As to random number functions, there tend to be two kinds -- those which generate a number between 0 and 1, and those that generate a number between m and n. Each can be used to derive a percentage easily.

Programming Pearls: find one integer appears at least twice

It's in the section 2.6 and problem 2, the original problem is like this:
"Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"
My question toward this exercise is that: what is the tricks of the above problem and what kind of general algorithm category this problem is in?
Create a bit array of length 2^32 bits (initialize to zero), that would be about 512MB and will fit into RAM on any modern machine.
Start reading the file, int by int, check bit with the same index as the value of the int, if the bit is set you have found a duplicate, if it is zero, set to one and proceed with the next int from the file.
The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.
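As a rough Python sketch of the bit-array approach (the `bits` parameter is my addition so it can be tried on small inputs; with bits=32 the array really does take 512MB):

```python
def first_duplicate(values, bits=32):
    """Return the first value seen twice, or None; values must be in [0, 2**bits)."""
    seen = bytearray(max(1, (1 << bits) // 8))      # one bit per possible value
    for v in values:
        byte, mask = v >> 3, 1 << (v & 7)
        if seen[byte] & mask:
            return v                                # bit already set: duplicate
        seen[byte] |= mask
    return None


if __name__ == "__main__":
    print(first_duplicate([3, 200, 7, 200, 3], bits=8))
```

This is a single pass over the file and stops as soon as the first repeat shows up.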
The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.
You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:
int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
if ( N >= 1000000 && N <= 2000000 ) {
pigeons++;
}
}
if (pigeons > pigeonholes) {
// one of the duplicates is between 1,000,000 and 2,000,000
// try again with a narrower range
}
Picking how big of range(s) to check vs. how many times you want to read 16GB of data is up to you :)
As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.
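The "try again with a narrower range" step can be turned into a binary search. Below is a Python sketch; `read_values` is a hypothetical callable that re-reads the file from the start each time, and the function assumes there are more values inside [lo, hi] than there are distinct integers in that range. The version in the book shrinks the data on every pass by writing out the crowded half, whereas this simplified form just re-scans everything.

```python
def find_duplicate(read_values, lo, hi):
    """Find a repeated integer in [lo, hi] using only counters, one pass per halving."""
    while lo < hi:
        mid = (lo + hi) // 2
        pigeons = sum(1 for v in read_values() if lo <= v <= mid)
        pigeonholes = mid - lo + 1
        if pigeons > pigeonholes:
            hi = mid            # the lower half is overfull, so it holds a duplicate
        else:
            lo = mid + 1        # otherwise the upper half must be overfull
    return lo


if __name__ == "__main__":
    data = [0, 1, 2, 3, 4, 5, 6, 7, 5]      # nine values, eight possible integers
    print(find_duplicate(lambda: iter(data), 0, 7))
```

For 32-bit integers this needs at most 32 passes over the file and almost no memory.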
If you mean 32-bit positive integers, I think this problem doesn't require any special algorithm or trick to solve. A simple observation leads to the intended solution.
My observation goes like this: the sequential file will contain only 32-bit integers (that is, from 0 to 2^31 - 1). Assume you put all of them in that file uniquely; you will end up with 2^31 lines. If you put those positive integers in once again, you will end up with 2^31 * 2 lines, which is still smaller than 4,300,000,000.
Thus, the answer is the whole range of positive integers from 0 to 2^31 - 1.
Sort the integers and loop through them to see if consecutive integers are duplicates. If you want to do this in memory, it requires 16GB of memory, which is possible with today's machines. If this is not possible, you could sort the numbers using mergesort, storing intermediate arrays on disk.
My first implementation attempt would be to use sort and uniq commands from unix.

Resources