Programming Pearls: find an integer that appears at least twice - algorithm

It's section 2.6, problem 2; the original problem reads:
"Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"
My question about this exercise is: what is the trick behind the problem, and what general algorithm category does it fall into?

Create a bit array of length 2^32 bits (initialized to zero); that is about 512MB and will fit into RAM on any modern machine.
Start reading the file, int by int. For each int, check the bit whose index equals the value of the int: if the bit is already set, you have found a duplicate; if it is zero, set it to one and proceed with the next int from the file.
The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.
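A minimal C++17 sketch of that scan (the value range is a parameter here so the idea can be shown on toy data; a real run over 32-bit ints would pass 1ull << 32, i.e. the full 512MB bitmap, and the function name is mine):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// The bitmap scan described above. std::vector<bool> packs one bit per
// possible value, matching the 512MB figure for a 2^32-entry range.
std::optional<uint32_t> first_duplicate(const std::vector<uint32_t>& file,
                                        uint64_t range) {
    std::vector<bool> seen(range, false);   // one bit per possible value
    for (uint32_t v : file) {
        if (seen[v]) return v;              // bit already set: duplicate
        seen[v] = true;
    }
    return std::nullopt;                    // no duplicate in the file
}
```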

The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.
You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:
int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
    if (N >= 1000000 && N <= 2000000) {
        pigeons++;
    }
}
if (pigeons > pigeonholes) {
    // one of the duplicates is between 1,000,000 and 2,000,000
    // try again with a narrower range
}
Picking how big a range (or ranges) to check vs. how many times you want to read 16GB of data is up to you :)
As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.
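Driving that counting test to completion amounts to a binary search on the value range; a C++ sketch (range bounds as parameters, file modeled as a vector, function name mine; precondition is that [lo, hi] really does hold more values than pigeonholes):

```cpp
#include <cstdint>
#include <vector>

// Binary search on the value range: each pass rescans the whole "file"
// and keeps whichever half holds more values than it has pigeonholes.
uint32_t find_duplicate_by_range(const std::vector<uint32_t>& file,
                                 uint64_t lo, uint64_t hi) {
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        uint64_t pigeons = 0;
        for (uint32_t v : file)
            if (v >= lo && v <= mid) ++pigeons;
        if (pigeons > mid - lo + 1)
            hi = mid;       // too many values for the lower half's holes
        else
            lo = mid + 1;   // the excess must be in the upper half
    }
    return static_cast<uint32_t>(lo);
}
```

Each round halves the candidate range at the cost of one full pass over the file, so 32 passes suffice for 32-bit values with only a few counters in RAM.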

If what you mean is 32-bit positive integers, I don't think this problem requires any special algorithm or trick to solve; a simple observation leads to the intended solution.
My observation goes like this: the sequential file contains only 32-bit positive integers, i.e. values from 0 to 2^31 - 1. If you put each of them in the file once, you end up with 2^31 lines. If you put all of them in once again, you end up with 2^31 * 2 = 2^32 lines, which is still smaller than 4,300,000,000.
Thus, with 4,300,000,000 entries drawn from 0 to 2^31 - 1, some value must appear at least twice.

Sort the integers and loop through them to see whether consecutive integers are duplicates. Doing this in memory requires about 16GB, which is possible with today's machines. If it is not, you could sort the numbers using mergesort, storing intermediate runs to disk.
My first implementation attempt would be to use the sort and uniq commands from Unix.

Related

Radix sort in linear time vs. converting input to proper base

I was thinking about the linear-time sorting problem, which appears in quite a few sources, and which prompts you to sort an array of numbers in the range from 0 to n^3 - 1 in linear time.
One way to do this is to use radix sort, which normally runs in O(wn) where w is the max word size, after observing that any number in that range has word size 3 when written in base n.
And herein lies my question - while it looks ok on paper, in practice converting all the numbers to base n the naive way is going to take quite a lot of time, quite possibly even more than the later sorting itself. Is there any way to convert to base n faster than naively or to somehow trick one's way out of this limitation or do you just have to live with it?
One useful observation is that the runtime of this algorithm is the same if you choose as your base not the number n, but the smallest power of two greater than or equal to n. Let's imagine that that number is 2^k. Now, to read off the base-2^k digits of a number, you can just inspect blocks of k bits in the number, which is doable quite quickly using some bit shifts and logical ANDs. This will likely be fast even if your numbers are stored as variable-length integers, assuming that the variable-length integer uses some nice sort of binary encoding.
There's an ideal k for choosing k-bit fields and using 2^k as the base for radix sort, depending on the size of the array, but it doesn't make a lot of difference, less than 10% for choosing r = 8 (1 read pass + 4 radix sort passes) versus r = 16 (1 read pass + 2 radix sort passes), because r = 8 is more cache friendly. On my system (Intel 2600K, 3.4 GHz), for array size = 2^20, r = 8 is slightly fastest. For array size = 2^24, r = 10.67 (using 10, 11, 11 bit fields) is slightly fastest. For array size = 2^26, r = 16 is slightly fastest.
For signed integers, the sign bit can be toggled during the radix sort.
In your case, the max value of an integer is given, so this would help in choosing the bit field sizes.

using 10 MB of memory for four billion integers (about finding the optimized block size) [duplicate]

This question already has answers here:
Generate an integer that is not among four billion given ones
(38 answers)
Closed 7 years ago.
The problem is, given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file, assume only have 10 MB of memory.
Searching for solutions, I found one that stores the integers in bit-vector blocks (each block representing a specific range of integers within the 4 billion range, each bit in a block representing one integer) and uses a counter per block to count the number of integers falling in that block. If a block's count is less than the block's capacity, its range must contain missing integers, so you scan that block's bit vector to find them.
My confusion with this solution is the claim that the optimal (smallest) memory footprint occurs when the array of block counters occupies the same memory as the bit vector. Why is that the optimal footprint?
Here are calculation details I referred,
Let N = 2^32.
counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N / 32)
Why it is the smallest memory footprint:
In the solution you proposed, there are two phases:
Count number of integers in each block
This uses 4*(#blocks) bytes of memory.
Use a bit-vector each bit representing an integer in the block.
This uses (blocksize/8) bytes of memory, which is (N/blocks)/8.
Setting the two to be equal results in blocks = sqrt(N/32), as you have mentioned.
This is the optimal because the memory required is the maximum of the memory required in each phase (which must both be executed). After the 1st phase, you can forget the counters, except for which block to search in for phase 2.
Optimization
If your counters saturate at their maximum value instead of overflowing, you don't really need 4 bytes per counter, but rather 3 bytes: a counter only needs to count up to the block size, and the block size fits in 3 bytes.
In this case, phase 1 uses 3*blocks of memory, and phase 2 uses (N/blocks)/8. Therefore, the optimal is blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approx 12910, and the block size is 309838 integers per block. This fits in 3 bytes.
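The two phases above can be sketched in C++ with the range and block size as parameters, so it runs on toy data (the real problem would use range = 2^32 and the block count derived above; the function name is mine, and like the scheme described, it assumes every value lies in [0, range)):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Phase 1 counts values per block; phase 2 bit-maps one deficient block.
// Returns -1 if no value in [0, range) is missing.
int64_t find_missing(const std::vector<uint32_t>& file,
                     uint64_t range, uint64_t block_size) {
    uint64_t nblocks = (range + block_size - 1) / block_size;

    // Phase 1: one counter per block.
    std::vector<uint64_t> count(nblocks, 0);
    for (uint32_t v : file) ++count[v / block_size];

    for (uint64_t b = 0; b < nblocks; ++b) {
        uint64_t holes = std::min(block_size, range - b * block_size);
        if (count[b] < holes) {
            // Phase 2: bit vector for the one deficient block.
            std::vector<bool> seen(holes, false);
            for (uint32_t v : file)
                if (v / block_size == b) seen[v - b * block_size] = true;
            for (uint64_t i = 0; i < holes; ++i)
                if (!seen[i]) return static_cast<int64_t>(b * block_size + i);
        }
    }
    return -1;
}
```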
Caveats, and alternative with good average case performance
The algorithm you proposed only works if all input integers are distinct. If they are not, I suggest you simply go with a randomized candidate-set approach: select, say, 1000 candidate integers at random, and check whether any of them is absent from the input file. If all candidates appear, try another random set. While this has poor worst-case performance, it is faster in the average case for most inputs. For example, if the input integers cover 99% of all possible integers, then on average 10 out of 1000 candidates will not be found. You can select the candidate integers pseudo-randomly so that you never repeat a candidate, and so that after a fixed number of tries you are guaranteed to have tested every possible integer.
If each time you check sqrt(N) candidate integers, then the worst-case performance can be as bad as N*sqrt(N), because you might have to scan all N integers sqrt(N) times.
You can avoid the worst case time if you use this alternative, and if it doesn't work for the first set of candidate integers, you switch to your proposed solution. This might give better average case performance (this is a common strategy in sorting where quicksort is used first, before switching to heapsort for example if it appears that the worst case input is present).
# assumes all integers are positive and fit into an int
# very slow, but definitely uses less than 10MB RAM
int generate_unique_integer(file input-file)
{
    int largest = 0
    while (not eof(input-file))
    {
        i = read(integer)
        if (i > largest) largest = i
    }
    return largest + 1  # larger than the largest integer in the input file
}

Understanding assumptions about machine word size in analyzing computer algorithms

I am reading the book Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. In the second chapter, under "Analyzing Algorithms", it is mentioned that:
We also assume a limit on the size of each word of data. For example , when working with inputs of size n , we typically assume that integers are represented by c lg n bits for some constant c>=1 . We require c>=1 so that each word can hold the value of n , enabling us to index the individual input elements , and we restrict c to be a constant so that the word size doesn't grow arbitrarily .( If the word size could grow arbitrarily , we could store huge amounts of data in one word and operate on it all in constant time - clearly an unrealistic scenario.)
My questions are: why do we assume that each integer is represented by c lg n bits, and how does c >= 1 enable us to index the individual input elements?
First, by lg they apparently mean log base 2, so lg n is (roughly) the number of bits in n.
Then, what they are saying is that if they have an algorithm that takes a list of numbers (I am being more specific in my example to make it easier to understand) like 1, 2, 3, ..., n, then they assume that:
a "word" in memory is big enough to hold any of those numbers.
a "word" in memory is not big enough to hold all the numbers (in one single word, packed in somehow).
when calculating the number of "steps" in an algorithm, an operation on one "word" takes one step.
the reason they are doing this is to keep the analysis realistic (you can only store numbers up to some size in "native" types; after that you need to switch to arbitrary precision libraries) without choosing a particular example (like 32 bit integers) that might be inappropriate in some cases, or become outdated.
You need at least lg n bits to represent integers of size n, so that's a lower bound on the word size needed to store inputs of size n. Requiring the constant c >= 1 ensures the word size meets that lower bound; if the multiplier were less than 1, you wouldn't have enough bits to store n.
This is a simplifying step in the RAM model. It allows you to treat each individual input value as though it were accessible in a single slot (or "word") of memory, instead of worrying about complications that might arise otherwise. (Loading, storing, and copying values of different word sizes would take differing amounts of time if we used a model that allowed varying word lengths.) This is what's meant by "enabling us to index the individual input elements." Each input element of the problem is assumed to be accessible at a single address, or index (meaning it fits in one word of memory), simplifying the model.
This question was asked very long ago and the explanations really helped me, but I feel like there could still be a little more clarification about how the lg n came about. For me talking through things really helps:
Let's choose a random number in base 10, like 27; we need 5 bits to store this. Why? Because 27 is 11011 in binary. Notice 11011 has 5 digits, and each 'digit' is what we call a bit, hence 5 bits.
Think of each bit as a slot. In binary, each slot can hold a 0 or a 1. What's the largest number I can store with 5 bits? The largest number fills every slot: 11111.
11111 = 31 = 2^5 - 1, so the largest value 5 bits can store is 2^5 - 1.
Generally (and I will use very explicit names for clarity):
maxNumToStore = 2 ^ numBitsNeeded - 1
Inverting with the logarithm, and rounding down (a partial bit still requires a whole slot, hence the +1), gives:
numBitsNeeded = floor(log2(numToStore)) + 1
Applying this to find how many bits are needed to store the number 31:
floor(log2(31)) + 1 = floor(4.954...) + 1 = 5 bits
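That rule is easy to check in code; a sketch that counts bits by shifting instead of calling a floating-point log (which can misbehave at exact powers of two; the function name is mine):

```cpp
#include <cstdint>

// A positive integer n needs floor(log2(n)) + 1 bits; shifting n right
// until it reaches zero counts exactly that many bits.
int bits_needed(uint64_t n) {
    int bits = 0;
    while (n > 0) {
        n >>= 1;
        ++bits;
    }
    return bits == 0 ? 1 : bits;   // treat 0 as needing one bit
}
```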

Given 4 billion numbers, how to find a number which is not in these 4 billion numbers? We have only 1GB of memory [duplicate]

Given 4 billion numbers, how to find a number which is not in these 4 billion numbers? We have only 1GB of memory.
The numbers can be non-consecutive.
How to do the same in 10MB of memory?
Assuming that this is a routine you are just going to run once, use the amount of memory available as a limiting factor. Load the numbers into an array, up to the amount of memory you have available. Sort the array using your favorite sorting algorithm. Do a binary search to see if the value is present. If it's there, you're done; if not, clear the array and start loading numbers from the file from the place you left off. Repeat the process until you find a match or you reach the end of the file.
For example, if you have 1 GB to work with and the numbers are 4 bytes large (say, a C# int), set the upper array bound at something like 1024 ^ 3 / 4 = 268435456 * i (where i is some value < 1 to make sure we leave a little memory left over for other processes). Fill the array, sort, check, repeat.
If you only have 10 MB to work with, set the upper array bound to 1024 ^ 2 * 10 / 4 = 2621440 * i.
And really, now that I think about it, since the sort has to touch every value anyway, you're better off just scanning the list and leaving out the sort. The sort would be useful, though, if you wanted to save the list in ordered sets (up to the size of your array) for later processing. In that case you would also want to store the size of each array so you could rely on the fact that each consecutive run was sorted.
Well, with the assumption that I can choose a number which is larger than the largest of the 4 billion numbers:
set i = 0
for each number:
    load the number into memory
    set i = max(i, number + 1)
Find the maximum of the set of numbers (O(N) in time, constant space) and the number not in the set is max+1.
That doesn't strike me as a very challenging question. Finding the smallest natural number not in the set might be a better question.
If the question is "There are 4 billion random numbers in a list, now choose a new random number that is not in the list." then I would Merge sort the list to put it in order O(n lg n). Then walk down the list comparing the current element to the next one in order O(n). Something like...
MergeSort(thelist);
for (int i = 0; i < thelist.Length - 1; i++)
{
    if (thelist[i+1] <= thelist[i] + 1)
    {
        // duplicate or consecutive element, no gap here
        continue;
    }
    Console.WriteLine("the number is {0}", thelist[i] + 1);
    break;
}
Considering the broadness of the OP's question, one could venture that there are 4 billion random numbers in there, another random number is chosen, and we need to check whether or not that new number is already in the list.
If that's the case, a simple binary search (given that the numbers are sorted) should suffice, with a comparison time of O(log N).
Go to http://en.wikipedia.org/wiki/Sorting_algorithm.
Pick an algorithm whose space complexity is O(1). Maybe pick an obscure one so you can seem like a real aficionado of algorithms. Or pick the popular one everyone has heard of if you want to seem like a team player.
Memorize the pseudo-code and pretend that you have every sorting algorithm memorized, as this is what you do for fun to show off in the interview.
Profit.

Fast generation of random numbers that appear random

I am looking for an efficient way to generate numbers that a human would perceive as being random. Basically, I think of this as avoiding long sequences of 0 or 1 bits. I expect humans to be viewing the bit pattern, and a very low powered cpu should be able to calculate near a thousand of these per second.
There are two different concepts that I can think of to do this, but I am lost in finding an efficient way of accomplishing them.
Generate a random number with a fixed number of one bits. For a 32-bit random number, this requires up to 31 random numbers, using the Knuth selection algorithm. Is there a more efficient way to generate a random number with some number of bits set? Unfortunately, 0000FFFF doesn't look very random.
Some form of "part-wise' density seems like it'd look better - but I can't come up with a clear way of doing so - I'd imagine going through each chunk, and calculate how far it is from the ideal density, and try to increase the bit density of the next chunk. This sounds complex.
Hopefully there's another algorithm that I haven't thought about for this. Thanks in advance for your help.
[EDIT]
I should be clearer with what I ask -
(a) Is there an efficient way to generate random numbers without "long" runs of a single bit, where "long" is a tunable parameter?
(b) Other suggestions on what would make a number appear to be less-random?
A linear feedback shift register probably does what you want.
Edit in light of an updated question: You should look at a shuffle bag, although I'm not sure how fast this could run. See also this question.
I don't really know what you mean by bit patterns that "look" random. Is there some algorithm for defining what that is? One way might be to build an array consisting only of those numbers which are random enough for your purpose, then randomly select elements from that array and push them onto the stream. The thing you seem to be trying to do seems bizarre to me, though, and may be doomed to failure. What happens if you have two 32-bit numbers which, taken individually, meet your criteria for apparent randomness, but when placed side by side make a sufficiently long run of 0s or 1s to look made up?
Finally, I couldn't resist this.
You need to decide by exactly what rules you decide if something "looks random". Then you take a random number generator that produces enough "real randomness" for your purpose, and every time it generates a number that doesn't look random enough, you throw that number away and generate a new one.
Or you directly produce a sequence of "random" bits and every time the random generator outputs the "wrong" next bit (that would make it look not-random), you just flip that bit.
Here's what I'd do. I'd use a number like 00101011100101100110100101100101 and rotate it by some random amount each time.
But are you sure that a typical pseudo-random generator wouldn't do? Have you tried it? You won't get very many long strings of 0s and 1s anyhow.
If you're going to use a library random number and you're worried about too many or too few bits being set, there are cheap ways of counting bits.
Random numbers often have long sequences of 1s and 0s, so I'm not sure I fully understand why you can't use a simple linear congruential generator and shift in or out how ever many bits you need. They're blazing fast, look extremely random to the naked eye, and you can choose coefficients that will yield random integers in whatever positive range you need. If you need 32 "random looking" bits, just generate four random numbers and take the low 8 bits from each.
You don't really need to implement your own at all though, since in most languages the random library already implements one.
If you're determined that you want a particular density of 1s, though, you could always start with a number that has the required number of 1s set
int a = 0x00FF;
then use a bit twiddling hack to implement a bit-level shuffle of the bits in that number.
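One way to do that bit-level shuffle is a Fisher-Yates shuffle over the 32 bit positions, swapping pairs of bits (a sketch; the function name is mine, and <random> is used in place of an unspecified "bit twiddling hack"):

```cpp
#include <cstdint>
#include <random>

// Start from a word with the desired number of 1s and shuffle its 32 bit
// positions; the popcount of the result equals that of the input.
uint32_t shuffled_bits(uint32_t start, std::mt19937& gen) {
    uint32_t x = start;
    for (int i = 31; i > 0; --i) {
        std::uniform_int_distribution<int> pick(0, i);
        int j = pick(gen);
        uint32_t bi = (x >> i) & 1u, bj = (x >> j) & 1u;
        if (bi != bj)
            x ^= (1u << i) | (1u << j);   // swap the two differing bits
    }
    return x;
}
```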
If you are looking to avoid long runs, how about something simple like:
#include <cstdlib>
#include <cmath>

class generator {
public:
    generator() : last_num(false), run_count(1) { }
    bool next_bit() {
        const bool flip = rand() > RAND_MAX / std::pow(2, run_count);
        // or: rand() > (RAND_MAX >> run_count) ?
        if (flip) {
            run_count = 1;
            last_num = !last_num;
        } else {
            ++run_count;
        }
        return last_num;
    }
private:
    bool last_num;
    int run_count;
};
Runs become less likely the longer they go on. You could also do RAND_MAX / (1 + run_count) if you wanted longer runs.
Since you care most about run length, you could generate random run lengths instead of random bits, so as to give them the exact distribution you want.
The mean run length in random binary data is of course 2 (the sum of n/2^n), and the mode 1. Here are some random bits (I swear this is a single run, I didn't pick a value to make my point):
0111111011111110110001000101111001100000000111001010101101001000
See there's a run length of 8 in there. This is not especially surprising, since run length 8 should occur roughly every 256 bits and I've generated 64 bits.
If this doesn't "look random" to you because of excessive run lengths, then generate run lengths with whatever distribution you want. In pseudocode:
loop
get a random number
output that many 1 bits
get a random number
output that many 0 bits
endloop
You'd probably want to discard some initial data from the stream, or randomise the first bit, to avoid the problem that as it stands, the first bit is always 1. The probability of the Nth bit being 1 depends on how you "get a random number", but for anything that achieves "shortish but not too short" run lengths it will soon be as close to 50% as makes no difference.
For instance "get a random number" might do this:
get a uniformly-distributed random number n from 1 to 81
if n is between 1 and 54, return 1
if n is between 55 and 72, return 2
if n is between 73 and 78, return 3
if n is between 79 and 80, return 4
return 5
The idea is that the probability of a run of length N is one third the probability of a run of length N-1, instead of one half. This will give much shorter average run lengths, and a longest run of 5, and would therefore "look more random" to you. Of course it would not "look random" to anyone used to dealing with sequences of coin tosses, because they'd think the runs were too short. You'd also be able to tell very easily with statistical tests that the value of digit N is correlated with the value of digit N-1.
This code consumes lg(81) ≈ 6.34 "random bits" to generate, on average, a run of about 1.49 bits of output, so it is slower than just generating uniformly-distributed bits. But it shouldn't be much more than about 6.34/1.49 ≈ 4 times slower, and an LFSR is pretty fast to start with.
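The pseudocode and the 1..81 table above might look like this in C++ (function names are mine; rand() is used for brevity and is not a serious source of randomness):

```cpp
#include <cstdlib>
#include <vector>

// A run of length N is one third as likely as length N-1, capping runs at 5.
int random_run_length() {
    int n = rand() % 81 + 1;   // uniformly distributed 1..81
    if (n <= 54) return 1;     // probability 54/81 = 2/3
    if (n <= 72) return 2;     // 18/81
    if (n <= 78) return 3;     // 6/81
    if (n <= 80) return 4;     // 2/81
    return 5;                  // 1/81
}

std::vector<int> run_length_bits(int nbits) {
    std::vector<int> out;
    int bit = rand() % 2;      // randomise the first bit
    while (static_cast<int>(out.size()) < nbits) {
        for (int i = random_run_length();
             i > 0 && static_cast<int>(out.size()) < nbits; --i)
            out.push_back(bit);
        bit = !bit;            // the next run uses the opposite bit
    }
    return out;
}
```

By construction no run in the output can exceed 5 bits, since consecutive runs always use opposite bit values.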
This is how I would examine the number:
const int max_repeated_bits = 4; /* or any other number that you prefer */
int examine_1(unsigned int x) {
for (int i=0; i<max_repeated_bits; ++i) x &= (x << 1);
return x == 0;
}
int examine(unsigned int x) {
return examine_1(x) && examine_1(~x);
}
Then, just generate a number x; if examine(x) returns 0, reject it and try again. The probability of getting a 32-bit number with more than 4 equal bits in a row is about 2/3, so you would need about 3 random generator calls per number. However, if you allow more than 6 bits, it gets better: the probability of more than 6 bits in a row is only about 20%, so you would need only 1.25 calls per number.
There are various variants of linear feedback shift registers, such as shrinking and self-shrinking which modify the output of one LFSR based on the output of another.
The design of these attempts to create random numbers, where the probability of getting two bits the same in a row is 0.5, of getting three in a row is 0.25 as so on.
It should be possible to chain two LFSRs to inhibit or invert the output when a sequence of similar bits occurs - the first LFSR uses a conventional primitive polynomial, and the feed the output of the first into the second. The second shift register is shorter, doesn't have a primitive polynomial. Instead it is tapped to invert the output if all its bits are the same, so no run can exceed the size of the second shift register.
Obviously this destroys the randomness of the output - if you have N bits in a row, the next bit is completely predictable. Messing around with using the output of another random source to determine whether or not to invert the output would defeat the second shift register - you wouldn't be able to detect the difference between that and just one random source.
Check out the GSL. I believe it has some functions that do just what you want. They at least are guaranteed to be random bit strings. I'm not sure if they would LOOK random, since thats more of a psychological question.
Can't believe nobody mentioned this:
If you want the longest possible run to be at most 2N - 2 repeats:
PeopleRandom()
{
    while (1)
    {
        Number = randomN_bitNumber();
        if (Number && Number != MaxN_BitNumber)
            return Number;
    }
}
this gives much better results, in terms of the number of tosses, than using a 32-bit etc. rand
pros:
you only toss values 2/2^N of the time.
larger N gives better results.
Since the number of values that do not split the value with a 1 in the middle bit is exactly half, you can go with a larger N than you otherwise would have, if you can tolerate a larger longest run less than half the time.
One simple approach would be to generate one bit at a time, with a tuning parameter to control the probability that each new bit matches the previous one. By setting the probability below 0.5, you can generate sequences that are less likely to contain long runs of repeating bits (and you can tune that likelihood). Setting p = 0 gives a repeating 1010101010101010 sequence; setting p = 1 gives a sequence of all 0s or all 1s.
Here is some C# to demonstrate:
double p = 0.3; // 0 <= p <= 1, probability of duplicating a bit
var r = new Random();
int bit = r.Next(2);
for (int i = 0; i < 100; i++)
{
if (r.NextDouble() > p)
{
bit = (bit + 1) % 2;
}
Console.Write(bit);
}
This might well be too slow for your needs, since you need to generate a random double in order to obtain each new random bit. You could, instead, generate a random byte and use each pair of bits to generate the new bit (i.e. if both are zero then keep the same bit, otherwise flip it, if you're happy with the equivalent of a fixed p = 0.25).
Furthermore, it's still possible to get long sequences of repeated bits, you've just lowered the probability of doing so.
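The byte-based variant suggested above (one random byte yields four bit-pairs, fixed p = 0.25) could be sketched like this, assuming a 00 pair keeps the current bit (function name and seeding are mine):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Draw one random byte, consume its bits two at a time: a 00 pair
// (probability 1/4) keeps the current bit, anything else flips it.
std::vector<int> correlated_bits(int nbits, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> byte(0, 255);
    std::vector<int> out;
    int bit = byte(gen) & 1;                  // random starting bit
    while (static_cast<int>(out.size()) < nbits) {
        int b = byte(gen);                    // 8 bits = 4 bit-pairs
        for (int i = 0; i < 4 && static_cast<int>(out.size()) < nbits; ++i) {
            if (((b >> (2 * i)) & 3) != 0)    // pair != 00: flip
                bit ^= 1;
            out.push_back(bit);
        }
    }
    return out;
}
```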
