Efficient indexing method for a 2 by 2 matrix

If I fill a 2 by 2 matrix with the numbers 1 to 4, there are 16 possible combinations. What I want to do is store values in an array of size 24 corresponding to each matrix. So given a 2 by 2 matrix, I want an efficient indexing method to index into the array directly. (I don't want to compare all 4 elements for each of the 16 positions.) Something similar to a bit vector? But I am not able to figure out how.
I also want it for a 4 by 4 matrix filled with the numbers 1 to 9.

To clarify: you're looking for an efficient hash function for 2x2 matrices. You want to use the results of the hash function to compare matrices to see if they're the same.
First, let's assume you actually want the numbers 0 to 3 instead of 1 to 4; this makes it easier, and is more computer-sciency. Next, 16 is not right. There are 4! = 24 possible permutations of the numbers 0-3. There are 4^4 = 256 possible strings of length 4 that use a four-letter alphabet (you can repeat already-used numbers).
Either one is trivial to encode into a single byte. Let the lowest 2 bits represent the (0,0) position, the next 2 bits represent (0,1), and so forth. Then, to hash your 2x2 matrix, simply do:
hash = m[0][0] | (m[0][1] << 2) | (m[1][0] << 4) | (m[1][1] << 6);
random example: the number 54 in binary is 00110110 which represents a matrix like:
2 1
3 0
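As a rough illustration of the packing described above, here is a minimal Python sketch. The names encode, decode and table are made up for this example, and the cells are assumed to hold only the values 0 to 3.

```python
def encode(m):
    """Pack a 2x2 matrix of values 0..3 into one byte (two bits per cell)."""
    return m[0][0] | (m[0][1] << 2) | (m[1][0] << 4) | (m[1][1] << 6)


def decode(h):
    """Recover the 2x2 matrix from the packed byte."""
    return [[h & 3, (h >> 2) & 3], [(h >> 4) & 3, (h >> 6) & 3]]


# The packed value can index directly into a 256-entry table.
table = [None] * 256
table[encode([[2, 1], [3, 0]])] = "some value"
```

The same idea should extend to the 4 by 4 case with 4 bits per cell, but that gives a 64-bit key, which is too large for a direct array index and would more likely serve as a hash-map key.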

When you need efficiency, sometimes code clarity goes out the window :)
First you need to be sure you want efficiency: do you have profiling info showing that the simple comparison code is too inefficient for you?
You can simply treat it as an array of bytes of the same size. memcmp does comparisons of arbitrary memory:
A data structure such as:
int matrix[2][2];
is stored the same as:
int matrix[2*2];
which could be dynamically allocated as:
typedef int matrix[2*2];
matrix* m = (matrix*)malloc(sizeof(matrix));
I'm not suggesting you dynamically allocate them; I'm illustrating how the bytes in your original type are actually laid out in memory.
Therefore, the following is valid:
#include <stdlib.h> /* qsort, bsearch */
#include <string.h> /* memcmp */
matrix lookup[16];
int matrix_cmp(const void* a,const void* b) {
    return memcmp(a,b,sizeof(matrix));
}
void init_matrix_lookup() {
    int i;
    for(i=0; i<16; i++) {
        ...
    }
    qsort(lookup,16,sizeof(matrix),matrix_cmp);
}
int matrix_to_lookup(matrix* m) {
    // in this example I'm sorting them so we can bsearch;
    // but with only 16 elements, it's probably not worth the effort,
    // and you could safely just loop over them...
    matrix* hit = (matrix*)bsearch(m,lookup,16,sizeof(matrix),matrix_cmp);
    return hit ? (int)(hit - lookup) : -1; // index into lookup, or -1 if absent
}

Related

How to read all 1's in an array of 1's and 0's spread randomly all over the array

I have an Array with 1 and 0 spread over the array randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I must not lose the exact position (index) of each 1 in the array, so sorting is not a valid option.
So the only option left is linear searching, i.e. O(n). Is there anything better than this?
The main problem with a linear scan is that I need to run the scan X times. So I feel I need some other data structure that maintains this list once the first linear scan has happened, so that I need not run the linear scan again and again.
Let me be clear about final expectations:
I just need to find the number of 1's in a certain range of the array; for example, the number of 1's within the index range 40-100. The range can be arbitrary, and I need to find the count of 1's within it. I can't just compute a single sum, as I would need to iterate over the array over and over again because of the different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid branch mispredictions. Because each value is zero or one, you can use it to advance the index of the array that is used to store the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch misprediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    indices[end] = i; // always write; the slot is only kept if arr[i] is 1
    end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr   :   1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B (both inclusive), you just do:
int num = count[B+1] - count[A];
Obviously you can still use the branch-free version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}
// to compute the ranged sum (both ends inclusive):
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum(count,b,a);
    return count[b+1] - count[a];
}
Well one time linear scanning is fine. Since you are looking for multiple scans across ranges of array I think that can be done in constant time. Here you go:
Scan the array and create a map where the key is the array index (1, 2, 3, 4, 5, 6, ...). The value stored in the map is a tuple <isOne, cumulativeSum>, where isOne records whether there is a 1 at that position and cumulativeSum is the running count of 1's as and when you encounter them.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 (building the map) is O(n).
Step 2 (answering a range query) is O(1).
Total complexity is linear, no doubt, but for your task, where you have to work with ranges several times, the above algorithm seems better if you have ample memory :)
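The two cases above can be folded into one formula by adding isOne of the lower bound. A minimal Python sketch, assuming 1-based positions as in the worked example; the names tuples and ones_in_range are invented for illustration.

```python
array = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]

# One linear pass: pair each element with the running count of ones so far.
tuples = []
running = 0
for bit in array:
    running += bit
    tuples.append((bit, running))


def ones_in_range(tuples, lo, hi):
    """Count ones at 1-based positions lo..hi (inclusive) in O(1)."""
    is_one, cum_lo = tuples[lo - 1]
    _, cum_hi = tuples[hi - 1]
    # Adding is_one covers both cases: +0 when the lower bound holds a 0,
    # +1 when it holds a 1.
    return cum_hi - cum_lo + is_one
```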
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say that you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements, then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between computing a sum (all the partitions fully within the range) and the linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
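A single level of this caching scheme might be sketched as follows in Python. The class name BlockCounter, the default block size of 4 and the method names are assumptions made for this example, not part of any library.

```python
class BlockCounter:
    """Keeps per-block counts of ones so range queries avoid a full scan."""

    def __init__(self, bits, block_size=4):
        self.bits = list(bits)
        self.block_size = block_size
        n_blocks = (len(self.bits) + block_size - 1) // block_size
        self.block_ones = [0] * n_blocks
        for i, bit in enumerate(self.bits):
            self.block_ones[i // block_size] += bit

    def update(self, i, value):
        """Set bits[i] to 0 or 1 and adjust the cached block count."""
        self.block_ones[i // self.block_size] += value - self.bits[i]
        self.bits[i] = value

    def count(self, lo, hi):
        """Number of ones in bits[lo..hi], both ends inclusive."""
        total = 0
        i = lo
        while i <= hi:
            at_block_start = i % self.block_size == 0
            if at_block_start and i + self.block_size - 1 <= hi:
                total += self.block_ones[i // self.block_size]  # whole block
                i += self.block_size
            else:
                total += self.bits[i]  # partial block: scan one element
                i += 1
        return total
```

A hierarchical version would add coarser levels on top of block_ones in the same way; the block size is a tuning parameter that trades update cost against query cost.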
Are you using C (or a derived language)? If so, can you control the encoding of your array? If you can, you could, for example, use a bitmap to count. The nice thing about a bitmap is that you can use a lookup table to sum the counts. If your subrange ends aren't divisible by 8, you'll have to deal with the partial bytes at the ends specially, but the speedup will be significant.
If that's not the case, can you at least encode them as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi index swaths of zeros).
So for:
u8 input[] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time, use with short ranges,
    // use the function below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}
uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    // note: the u64 cast assumes input + start is suitably aligned
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the leftover bytes after the last whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks) // skip chunks that are entirely zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, test in a single comparison whether ALL of the values in the slice are zero, and if so skip the whole block. The more zeros, the better it will work. In the case where your large cast integer always contains at least one 1, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1 and 8 helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

Finding if a random number has occurred before or not

Let me be clear at start that this is a contrived example and not a real world problem.
If I have the problem of creating a random number between 0 and 10, I do this 11 times, making sure that a previously drawn number is not drawn again. If I get a repeated number, I create another random number to make sure it has not been seen earlier. So essentially I get a sequence of unique numbers from 0 to 10 in a random order,
e.g. 3 1 2 0 5 9 4 8 10 6 7 and so on
Now to come up with logic to make sure that the random numbers are unique and not one which we have drawn before, we could use many approaches
Use C++ std::bitset and set the bit corresponding to the index equal to value of each random no. and check it next time when a new random number is drawn.
Or
Use a std::map<int,int> to count the number of times or even simple C array with some sentinel values stored in that array to indicate if that number has occurred or not.
If I have to avoid the methods above and use some mathematical/logical/bitwise operation to find whether a random number has been drawn before or not, is there a way?
You don't want to do it the way you suggest. Consider what happens when you have already selected 10 of the 11 items; your random number generator will cycle until it finds the missing number, which might be never, depending on your random number generator.
A better solution is to create a list of numbers 0 to 10 in order, then shuffle the list into a random order. The normal algorithm for doing this is due to Knuth, Fisher and Yates: starting at the last element, swap each element with a randomly chosen element at the same or an earlier position in the array.
function shuffle(a, n)
    for i from n-1 to 1 step -1
        j = randint(i)
        swap(a[i], a[j])
We assume an array with indices 0 to n-1, and a randint function that returns j in the range 0 <= j <= i.
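The pseudocode above might look like this in Python. This is a sketch only: the rng parameter is an addition made here so the example can be seeded, and Python's own random.shuffle already does the same job.

```python
import random


def shuffle(a, rng=random):
    """Shuffle list a in place with the Fisher-Yates algorithm."""
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends: 0 <= j <= i
        a[i], a[j] = a[j], a[i]
    return a


# Eleven unique numbers from 0 to 10 in a random order, with no retries.
numbers = shuffle(list(range(11)), random.Random(42))
```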
Use an array and add all possible values to it. Then pick one out of the array and remove it. Next time, pick again until the array is empty.
Yes, there is a mathematical way to do it, but it is a bit expensive.
Have an array primes[] where primes[i] = the i'th prime number. So its beginning will be [2,3,5,7,11,...].
Also store a number mult, initially 1. Now, once you draw a number (let it be i), you check if mult % primes[i] == 0. If it is, the number was drawn before; if it isn't, the number was not drawn: choose it and do mult = mult * primes[i].
However, it is expensive because it might require a lot of space for large ranges (the possible values of mult increase exponentially).
(This is a nice mathematical approach because we actually look at a set of primes p_i; the array of primes is only the implementation of the abstract set of primes.)
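For the 0 to 10 case, the prime-product idea could be sketched like this in Python. The helper names was_drawn and mark_drawn are invented for this example.

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]  # one prime per value 0..10


def was_drawn(mult, i):
    """True if value i has already been multiplied into mult."""
    return mult % PRIMES[i] == 0


def mark_drawn(mult, i):
    """Return the new product after recording value i."""
    return mult * PRIMES[i]


mult = 1  # empty set: no primes multiplied in yet
for value in (3, 1, 2):
    mult = mark_drawn(mult, value)
```

Once all eleven values are drawn the product is 200560490130, which needs 38 bits, whereas a plain bitset for the same set needs only 11.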
A bit manipulation alternative for small values is using an int or long as a bitset.
With this approach, to check whether a candidate i is in the set, you only need to check:
if ((set & (1 << i)) == 0) // not in the set
else // already in the set
To enter an element i into the set:
set = set | (1 << i)
A better approach will be to populate a list with all the numbers, shuffle it with fisher-yates shuffle, and iterate it for generating new random numbers.
If I have to avoid these methods above and use some
mathematical/logical/bitwise operation to find whether a random number
has been drawn before or not, is there a way?
Subject to your contrived constraints yes, you can imitate a small bitset using bitwise operations:
You can choose different integer types on the right according to what size you need.
bitset code             bitwise code
std::bitset<32> x;      unsigned long x = 0;
if (x[i]) { ... }       if (x & (1UL << i)) { ... }
// assuming v is 0 or 1
x[i] = v;               x = (x & ~(1UL << i)) | ((unsigned long)v << i);
x[i] = true;            x |= (1UL << i);
x[i] = false;           x &= ~(1UL << i);
For a larger set (beyond the size in bits of unsigned long long), you will need an array of your chosen integer type. Divide the index by the width of each value to know what index to look up in the array, and use the modulus for the bit shifts. This is basically what bitset does.
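The divide-and-modulus scheme for a larger set might be sketched as follows in Python. Python integers are unbounded, so the 32-bit WIDTH here is only an assumption chosen to mirror a fixed-width word, and the function names are invented for this example.

```python
WIDTH = 32  # bits per word; an assumption made for this sketch


def make_bitset(n_bits):
    """Allocate enough WIDTH-bit words to hold n_bits bits, all cleared."""
    return [0] * ((n_bits + WIDTH - 1) // WIDTH)


def set_bit(words, i):
    words[i // WIDTH] |= 1 << (i % WIDTH)


def clear_bit(words, i):
    words[i // WIDTH] &= ~(1 << (i % WIDTH))


def has_bit(words, i):
    return ((words[i // WIDTH] >> (i % WIDTH)) & 1) == 1
```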
I'm assuming that the various answers that tell you how best to shuffle 10 numbers are missing the point entirely: that your contrived constraints are there because you do not in fact want or need to know how best to shuffle 10 numbers :-)
Keep a variable to map the drawn numbers. The i'th bit of that variable will be 1 if the number was drawn before:
int mapNumbers = 0;
int generateRand() {
    const int all = (1 << 11) - 1; // bits 0..10 set
    if ((mapNumbers & all) == all) return -1; // all numbers have been generated
    int x;
    do {
        x = newVal(); // newVal() is assumed to return a random value in 0..10
    } while (mapNumbers & (1 << x)); // retry while x was already drawn
    mapNumbers |= (1 << x);
    return x;
}

Generate an integer that is not among four billion given ones

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4×10^9 distinct integers.
Case 1: we have 1 GB = 1×10^9×8 bits = 8 billion bits of memory.
Solution:
If we use one bit to represent each distinct integer, that is enough. We don't need to sort.
Implementation:
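import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;
int radix = 8;
// 2^32 bits = 2^29 bytes, one bit per possible 32-bit value
byte[] bitfield = new byte[1 << 29];
void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        // treat the int as unsigned so negative values index correctly
        long n = in.nextInt() & 0xffffffffL;
        bitfield[(int) (n / radix)] |= (1 << (int) (n % radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            // print every value whose bit was never set
            if ((bitfield[i] & (1 << j)) == 0) System.out.println((long) i * radix + j);
        }
    }
}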
Case 2: 10 MB memory = 10×10^6×8 bits = 80 million bits
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers belong to the same bucket, so we need 2^16 × 4 × 8 = 2 million bits.
1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. In a second pass through the file, build new buckets for the integers whose high 16-bit prefix is the one found in step 2.
4. Scan the buckets built in step 3 and find the first bucket that doesn't have a hit.
The code is very similar to the one above.
Conclusion:
We decrease memory by increasing the number of passes over the file.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've ever seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
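The candidate-elimination idea might be sketched as follows in Python. The function name find_absent and its parameters are invented for this example; the universe_bits parameter exists only so the sketch can be tried on a small universe.

```python
import random


def find_absent(numbers, universe_bits=32, n_candidates=1000, rng=random):
    """One pass over numbers; returns a value not seen, or None if unlucky."""
    candidates = {rng.getrandbits(universe_bits) for _ in range(n_candidates)}
    for n in numbers:
        candidates.discard(n)  # a match eliminates that candidate
    return next(iter(candidates), None)
```

If every candidate is eliminated the function returns None, and the caller would simply try again with a fresh set of candidates.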
A detailed discussion of this problem can be found in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, merge sort using several external files, etc. But the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines this array using the test function above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-23 and so on, and handle one range in each of 32/8 = 4 passes.
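The k-pass idea could be sketched like this in Python. It is an illustration under stated assumptions rather than Bentley's code: read_numbers is a hypothetical callable that returns a fresh iterator on each call (standing in for re-reading the file), and a list of booleans stands in for a packed bit array.

```python
def find_missing_k_pass(read_numbers, universe_size, bits_available):
    """Scan one window of bits_available values per pass over the input."""
    for start in range(0, universe_size, bits_available):
        end = min(start + bits_available, universe_size)
        seen = [False] * (end - start)  # stands in for a packed bit array
        for n in read_numbers():
            if start <= n < end:
                seen[n - start] = True
        for offset, hit in enumerate(seen):
            if not hit:
                return start + offset
    return None  # every value in the universe is present
```

With the toy numbers from the text, find_missing_k_pass(read_numbers, 32, 8) makes at most 4 passes and stops at the first window that has a gap.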
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 10^9 / 2^32). So if you create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1, and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract that sum (which has one number missing) from the full sum (½)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
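The sum-based shortcut might be sketched as follows in Python. The function name missing_by_sum and the bits parameter are invented here. Python integers are unbounded; in a fixed-width language the accumulator would need 64 bits, since the full sum is just under 2^63.

```python
def missing_by_sum(numbers, bits=32):
    """Assumes numbers holds every value in [0, 2**bits) exactly once, except one."""
    full_sum = (2 ** bits) * (2 ** bits - 1) // 2
    return full_sum - sum(numbers)
```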
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and second half (2^31, 2^32). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say there were two fewer numbers in (2^31, 2^32); then your next search would count the numbers in the ranges (2^31, 3*2^30 - 1) and (3*2^30, 2^32).) Keep repeating until you find a range with zero numbers and you have your answer. This should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2*2^16 - 1 = 131071), bin 2 is (2*2^16 = 131072 to 3*2^16 - 1 = 196607). In Python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list, count how many ints fall in each of the 2^16 bins, and find an incomplete bin that doesn't have all 65536 numbers. Then read through the 4 billion integer list again, but this time only notice integers that are in that bin's range, setting a bit when you find them.
del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num*65536 + i)
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
1. Start off with the allowed range of numbers, 0 to 4294967295.
2. Calculate the midpoint.
3. Loop through the file, counting how many numbers were equal to, less than, or higher than the midpoint value.
4. If no numbers were equal, you're done. The midpoint number is the answer.
5. Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
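A sketch of this bisection in Python follows. One deliberate change from the steps as written: instead of picking the half with the fewest numbers, it picks a half that holds fewer numbers than it has slots, which stays correct when the file contains duplicates. The read_numbers callable is hypothetical and stands in for re-reading the file on each pass.

```python
def find_missing_by_bisection(read_numbers, lo=0, hi=2**32 - 1):
    """Assumes the input holds fewer values in [lo, hi] than the range has slots."""
    while lo <= hi:
        mid = (lo + hi) // 2
        below = equal = above = 0
        for n in read_numbers():
            if lo <= n < mid:
                below += 1
            elif n == mid:
                equal += 1
            elif mid < n <= hi:
                above += 1
        if equal == 0:
            return mid
        # Pigeonhole: at least one half holds fewer values than it has slots.
        if below < mid - lo:
            hi = mid - 1
        else:
            lo = mid + 1
    return None
```

Only the range bounds and three counters are kept in memory, at the cost of up to 32 passes over the input for 32-bit values.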
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes; complexity: O(n); overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 7%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 distinct values) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes = approx. 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
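For readers without Java's BitSet at hand, the same idea can be sketched in Python with a `bytearray` as the bit store. The function name `first_missing` and the `bits` parameter are made up for this example, and it assumes unsigned values in [0, 2^bits); signed input would need to be shifted first.

```python
def first_missing(numbers, bits=32):
    """Return the smallest value in [0, 2**bits) absent from `numbers`, or None."""
    size = 1 << bits
    bitmap = bytearray((size + 7) // 8)      # one bit per possible value
    for value in numbers:
        bitmap[value >> 3] |= 1 << (value & 7)
    for index, byte in enumerate(bitmap):
        if byte != 0xFF:
            # Find the lowest clear bit in this byte.
            for bit in range(8):
                if not byte & (1 << bit):
                    candidate = (index << 3) | bit
                    # Padding bits past `size` do not count as missing values.
                    return candidate if candidate < size else None
    return None
```

With the default `bits=32` the bitmap takes 2^29 bytes, matching the 0.5 GB estimate above; smaller `bits` values are handy for trying it out.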
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the size of the range of numbers will always be 2^n (an exact power of 2), then exclusive-or will work (as shown by another poster). As for why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, an exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let x be the number of times each bit is set among the n-bit numbers, and y the number of times each of those bits is set among the (n+1)-bit numbers. Then y = x * 2, because every n-bit pattern appears once with the new top bit set to 0 and once with it set to 1. The new top bit itself is set in exactly 2^n numbers, which is also even. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any n <= 32, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
// read_int_from_file(&supplied) is assumed to return 0 only at end of file,
// so that a 0 value in the data does not end the loop early.
while (read_int_from_file(&supplied)) {
    result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
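long supplied = 0;
long result = 0;
long offset = INT_MAX;
// read_int_from_file(&supplied) is assumed to return 0 only at end of file.
// Pass 1: find the minimum, which becomes the base of the range.
while (read_int_from_file(&supplied)) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
// Pass 2: xor the re-based values together.
while (read_int_from_file(&supplied)) {
    result = result ^ (supplied - offset);
}
return result + offset;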
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2 at least once. This works only if there is a single missing number. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
// read_int_from_file(&supplied) is assumed to return 0 only at end of file.
// Pass 1: find the minimum, which becomes the base of the range.
while (read_int_from_file(&supplied)) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
// Pass 2: count the values and xor the re-based values together.
while (read_int_from_file(&supplied)) {
    n++;
    result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}
return result + offset;
Basically, this re-bases the range around 0. It then counts the values in the file while computing their exclusive-or, and adds 1 to the count to account for the missing value. Next, it keeps xor-ing in n, incrementing n by 1 each time, until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
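long supplied = 0;
long last = 0;
// read_int_from_file(&value) is assumed to return 0 only at end of file.
read_int_from_file(&last);
while (read_int_from_file(&supplied)) {
    // Using > rather than != lets repeated values pass without a false gap.
    if (supplied > last + 1) {
        return last + 1;
    }
    last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;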
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 * 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
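A rough Python sketch of the two-pass bucket method follows. It assumes unsigned 32-bit values and fewer than 2^32 numbers in total; `missing_by_buckets` and the `read_numbers` callable (a stand-in for re-reading the file) are names invented for this example.

```python
def missing_by_buckets(read_numbers):
    """Find a 32-bit unsigned value absent from the data in two passes.

    `read_numbers` is a zero-argument callable returning a fresh iterator
    over the data each time it is called.
    """
    # Pass 1: count how many values share each possible high 16 bits.
    counts = [0] * (1 << 16)
    for value in read_numbers():
        counts[value >> 16] += 1
    # A bucket with fewer than 2**16 entries cannot cover all its values.
    high = next((h for h, c in enumerate(counts) if c < (1 << 16)), None)
    if high is None:
        return None
    # Pass 2: mark which low 16 bits occur inside that one bucket.
    seen = bytearray(1 << 16)
    for value in read_numbers():
        if value >> 16 == high:
            seen[value & 0xFFFF] = 1
    return (high << 16) | seen.index(0)
```

Because a bucket is chosen by its count alone, repeated values in the file do no harm: fewer than 2^16 entries can never fill all 2^16 slots.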
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
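The diagonal construction can be tried out with a few lines of Python. Note one deliberate change: this sketch counts bit positions from the low-order end rather than the high-order end, which makes the padding unnecessary while the argument stays the same. The name `diagonal_number` is hypothetical, and non-negative integers are assumed.

```python
def diagonal_number(numbers):
    """Build an integer that differs from the i-th input in bit i.

    Assumes non-negative integers; `numbers` stands in for the file.
    """
    result = 0
    for i, value in enumerate(numbers):
        # Bits beyond a number's length count as 0, so they get flipped to 1.
        if not (value >> i) & 1:
            result |= 1 << i
    return result
```

The result needs one bit per input number, so for the 4-billion-number file it is a 4-billion-bit integer, exactly as described above.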
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and let
bool isNotInFile(integer) be a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
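Here is one way the range bookkeeping could be sketched in Python, using tuples instead of the struct and `bisect` to locate the affected range. The names `remove_value` and `first_free` and the 64-bit default bounds are assumptions for illustration only.

```python
import bisect

def remove_value(ranges, value):
    """Remove `value` from a sorted list of inclusive (start, end) ranges."""
    # Index of the last range whose start is <= value.
    i = bisect.bisect_right(ranges, (value, float("inf"))) - 1
    if i < 0:
        return
    start, end = ranges[i]
    if value > end:
        return                      # value was already removed earlier
    pieces = []
    if start < value:
        pieces.append((start, value - 1))
    if value < end:
        pieces.append((value + 1, end))
    ranges[i:i + 1] = pieces        # split, shrink, or drop the range

def first_free(numbers, low=0, high=(1 << 64) - 1):
    """Return the smallest value in [low, high] not present in `numbers`."""
    ranges = [(low, high)]
    for value in numbers:
        remove_value(ranges, value)
    return ranges[0][0] if ranges else None
```

As noted above, the list can grow to roughly one range per distinct value in the worst case (for example, when every other number is present), so memory use depends heavily on the data.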
2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1): could this not be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
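The running-total trick can be checked with a small Python sketch. The function name `missing_by_sum` and the `bits` parameter (added so the idea can be tried on small ranges) are not part of the original answer. The sketch computes 2^(bits-1) minus the total directly, which is equivalent to the add-then-subtract steps described above.

```python
def missing_by_sum(numbers, bits=32):
    """Return the one value of 0 .. 2**bits - 1 that is absent from `numbers`."""
    mod = 1 << bits
    running_total = 0
    for value in numbers:
        running_total = (running_total + value) % mod   # fixed-width overflow
    # The full range sums to 2**(bits - 1) modulo 2**bits,
    # so the gap is that constant minus the running total.
    return (mod // 2 - running_total) % mod
```

The constant comes from 0 + 1 + ... + (2^b - 1) = 2^(b-1) * (2^b - 1), which is congruent to 2^(b-1) modulo 2^b.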
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file (10^4 > 2^8).
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan basically said: sort the file, then go over the integers, and when a value is skipped, there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)

Bits and Bytes: Store a shuffle instruction

Given a byte array with a length of two we have two possibilities for a shuffle. 01 and 10
A length of 3 would allow these shuffle options: 012, 021, 102, 120, 201, 210. Total of 2x3=6 options.
A length of 4 would have 6x4=24. Length of 5 would have 24x5=120 options, etc.
So once you have randomly picked one of these shuffle options, how do you store it? You could store 23105 to indicate how to shuffle five bytes. But that takes 5x3=15 bits. I know it can be done in 7 bits because there are only 120 possibilities.
Any ideas how to more efficiently store a shuffle instruction? It should be an algorithm that will scale in length.
Edit: See my own answer below before you post a new one. I am sure that there is good information in many of these already existing answers, but I just could not understand much of it.
If you have a well-ordering of the set of elements you are shuffling, then you can create a well-ordering for the set of all the permutations and just store a single integer representing which place in the order a permutation falls.
Example:
Shuffling 1 4 5: the possibilities are:
1 4 5 [0]
1 5 4 [1]
4 1 5 [2]
4 5 1 [3]
5 1 4 [4]
5 4 1 [5]
To store the permutation 415, you would just store 2 (zero indexed).
If you have a well-ordering for the original set of elements, you can make a well-ordering for the set of permutations by iterating through the elements from least order to greatest for the leftmost element, while iterating through the leftover elements for the next place to the right and so on until you get to the rightmost element. You wouldn't need to store this array, you would just need to be able to generate the permutations in the same order again to "unpack" the stored integer.
However, attempting to generate all the permutations one by one will take a considerable amount of time beyond the smallest of sets. You can use the observation that the first (N-1)! permutations start with the 1st element, the second (N-1)! permutations start with the second element, then for each permutation that starts with a specific element, the 1st (N-2)! permutations start with the first of the leftover elements and so on and so forth. This will allow you to "pack" or "unpack" the elements in O(n), excepting the complexity of actually generating the factorials and the division and modulus of arbitrary length integers, which will be somewhat substantial.
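The (N-1)! observation translates fairly directly into code. The Python sketch below uses hypothetical names `pack` and `unpack`, and for brevity relies on `list.index` and `list.pop`, so it runs in O(n^2) rather than the O(n) mentioned above.

```python
from math import factorial

def pack(perm):
    """Return the index of `perm` among all orderings of its items."""
    remaining = sorted(perm)
    index = 0
    for pos, item in enumerate(perm):
        rank = remaining.index(item)            # place among leftover items
        index += rank * factorial(len(perm) - 1 - pos)
        remaining.pop(rank)
    return index

def unpack(index, items):
    """Rebuild the permutation of `items` stored as `index`."""
    remaining = sorted(items)
    n = len(remaining)
    perm = []
    for pos in range(n):
        # Each choice at this position covers (n - 1 - pos)! permutations.
        rank, index = divmod(index, factorial(n - 1 - pos))
        perm.append(remaining.pop(rank))
    return perm
```

With the example from above, `pack([4, 1, 5])` gives 2 and `unpack(2, [1, 4, 5])` gives back `[4, 1, 5]`.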
You are right that to store just a permutation of data, and not the data itself, you will need only as many bits as ceil(log2(permutations)). For N items, the number of permutations is factorial(N) or N!, so you would need ceil(log2(factorial(N))) bits to store just the permutation of N items without also storing the items.
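To put numbers on that formula, here is a tiny Python helper; `bits_for_permutation` is a made-up name. It uses integer arithmetic, since `(k - 1).bit_length()` equals ceil(log2(k)) for k >= 1, which avoids floating-point trouble with large factorials.

```python
from math import factorial

def bits_for_permutation(n):
    """Bits needed to index one of the n! orderings of n items."""
    return (factorial(n) - 1).bit_length()
```

For 5 items this gives 7 bits, matching the 120 possibilities in the question.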
In whatever language you're familiar, there should be a ready way to make a big array of M bits, fill it up with a permutation, and then store it on a storage device.
A common shuffling algorithm, and one of the few unbiased ones, is the Fisher-Yates shuffle. Each iteration of the algorithm takes a random number and swaps two places based on that number. By storing a list of those random numbers, you can later reproduce the exact same permutation.
Furthermore, since the valid range for each of those numbers is known in advance, you can pack them all into a big integer by multiplying each number by the product of the lower number's valid ranges, like a kind of variable-base positional notation.
For an array of L items, why not pack the order into L*ceil(log2(L)) bits? (ceil(log2(L)) is the number of bits needed to hold L unique values). For example, here is the representation of the "unshuffled" shuffle, taking the items in order:
L=2: 0 1 (2 bits)
L=3: 00 01 10 (6 bits)
L=4: 00 01 10 11 (8 bits)
L=5: 000 001 010 011 100 (15 bits)
...
L=8: 000 001 010 011 100 101 110 111 (24 bits)
L=9: 0000 0001 0010 0011 0100 0101 0110 0111 1000 (36 bits)
...
L=16: 0000 0001 ... 1111 (64 bits)
L=128: 0000000 0000001 ... 1111111 (896 bits, or 1024 bits if each index is padded to a full byte)
The main advantage to this scheme compared to #user470379's answer, is that it is really easy to extract the indexes, just shift and mask. No need to regenerate the permutation table. This should be a big win for large L: (For 128 items, there are 128! = 3.8562e+215 possible permutations).
(Permutations == "possibilities"; factorial = L! = L * (L-1) * ... * 1 = exactly the way you are calculating possibilities)
This method also isn't that much larger than storing the permutation index. You can store a 128 item shuffle in 1024 bits (32 x 32-bit integers). It takes 717 bits (23 ints) to store 128!.
Between the faster decoding speed and the fact that no temporary storage is required for calculating the permutation, storing the extra 9 ints may be well worth their cost.
Here is an implementation in Ruby that should work for arbitrary sizes. The "shuffle instruction" is contained in the array instruction. The first part calculates the shuffle using a version of the Fisher-Yates algorithm that #Theran mentioned
# Some Setup and utilities
sizeofInt = 32 # fix for your language/platform
N = 16
BitsPerIndex = Math.log2(N).ceil
IdsPerWord = sizeofInt/BitsPerIndex
# sets the n'th bitfield in array a to v
def setBitfield a,n,v
mask = (2**BitsPerIndex)-1
idx = n/IdsPerWord
shift = (n-idx*IdsPerWord)*BitsPerIndex
a[idx]&=~(mask<<shift)
a[idx]|=(v&mask)<<shift
end
# returns the n'th bitfield in array a
def getBitfield a,n
mask = (2**BitsPerIndex)-1
idx = n/IdsPerWord
shift = (n-idx*IdsPerWord)*BitsPerIndex
return (a[idx]>>shift)&mask
end
#create the shuffle instruction in linear time
nwords = (N.to_f/IdsPerWord).ceil # num words required to hold instruction
instruction = Array.new(nwords){0} # array initialized to 0
#the "inside-out" Fisher–Yates shuffle
for i in (1..N-1)
j = rand(i+1)
setBitfield(instruction,i,getBitfield(instruction,j))
setBitfield(instruction,j,i)
end
#Here is a way to visualize the shuffle order
#delete ".reverse.map{|s|s.to_i(2)}" to visualize the way it's really stored
p instruction.map{|v|v.to_s(2).rjust(BitsPerIndex*IdsPerWord,'0').scan(
Regexp.new('.'*BitsPerIndex)).reverse.map{|s|s.to_i(2)}}
Here is an example of applying the shuffle to an array of characters:
A=(0...N).map{|v|('A'.ord+v).chr}
puts A*''
#Apply the shuffle to A in linear time
for i in (0...N)
print A[getBitfield(instruction,i)]
end
print "\n"
#example: for N=20, produces
> ABCDEFGHIJKLMNOPQRST
> MSNOLGRQCTHDEPIAJFKB
Hopefully this won't be too hard to convert to javascript, or any other language.
I am sorry if this was already covered in a previous answer, but for the first time, these answers are completely foreign to me. I might have mentioned that I know Java and JavaScript and that I know nothing of mathematics... So log2, permutations, factorial, well-ordering are all unknown words to me.
And on top of that I ended up (again) using StackOverflow as a white board to write out my question and answered the question in my head 20 minutes later. I was tied up in non computer life and, knowing StackOverflow, I figured it was too late to save more than 20% of everybody's easily wasted time.
Anyway, having gotten lost in all three existing answers, here is the answer I know of
(written in Javascript but it should be easy to translate 20 lines of foreign code to your language of choice)
(see it in action here: http://jsfiddle.net/M3vHC)
Edit: Thanks to AShelly for this catch: This will fail (become highly biased) when given a key length of more than 12 assuming your ints are 32 bit (more than 19 if your ints are 64 bit)
var keyLength = 5
var possibilities = 1
for(var i = 0; i < keyLength ; i++)
possibilities *= i+1 // Calculate the number of possibilities to create an unbiased key
var randomKey = Math.floor(Math.random()*possibilities) // Your shuffle instruction. Random number with correct number of possibilities starting with zero as the first possibility
var keyArray = new Array(keyLength) // This will contain the new locations of existing indexes. [0,1,2,3,4] means no shuffle [4,3,2,1,0] means reverse order. etcetera
var remainsOfKey = randomKey // Our "working" key. This is disposable / single use.
var taken = new Array(keyLength) // Tells if an index has already been accounted for in the keyArray
for(var i = keyArray.length;i > 0;i--) { // The number of possibilities for the first item in the key array is the number of blanks in key array.
var add = remainsOfKey % i + 1, remainsOfKey = Math.floor(remainsOfKey / i) // Grab a number at least zero and less than the number of blanks in the keyArray, then divide that digit out of the working key
for(var j = 0; add; j++) // If we got x from the above line, make sure x is not already taken
if(!taken[j])
add--
taken[keyArray[i-1] = --j] = true // Take what we have because it is right
}
alert('Based on a key length of ' + keyLength + ' and a random key of ' + randomKey + ' the new indexes are ... ' + keyArray.join(',') + ' !')

Random number generator that fills an interval

How would you implement a random number generator that, given an interval, (randomly) generates all numbers in that interval, without any repetition?
It should consume as little time and memory as possible.
Example in a just-invented C#-ruby-ish pseudocode:
interval = new Interval(0,9)
rg = new RandomGenerator(interval);
count = interval.Count // equals 10
count.times.do{
print rg.GetNext() + " "
}
This should output something like :
1 4 3 2 7 5 0 9 8 6
Fill an array with the interval, and then shuffle it.
The standard way to shuffle an array of N elements is to pick a random number between 0 and N-1 (say R), and swap item[R] with item[N-1]. Then subtract one from N, and repeat until you reach N = 1.
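As a sketch of how the fill-then-shuffle approach could back the generator from the question, here is a Python version that hands out one number per swap. The name `random_interval` is hypothetical, and the interval is taken to be inclusive at both ends, as in the question's example.

```python
import random

def random_interval(low, high, rng=random):
    """Yield every integer in [low, high] exactly once, in random order."""
    items = list(range(low, high + 1))
    for n in range(len(items), 0, -1):
        r = rng.randrange(n)                  # 0 <= r <= n - 1
        # Move the chosen item to the end of the unshuffled part.
        items[r], items[n - 1] = items[n - 1], items[r]
        yield items[n - 1]
```

Memory use is proportional to the interval size, but each call does only constant work.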
This has come up before. Try using a linear feedback shift register.
One suggestion, but it's memory intensive:
The generator builds a list of all numbers in the interval, then shuffles it.
A very efficient way to shuffle an array of numbers where each index is unique comes from image processing and is used when applying techniques like pixel-dissolve.
Basically you start with an ordered 2D array and then shift columns and rows. Those permutations are by the way easy to implement, you can even have one exact method that will yield the resulting value at x,y after n permutations.
The basic technique, described on a 3x3 grid:
1) Start with an ordered list, each number may exist only once
0 1 2
3 4 5
6 7 8
2) Pick a row/column you want to shuffle, advance it one step. In this case, I am shifting the second row one to the right.
0 1 2
5 3 4
6 7 8
3) Pick a row/column you want to shuffle... I shuffle the second column one down.
0 7 2
5 1 4
6 3 8
4) Pick ... For instance, first row, one to the right.
2 0 7
5 1 4
6 3 8
You can repeat those steps as often as you want. You can always do this kind of transformation also on a 1D array. So your result would be now [2, 0, 7, 5, 1, 4, 6, 3, 8].
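The row and column shifts are easy to express on a list of lists. The Python sketch below uses made-up helper names `shift_row` and `shift_column` and reproduces the grids shown in steps 2 to 4, applying whichever rotation yields each displayed grid.

```python
def shift_row(grid, row, steps=1):
    """Rotate one row to the right by `steps`, in place."""
    r = grid[row]
    steps %= len(r)
    grid[row] = r[-steps:] + r[:-steps]

def shift_column(grid, col, steps=1):
    """Rotate one column downwards by `steps`, in place."""
    column = [row[col] for row in grid]
    steps %= len(column)
    column = column[-steps:] + column[:-steps]
    for row, value in zip(grid, column):
        row[col] = value

grid = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
shift_row(grid, 1)       # second row, one step
shift_column(grid, 1)    # second column, one step down
shift_row(grid, 0)       # first row, one step
flat = [value for row in grid for value in row]
```

How random the result looks depends entirely on how the rows, columns, and step counts are chosen; a handful of fixed shifts like these is far from a uniform shuffle.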
An occasionally useful alternative to the shuffle approach is to use a subscriptable set container. At each step, choose a random number 0 <= n < count. Extract the nth item from the set.
The main problem is that typical containers can't handle this efficiently. I have used it with bit-vectors, but it only works well if the largest possible member is reasonably small, due to the linear scanning of the bitvector needed to find the nth set bit.
99% of the time, the best approach is to shuffle as others have suggested.
EDIT
I missed the fact that a simple array is a good "set" data structure - don't ask me why, I've used it before. The "trick" is that you don't care whether the items in the array are sorted or not. At each step, you choose one randomly and extract it. To fill the empty slot (without having to shift an average half of your items one step down) you just move the current end item into the empty slot in constant time, then reduce the size of the array by one.
For example...
class remaining_items_queue
{
private:
std::vector<int> m_Items;
public:
...
bool Extract (int &p_Item); // return false if items already exhausted
};
bool remaining_items_queue::Extract (int &p_Item)
{
    if (m_Items.size () == 0) return false;
    int l_Random = Random_Num (m_Items.size ());
    // Random_Num written to give 0 <= result < parameter
    p_Item = m_Items [l_Random];
    // Fill the gap with the last item, then shrink the vector by one
    m_Items [l_Random] = m_Items.back ();
    m_Items.pop_back ();
    return true;
}
The trick is to get a random number generator that gives (with a reasonably even distribution) numbers in the range 0 to n-1 where n is potentially different each time. Most standard random generators give a fixed range. Although the following DOESN'T give an even distribution, it is often good enough...
int Random_Num (int p)
{
return (std::rand () % p);
}
std::rand returns random values in the range 0 <= x <= RAND_MAX, where RAND_MAX is implementation defined.
Take all numbers in the interval, put them to list/array
Shuffle the list/array
Loop over the list/array
One way is to generate an ordered list (0-9) in your example.
Then use the random function to select an item from the list. Remove the item from the original list and add it to the tail of new one.
The process is finished when the original list is empty.
Output the new list.
You can use a linear congruential generator with parameters chosen randomly but so that it generates the full period. You need to be careful, because the quality of the random numbers may be bad, depending on the parameters.
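One possible way to do this, sketched in Python: round the interval size up to a power of two, pick parameters that satisfy the Hull-Dobell full-period conditions for that modulus, and skip any state that falls outside the interval. The function name `lcg_interval` and the way the parameters are drawn are assumptions of this sketch, not a recommendation for high-quality randomness.

```python
import random

def lcg_interval(low, high, rng=random):
    """Yield every integer in [low, high] once using a full-period LCG."""
    count = high - low + 1
    m = 1
    while m < count:
        m *= 2                                # modulus: power of two >= count
    # Hull-Dobell conditions for modulus 2**k: c odd, a % 4 == 1.
    a = 4 * rng.randrange(m) + 1
    c = 2 * rng.randrange(m) + 1
    x = rng.randrange(m)                      # random starting state
    for _ in range(m):
        if x < count:                         # skip states outside the interval
            yield low + x
        x = (a * x + c) % m
```

This needs only a few integers of state no matter how large the interval is, at the cost of skipping at most as many states as it outputs. As warned above, some parameter choices (a = 1, for instance) give a very regular-looking sequence.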

Resources