How to combine two halves from two bytes to get a single byte - byte

Let us say that byte1 = 00001111, and byte2 = 11110000. How would I take the second half of byte1, then take the first half of byte2 and combine them to get one byte of value 11111111. I have done some research, but I just cannot seem to wrap my head around bitmasks and shifting.

If it's guaranteed that the "spare" bits are all zeroes, you can just OR together the two bytes, since A or 0 == A (see below)
If that's not guaranteed, you need (byte1 AND 0x0f) OR (byte2 AND 0xf0).
The AND operations set the unwanted bits to zero (since A AND 0 == 0) and then the OR operation combines the bytes, as above.
ABCDXXXX XXXXEFGH
AND 11110000 (0xf0) AND 00001111 (0x0f)
-------- --------
= ABCD0000 = 0000EFGH
and then
ABCD0000
OR 0000EFGH
--------
= ABCDEFGH
No bit shifts are required because the bits you want to keep are already in the correct bit positions.

Related

Given a number find the next sparse number

The problem statement is the following:
Given a number x, find the smallest Sparse number which greater than or equal to x
A number is Sparse if there are no two adjacent 1s in its binary representation. For example 5 (binary representation: 101) is sparse, but 6 (binary representation: 110) is not sparse.
I'm taking the problem from this post where the most efficient solution is listed as having a running time of O(logn):
1) Find binary of the given number and store it in a
boolean array.
2) Initialize last_finalized bit position as 0.
2) Start traversing the binary from least significant bit.
a) If we get two adjacent 1's such that next (or third)
bit is not 1, then
(i) Make all bits after this 1 to last finalized
bit (including last finalized) as 0.
(ii) Update last finalized bit as next bit.
What isn't clear in the post is what is meant by "finalized bit." It seems that the algorithm starts out by inserting the binary representation of a number into a std::vector using a while loop in which it ANDS the input (which is a number x) with 1 and then pushes that back into the vector but, at least from the provided description, its not clear why this is done. Is there a clearer explanation (or even approach) to an efficient solution for this problem?
EDIT:
// Start from second bit (next to LSB)
for (int i=1; i<n-1; i++)
{
// If current bit and its previous bit are 1, but next
// bit is not 1.
if (bin[i] == 1 && bin[i-1] == 1 && bin[i+1] != 1)
{
// Make the next bit 1
bin[i+1] = 1;
// Make all bits before current bit as 0 to make
// sure that we get the smallest next number
for (int j=i; j>=last_final; j--)
bin[j] = 0;
// Store position of the bit set so that this bit
// and bits before it are not changed next time.
last_final = i+1;
}
}
If you see any sequence "011" in the binary representation of your number, then change the '0' to a '1' and set every bit after it to '0' (since that gives the minimum).
The algorithm suggests starting from the right (the least significant bit), but if you start from the left, find the leftmost sequence "011" and do as above, you get the solution one half of the time. The other half is when the next bit to the left of this sequence is a '1'. When you change the '0' to a '1', you create a new "011" sequence that needs to be treated the same way.
The "last finalized bit" is the leftmost '0' bit that sees only '0' bits to its right. This is because all of those '0's won't change in the next steps.
So here are some observation to solve this question:-
Convert the number in its binary format now if the last digits is 0 then we can append 1 and 0 both at the end but if the last digit is 1 then we can only append 0 at the end.
So naive approach is to do a simple iteration and check for every number but we can optimize this approach so for that if we look closely to some example
let say n=5 -> 101 next sparse is 5 (101)
let say n=14 -> 1110 next sparse is 16 (10000)
let say n=39 ->100111 next sparse is 40 (101000)
let say n=438 -> 110110110 next sparse is 512 (1000000000)
To optimize naive approach the idea here is to use the BIT-MANIPULATION
the concept that if we AND a bit sequence with a shifted version of itself, we’re effectively removing the trailing 1 from every sequence of consecutive 1s.
for n=5
0101 (5)
& 1010 (5<<1)
---------
0000
so as you get the value of n&(n<<1) to be zero means the number you have does not have any consecutive 1's in it ( because if it is not zero then there must be a sequence of consecutive 1's in our number) so this will be answer
for n=14
01110 (14)
& 11100 (14<<1)
----------------
01100
so the value is not zero then just increment our number by 1 so our new number is 15
now again perform same things
01111 (15)
& 11110 (15<<1)
------------------------------
01110
again our number is not zero then increment number by 1 and perform same for n = 16
010000 (16)
& 100000 (16<<1)
------------------------
000000
so now our number become zero so we have now encounter a number which does not contains any consecutive 1's so our answer is 16.
So in the similar manner you can check for other number too.
Hope you get the idea if so then upvote. Happy Coding!
int nextSparse(int n) {
// code here
while(true)
{
if(n&(n<<1))
n++;
else
return n;
}
}
Time Complexity will be O(logn).

How to translate Text to Binary with Cocoa?

I'm making a simple Cocoa program that can encode text to binary and decode it back to text. I tried to make this script and I was not even close to accomplishing this. Can anyone help me? This has to include two textboxes and two buttons or whatever is best, Thanks!
There are two parts to this.
The first is to encode the characters of the string into bytes. You do this by sending the string a dataUsingEncoding: message. Which encoding you choose will determine which bytes it gives you for each character. Start with NSUTF8StringEncoding, and then experiment with other encodings, such as NSUnicodeStringEncoding, once you get it working.
The second part is to convert every bit of every byte into either a '0' character or a '1' character, so that, for example, the letter A, encoded in UTF-8 to a single byte, will be represented as 01000001.
So, converting characters to bytes, and converting bytes to characters representing bits. These two are completely separate tasks; the second part should work correctly for any stream of bytes, including any valid stream of encoded characters, any invalid stream of encoded characters, and indeed anything that isn't text at all.
The first part is easy enough:
- (NSString *) stringOfBitsFromEncoding:(NSStringEncoding)encoding
ofString:(NSString *)inputString
{
//Encode the characters to bytes using the UTF-8 encoding. The bytes are contained in an NSData object, which we receive.
NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
//I did say these were two separate jobs.
return [self stringOfBitsFromData:data];
}
For the second part, you'll need to loop through the bytes of the data. A C for loop will do the job there, and that will look like this:
//This is the method we're using above. I'll leave out the method signature and let you fill that in.
- …
{
//Find out how many bytes the data object contains.
NSUInteger length = [data length];
//Get the pointer to those bytes. “const” here means that we promise not to change the values of any of the bytes. (The compiler may give a warning if we don't include this, since we're not allowed to change these bytes anyway.)
const char *bytes = [data bytes];
//We'll store the output here. There are 8 bits per byte, and we'll be putting in one character per bit, so we'll tell NSMutableString that it should make room for (the number of bytes times 8) characters.
NSMutableString *outputString = [NSMutableString stringWithCapacity:length * 8];
//The loop. We start by initializing i to 0, then increment it (add 1 to it) after each pass. We keep looping as long as i < length; when i >= length, the loop ends.
for (NSUInteger i = 0; i < length; ++i) {
char thisByte = bytes[i];
for (NSUInteger bitNum = 0; bitNum < 8; ++bitNum) {
//Call a function, which I'll show the definition of in a moment, that will get the value of a bit at a given index within a given character.
bool bit = getBitAtIndex(thisByte, bitNum);
//If this bit is a 1, append a '1' character; if it is a 0, append a '0' character.
[outputString appendFormat: #"%c", bit ? '1' : '0'];
}
}
return outputString;
}
Bits 101 (or, 1100101)
Bits are literally just digits in base 2. Humans in the Western world usually write out numbers in base 10, but a number is a number no matter what base it's written in, and every character, and every byte, and indeed every bit, is just a number.
Digits—including bits—are counted up from the lowest place, according to the exponent to which the base is raised to find the magnitude of that place. We want bits, so that base is 2, so our place values are:
2^0 = 1: The ones place (the lowest bit)
2^1 = 2: The twos place (the next higher bit)
2^2 = 4: The fours place
2^3 = 8: The eights place
And so on, up to 2^7. (Note that the highest exponent is exactly one lower than the number of digits we're after; in this case, 7 vs. 8.)
If that all reminds you of reading about “the ones place”, “the tens place”, “the hundreds place”, etc. when you were a kid, it should: it's the exact same principle.
So a byte such as 65, which (in UTF-8) completely represents the character 'A', is the sum of:
2^7 × 0 = 0
+ 2^6 × 0 = 64
+ 2^5 × 1 = 0
+ 2^4 × 0 = 0
+ 2^3 × 0 = 0
+ 2^2 × 0 = 0
+ 2^1 × 0 = 0
+ 2^0 × 1 = 1
= 0 + 64 +0+0+0+0+0 + 1
= 64 + 1
= 65
Back when you learned base 10 numbers as a kid, you probably noticed that ten is “10”, one hundred is “100”, etc. This is true in base 2 as well: as 10^x is “1” followed by x “0”s in base 10, so is 2^x “1” followed by “x” 0s in base 2. So, for example, sixty-four in base 2 is “1000000” (count the zeroes and compare to the table above).
We are going to use these exact-power-of-two numbers to test each bit in each input byte.
Finding the bit
C has a pair of “shift” operators that will insert zeroes or remove digits at the low end of a number. The former is called “shift left”, and is written as <<, and you can guess the opposite.
We want shift left. We want to shift 1 left by the number of the bit we're after. That is exactly equivalent to raising 2 (our base) to the power of that number; for example, 1 << 6 = 2^6 = “1000000”.
Testing the bit
C has an operator for bit testing, too; it's &, the bitwise AND operator. (Do not confuse this with &&, which is the logical AND operator. && is for using whole true/false values in making decisions; & is one of your tools for working with bits within values.)
Strictly speaking, & does not test single bits; it goes through the bits of both input values, and returns a new value whose bits are the bitwise AND of each input pair. So, for example,
01100101
& 00101011
----------
00100001
Each bit in the output is 1 if and only if both of the corresponding input bits were also 1.
Putting these two things together
We're going to use the shift left operator to give us a number where one bit, the nth bit, is set—i.e., 2^n—and then use the bitwise AND operator to test whether the same bit is also set in our input byte.
//This is a C function that takes a char and an int, promising not to change either one, and returns a bool.
bool getBitAtIndex(const char byte, const int bitNum)
//It could also be a method, which would look like this:
//- (bool) bitAtIndex:(const int)bitNum inByte:(const char)byte
//but you would have to change the code above. (Feel free to try it both ways.)
{
//Find 2^bitNum, which will be a number with exactly 1 bit set. For example, when bitNum is 6, this number is “1000000”—a single 1 followed by six 0s—in binary.
const int powerOfTwo = 1 << bitNum;
//Test whether the same bit is also set in the input byte.
bool bitIsSet = byte & powerOfTwo;
return bitIsSet;
}
A bit of magic I should acknowledge
The bitwise AND operator does not evaluate to a single bit—it does not evaluate to only 1 or 0. Remember the above example, in which the & operator returned 33.
The bool type is a bit magic: Any time you convert any value to bool, it automatically becomes either 1 or 0. Anything that is not 0 becomes 1; anything that is 0 becomes 0.
The Objective-C BOOL type does not do this, which is why I used bool in the code above. You are free to use whichever you prefer, except that you generally should use BOOL whenever you deal with anything that expects a BOOL, particularly when overriding methods in subclasses or implementing protocols. You can convert back and forth freely, though not losslessly (since bool will change non-zero values as described above).
Oh yeah, you said something about text boxes too
When the user clicks on your button, get the stringValue of your input field, call stringOfBitsFromEncoding:ofString: using a reasonable encoding (such as UTF-8) and that string, and set the resulting string as the new stringValue of your output field.
Extra credit: Add a pop-up button with which the user can choose an encoding.
Extra extra credit: Populate the pop-up button with all of the available encodings, without hard-coding or hard-nibbing a list.

bit vector implementation of set in Programming Pearls, 2nd Edition

On Page 140 of Programming Pearls, 2nd Edition, Jon proposed an implementation of sets with bit vectors.
We'll turn now to two final structures that exploit the fact that our sets represent integers. Bit vectors are an old friend from Column 1. Here are their private data and functions:
enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };
int n, hi, *x;
void set(int i) { x[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { x[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i) { return x[i>>SHIFT] &= (1<<(i & MASK)); }
As I gathered, the central idea of a bit vector to represent an integer set, as described in Column 1, is that the i-th bit is turned on if and only if the integer i is in the set.
But I am really at a loss at the algorithms involved in the above three functions. And the book doesn't give an explanation.
I can only get that i & MASK is to get the lower 5 bits of i, while i>>SHIFT is to move i 5 bits toward the right.
Anybody would elaborate more on these algorithms? Bit operations always seem a myth to me, :(
Bit Fields and You
I'll use a simple example to explain the basics. Say you have an unsigned integer with four bits:
[0][0][0][0] = 0
You can represent any number here from 0 to 15 by converting it to base 2. Say we have the right end be the smallest:
[0][1][0][1] = 5
So the first bit adds 1 to the total, the second adds 2, the third adds 4, and the fourth adds 8. For example, here's 8:
[1][0][0][0] = 8
So What?
Say you want to represent a binary state in an application-- if some option is enabled, if you should draw some element, and so on. You probably don't want to use an entire integer for each one of these- it'd be using a 32 bit integer to store one bit of information. Or, to continue our example in four bits:
[0][0][0][1] = 1 = ON
[0][0][0][0] = 0 = OFF //what a huge waste of space!
(Of course, the problem is more pronounced in real life since 32-bit integers look like this:
[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0] = 0
The answer to this is to use a bit field. We have a collection of properties (usually related ones) which we will flip on and off using bit operations. So, say, you might have 4 different lights on a piece of hardware that you want to be on or off.
3 2 1 0
[0][0][0][0] = 0
(Why do we start with light 0? I'll explain this in a second.)
Note that this is an integer, and is stored as an integer, but is used to represent multiple states for multiple objects. Crazy! Say we turn lights 2 and 1 on:
3 2 1 0
[0][1][1][0] = 6
The important thing you should note here: There's probably no obvious reason why lights 2 and 1 being on should equal six, and it may not be obvious how we would do anything with this scheme of information storage. It doesn't look more obvious if you add more bits:
3 2 1 0
[1][1][1][0] = 0xE \\what?
Why do we care about this? Do we have exactly one state for each number between 0 and 15?How are we going to manage this without some insane series of switch statements? Ugh...
The Light at the End
So if you've worked with binary arithmetic a bit before, you might realize that the relationship between the numbers on the left and the numbers on the right is, of course, base 2. That is:
1*(23) + 1*(22) + 1*(21) +0 *(20) = 0xE
So each light is present in the exponent of each term of the equation. If the light is on, there is a 1 next to its term- if the light is off, there is a zero. Take the time to convince yourself that there is exactly one integer between 0 and 15 that corresponds to each state in this numbering scheme.
Bit operators
Now that we have this done, let's take a second to see what bitshifting does to integers in this setup.
[0][0][0][1] = 1
When you shift bits to the left or the right in an integer, it literally moves the bits left and right. (Note: I 100% disavow this explanation for negative numbers! There be dragons!)
1<<2 = 4
[0][1][0][0] = 4
4>>1 = 2
[0][0][1][0] = 2
You will encounter similar behavior when shifting numbers represented with more than one bit. Also, it shouldn't be hard to convince yourself that x>>0 or x<<0 is just x. Doesn't shift anywhere.
This probably explains the naming scheme of the Shift operators to anyone who wasn't familiar with them.
Bitwise operations
This representation of numbers in binary can also be used to shed some light on the operations of bitwise operators on integers. Each bit in the first number is xor-ed, and-ed, or or-ed with its fellow number. Take a second to venture to wikipedia and familiarize yourself with the function of these Boolean operators - I'll explain how they function on numbers but I don't want to rehash the general idea in great detail.
...
Welcome back! Let's start by examining the effect of the OR (|) operator on two integers, stored in four bit.
OR OPERATOR ON:
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
________________
[1][1][0][1] = 0xD
Tough! This is a close analogue to the truth table for the boolean OR operator. Notice that each column ignores the adjacent columns and simply fills in the result column with the result of the first bit and the second bit OR'd together. Note also that the value of anything or'd with 1 is 1 in that particular column. Anything or'd with zero remains the same.
The table for AND (&) is interesting, though somewhat inverted:
AND OPERATOR ON:
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
________________
[1][0][0][0] = 0x8
In this case we do the same thing- we perform the AND operation with each bit in a column and put the result in that bit. No column cares about any other column.
Important lesson about this, which I invite you to verify by using the diagram above: anything AND-ed with zero is zero. Also, equally important- nothing happens to numbers that are AND-ed with one. They stay the same.
The final table, XOR, has behavior which I hope you all find predictable by now.
XOR OPERATOR ON:
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
________________
[0][1][0][1] = 0x5
Each bit is being XOR'd with its column, yadda yadda, and so on. But look closely at the first row and the second row. Which bits changed? (Half of them.) Which bits stayed the same? (No points for answering this one.)
The bit in the first row is being changed in the result if (and only if) the bit in the second row is 1!
The one lightbulb example!
So now we have an interesting set of tools we can use to flip individual bits. Let's go back to the lightbulb example and focus only on the first lightbulb.
0
[?] \\We don't know if it's one or zero while coding
We know that we have an operation that can always make this bit equal to one- the OR 1 operator.
0|1 = 1
1|1 = 1
So, ignoring the rest of the bulbs, we could do this
4_bit_lightbulb_integer |= 1;
and know for sure that we did nothing but set the first lightbulb to ON.
3 2 1 0
[0][0][0][?] = 0 or 1? \\4_bit_lightbulb_integer
[0][0][0][1] = 1
________________
[0][0][0][1] = 0x1
Similarly, we can AND the number with zero. Well- not quite zero- we don't want to affect the state of the other bits, so we will fill them in with ones.
I'll use the unary (one-argument) operator for bit negation. The ~ (NOT) bitwise operator flips all of the bits in its argument. ~(0X1):
[0][0][0][1] = 0x1
________________
[1][1][1][0] = 0xE
We will use this in conjunction with the AND bit below.
Let's do 4_bit_lightbulb_integer & 0xE
3 2 1 0
[0][1][0][?] = 4 or 5? \\4_bit_lightbulb_integer
[1][1][1][0] = 0xE
________________
[0][1][0][0] = 0x4
We're seeing a lot of integers on the right-hand-side which don't have any immediate relevance. You should get used to this if you deal with bit fields a lot. Look at the left-hand side. The bit on the right is always zero and the other bits are unchanged. We can turn off light 0 and ignore everything else!
Finally, you can use the XOR bit to flip the first bit selectively!
3 2 1 0
[0][1][0][?] = 4 or 5? \\4_bit_lightbulb_integer
[0][0][0][1] = 0x1
________________
[0][1][0][*] = 4 or 5?
We don't actually know what the value of * is now- just that flipped from whatever ? was.
Combining Bit Shifting and Bitwise operations
The interesting fact about these two operations is when taken together they allow you to manipulate selective bits.
[0][0][0][1] = 1 = 1<<0
[0][0][1][0] = 2 = 1<<1
[0][1][0][0] = 4 = 1<<2
[1][0][0][0] = 8 = 1<<3
Hmm. Interesting. I'll mention the negation operator here (~) as it's used in a similar way to produce the needed bit values for ANDing stuff in bit fields.
[1][1][1][0] = 0xE = ~(1<<0)
[1][1][0][1] = 0xD = ~(1<<1)
[1][0][1][1] = 0xB = ~(1<<2)
[0][1][1][1] = 0X7 = ~(1<<3)
Are you seeing an interesting relationship between the shift value and the corresponding lightbulb position of the shifted bit?
The canonical bitshift operators
As alluded to above, we have an interesting, generic method for turning on and off specific lights with the bit-shifters above.
To turn on a bulb, we generate the 1 in the right position using bit shifting, and then OR it with the current lightbulb positions. Say we want to turn on light 3, and ignore everything else. We need to get a bit shifting operation that ORs
3 2 1 0
[?][?][?][?] \\all we know about these values at compile time is where they are!
and 0x8
[1][0][0][0] = 0x8
Which is easy, thanks to bitshifting! We'll pick the number of the light and switch the value over:
1<<3 = 0x8
and then:
4_bit_lightbulb_integer |= 0x8;
3 2 1 0
[1][?][?][?] \\the ? marks have not changed!
And we can guarantee that the bit for the 3rd lightbulb is set to 1 and that nothing else has changed.
Clearing a bit works similarly- we'll use the negated bits table above to, say, clear light 2.
~(1<<2) = 0xB = [1][0][1][1]
4_bit_lightbulb_integer & 0xB:
3 2 1 0
[?][?][?][?]
[1][0][1][1]
____________
[?][0][?][?]
The XOR method of flipping bits is the same idea as the OR one.
So the canonical methods of bit switching are this:
Turn on the light i:
4_bit_lightbulb_integer|=(1<<i)
Turn off light i:
4_bit_lightbulb_integer&=~(1<<i)
Flip light i:
4_bit_lightbulb_integer^=(1<<i)
Wait, how do I read these?
In order to check a bit we can simply zero out all of the bits except for the one we care about. We'll then check to see if the resulting value is greater than zero- since this is the only value that could possibly be nonzero, it will make the entire integer nonzero if and only if it is nonzero. For example, to check bit 2:
1<<2:
[0][1][0][0]
4_bit_lightbulb_integer:
[?][?][?][?]
1<<2 & 4_bit_lightbulb_integer:
[0][?][0][0]
Remember from the previous examples that the value of ? didn't change. Remember also that anything AND 0 is 0. So, we can say for sure that if this value is greater than zero, the switch at position 2 is true and the lightbulb is zero. Similarly, if the value is off, the value of the entire thing will be zero.
(You can alternately shift the entire value of 4_bit_lightbulb_integer over by i bits and AND it with 1. I don't remember off the top of my head if one is faster than the other but I doubt it.)
So the canonical checking function:
Check if bit i is on:
if (4_bit_lightbulb_integer & 1<<i) {
\\do whatever
}
The specifics
Now that we have a complete set of tools for bitwise operations, we can look at the specific example here. This is basically the same idea- except a much more concise and powerful way of executing it. Let's look at this function:
void set(int i) { x[i>>SHIFT] |= (1<<(i & MASK)); }
From the canonical implementation I'm going to make a guess that this is trying to set some bits to 1! Let's take an integer and look at what's going on here if i feed the value 0x32 (50 in decimal) into i:
x[0x32>>5] |= (1<<(0x32 & 0x1f))
Well, that's a mess.. let's dissect this operation on the right. For convenience, pretend there are 24 more irrelevant zeros, since these are both 32 bit integers.
...[0][0][0][1][1][1][1][1] = 0x1F
...[0][0][1][1][0][0][1][0] = 0x32
________________________
...[0][0][0][1][0][0][1][0] = 0x12
It looks like everything is being cut off at the boundary on top where 1s turn into zeros. This technique is called Bit Masking. Interestingly, the boundary here restricts the resulting values to be between 0 and 31... Which is exactly the number of bit positions we have for a 32 bit integer!
x[0x32>>5] |= (1<<(0x12))
Let's look at the other half.
...[0][0][1][1][0][0][1][0] = 0x32
Shift five bits to the right:
...[0][0][0][0][0][0][0][1] = 0x01
Note that this transformation exactly destroyed all information from the first part of the function- we have 32-5 = 27 remaining bits which could be nonzero. This indicates which of 227 integers in the array of integers are selected. So the simplified equation is now:
x[1] |= (1<<0x12)
This just looks like the canonical bit-setting operation! We've just chosen
So the idea is to use the first 27 bits to pick an integer to shift and the last five bits indicate which bit of the 32 in that integer to shift.
The key to understanding what's going on is to recognize that BITSPERWORD = 2SHIFT. Thus, x[i>>SHIFT] finds which 32-bit element of the array x has the bit corresponding to i. (By shifting i 5 bits to the right, you're simply dividing by 32.) Once you have located the correct element of x, the lower 5 bits of i can then be used to find which particular bit of x[i>>SHIFT] corresponds to i. That's what i & MASK does; by shifting 1 by that number of bits, you move the bit corresponding to 1 to the exact position within x[i>>SHIFT] that corresponds to the ith bit in x.
Here's a bit more of an explanation:
Imagine that we want capacity for N bits in our bit vector. Since each int holds 32 bits, we will need (N + 31) / 32 int values for our storage (that is, N/32 rounded up). Within each int value, we will adopt the convention that bits are ordered from least significant to most significant. We will also adopt the convention that the first 32 bits of our vector are in x[0], the next 32 bits are in x[1], and so forth. Here's the memory layout we are using (showing the bit index in our bit vector corresponding to each bit of memory):
+----+----+-------+----+----+----+
x[0]: | 31 | 30 | . . . | 02 | 01 | 00 |
+----+----+-------+----+----+----+
x[1]: | 63 | 62 | . . . | 34 | 33 | 32 |
+----+----+-------+----+----+----+
etc.
Our first step is to allocate the necessary storage capacity:
x = new int[(N + BITSPERWORD - 1) >> SHIFT]
(We could make provision for dynamically expanding this storage, but that would just add complexity to the explanation.)
Now suppose we want to access bit i (either to set it, clear it, or just to know its current value). We need to first figure out which element of x to use. Since there are 32 bits per int value, this is easy:
subscript for x = i / 32
Making use of the enum constants, the x element we want is:
x[i >> SHIFT]
(Think of this as a 32-bit-wide window into our N-bit vector.) Now we have to find the specific bit corresponding to i. Looking at the memory layout, it's not hard to figure out that the first (rightmost) bit in the window corresponds to bit index 32 * (i >> SHIFT). (The window starts afteri >> SHIFT slots in x, and each slot has 32 bits.) Since that's the first bit in the window (position 0), then the bit we're interested in is is at position
i - (32 * (i >> SHIFT))
in the windows. With a little experimenting, you can convince yourself that this expression is always equal to i % 32 (actually, that's one definition of the mod operator) which, in turn, is always equal to i & MASK. Since this last expression is the fastest way to calculate what we want, that's what we'll use.
From here, the rest is pretty simple. We start with a single bit in the least-significant position of the window (that is, the constant 1), and move it to the left by i & MASK bits to get it to the position in the window corresponding to bit i in the bit vector. This is where the expression
1 << (i & MASK)
comes from. With the bit now moved to where we want it, we can use this as a mask to set, clear, or query the value of the bit at that position in x[i>>SHIFT] and we know that we're actually setting, clearing, or querying the value of bit i in our bit vector.
If you store your bits in an array of n words you can imagine them to be layed out as a matrix with n rows and 32 columns (BITSPERWORD):
3 0
1 0
0 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
1 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
2 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
....
n xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
To get the k-th bit you divide k by 32. The (integer) result will give you the row (word) the bit is in, the reminder will give you which bit is within the word.
Dividing by 2^p can be done simply by shifting p postions to the right. The reminder can be obtained by getting the p rightmost bits (i.e the bitwise AND with (2^p - 1)).
In C terms:
#define div32(k) ((k) >> 5)
#define mod32(k) ((k) & 31)
#define word_the_bit_is_in(k) div32(k)
#define bit_within_word(k) mod32(k)
Hope it helps.

Generate an integer that is not among four billion given ones

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×109×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 232 = 4*109 distinct integers.
Case 1: we have 1 GB = 1 * 109 * 8 bits = 8 billion bits memory.
Solution:
If we use one bit representing one distinct integer, it is enough. we don't need sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
Scanner in = new Scanner(new FileReader("a.txt"));
while(in.hasNextInt()){
int n = in.nextInt();
bitfield[n/radix] |= (1 << (n%radix));
}
for(int i = 0; i< bitfield.lenght; i++){
for(int j =0; j<radix; j++){
if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
}
}
}
Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits
Solution:
For all possible 16-bit prefixes, there are 216 number of
integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket.
Build the counter of each bucket through the first pass through the file.
Scan the buckets, find the first one who has less than 65536 hit.
Build new buckets whose high 16-bit prefixes are we found in step2
through second pass of the file
Scan the buckets built in step3, find the first bucket which doesnt
have a hit.
The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
A detailed discussion on this problem has been discussed in Jon Bentley "Column 1. Cracking the Oyster" Programming Pearls Addison-Wesley pp.3-10
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e memory to handle 8 numbers ) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-22 and so on and handle this range in each of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you done, iterate over the bits, find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 232), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 109 / (232) ). So if you create a bit-array of 232 bits with each bit initialized to zero (which will take up 229 bytes ~ 500 MB of RAM; remember a byte = 23 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints; and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed; with overwhelming probability (it differs by 100% by about 10-2592069) you will find a missing int). In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers; you had something like 232 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one # missing to the full sum (½)(232)(232 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method, would be to read through the array and count the number of numbers that are in the first half (0 to 231-1) and second half (231, 232). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were two less number in (231, 232) then your next search would count the numbers in the range (231, 3*230-1), (3*230, 232). Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(232) = 216 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 218 bytes = 256 kB. So bin 0 is (0 to 65535=216-1), bin 1 is (216=65536 to 2*216-1=131071), bin 2 is (2*216=131072 to 3*216-1=196607). In python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
if bin_count < 65536:
break # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list; and count how many ints fall in each of the 216 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 32 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
if N // 65536 == bin_num:
my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
if not bit:
print bin_num*65536 + i
break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probablity for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: neclectable as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 Gb.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10 bitcount - 1, which will always be greater than 2 bitcount. Technically, the number you have to beat is 2 bitcount - (4 * 109 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leafs will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion number = 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible, if two ends leafs have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Psuedocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2128*1018 + 1 ( which is (28)16*1018 + 1 ) - cannot it be a universal answer for today? This represents a number that cannot be held in 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print a '2' for every bit of the file.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan said it basically, sort the file and then go over the integers and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)

Bits and Bytes: Store a shuffle instruction

Given a byte array with a length of two we have two possibilities for a shuffle. 01 and 10
A length of 3 would allow these shuffle options 012,021,102,120,102,201,210. Total of 2x3=6 options.
A length of 4 would have 6x4=24. Length of 5 would have 24x5=120 options, etc.
So once you have randomly picked one of these shuffle options, how do you store it? You could store 23105 to indicate how to shuffle four bytes.. But that takes 5x3=15 bits. I know it can be done in 7 bits because there are only 120 possibilities.
Any ideas how to more efficiently store a shuffle instruction? It should be an algorithm that will scale in length.
Edit: See my own answer below before you post a new one. I am sure that there is good information in many of these already existing answers, but I just could not understand much of it.
If you have a well-ordering of the set of elements you are shuffling, then you can create a well-ordering for the set of all the permutations and just store a single integer representing which place in the order a permutation falls.
Example:
Shuffling 1 4 5: the possibilities are:
1 4 5 [0]
1 5 4 [1]
4 1 5 [2]
4 5 1 [3]
5 1 4 [4]
5 4 1 [5]
To store the permutation 415, you would just store 2 (zero indexed).
If you have a well-ordering for the original set of elements, you can make a well-ordering for the set of permutations by iterating through the elements from least order to greatest for the leftmost element, while iterating through the leftover elements for the next place to the right and so on until you get to the rightmost element. You wouldn't need to store this array, you would just need to be able to generate the permutations in the same order again to "unpack" the stored integer.
However, attempting to generate all the permutations one by one will take a considerable amount of time beyond the smallest of sets. You can use the observation that the first (N-1)! permutations start with the 1st element, the second (N-1)! permutations start with the second element, then for each permutation that starts with a specific element, the 1st (N-2)! permutations start with the first of the leftover elements and so on and so forth. This will allow you to "pack" or "unpack" the elements in O(n), excepting the complexity of actually generating the factorials and the division and modulus of arbitrary length integers, which will be somewhat substantial.
You are right that to store just a permutation of data, and not the data itself, you will need only as many bits as ceil(log2(permutations)). For N items, the number of permutations is factorial(N) or N!, so you would need ceil(log2(factorial(N))) bits to store just the permutation of N items without also storing the items.
In whatever language you're familiar, there should be a ready way to make a big array of M bits, fill it up with a permutation, and then store it on a storage device.
A common shuffling algorithm, and one of the few unbiased ones, is the Fisher-Yates shuffle. Each iteration of the algorithm takes a random number and swaps two places based on that number. By storing a list of those random numbers, you can later reproduce the exact same permutation.
Furthermore, since the valid range for each of those numbers is known in advance, you can pack them all into a big integer by multiplying each number by the product of the lower number's valid ranges, like a kind of variable-base positional notation.
For an array of L items, why not pack the order into L*ceil(log2(L)) bits? (ceil(log2(L)) is the number of bits needed to hold L unique values). For example, here is the representation of the "unshuffled" shuffle, taking the items in order:
L=2: 0 1 (2 bits)
L=3: 00 01 10 (6 bits)
L=4: 00 01 10 11 (8 bits)
L=5: 000 001 010 011 100 (15 bits)
...
L=8: 000 001 010 011 100 101 110 111 (24 bits)
L=9: 0000 0001 0010 0011 0100 0101 0110 0111 1000 (36 bits)
...
L=16: 0000 0001 ... 1111 (64 bits)
L=128: 00000000 000000001 ... 11111111 (1024 bits)
The main advantage to this scheme compared to #user470379's answer, is that it is really easy to extract the indexes, just shift and mask. No need to regenerate the permutation table. This should be a big win for large L: (For 128 items, there are 128! = 3.8562e+215 possible permutations).
(Permutations == "possibilities"; factorial = L! = L * (L-1) * ... * 1 = exactly the way you are calculating possibilities)
This method also isn't that much larger than storing the permutation index. You can store a 128 item shuffle in 1024 bits (32 x 32-bit integers). It takes 717 bits (23 ints) to store 128!.
Between the faster decoding speed and the fact that no temporary storage is required for caclulating the permutation, storing the extra 9 ints may be well worth their cost.
Here is an implementation in Ruby that should work for arbitrary sizes. The "shuffle instruction" is contained in the array instruction. The first part calculates the shuffle using a version of the Fisher-Yates algorithm that #Theran mentioned
# Some Setup and utilities
sizeofInt = 32 # fix for your language/platform
N = 16
BitsPerIndex = Math.log2(N).ceil
IdsPerWord = sizeofInt/BitsPerIndex
# sets the n'th bitfield in array a to v
def setBitfield a,n,v
mask = (2**BitsPerIndex)-1
idx = n/IdsPerWord
shift = (n-idx*IdsPerWord)*BitsPerIndex
a[idx]&=~(mask<<shift)
a[idx]|=(v&mask)<<shift
end
# returns the n'th bitfield in array a
def getBitfield a,n
mask = (2**BitsPerIndex)-1
idx = n/IdsPerWord
shift = (n-idx*IdsPerWord)*BitsPerIndex
return (a[idx]>>shift)&mask
end
#create the shuffle instruction in linear time
nwords = (N.to_f/IdsPerWord).ceil # num words required to hold instruction
instruction = Array.new(nwords){0} # array initialized to 0
#the "inside-out" Fisher–Yates shuffle
for i in (1..N-1)
j = rand(i+1)
setBitfield(instruction,i,getBitfield(instruction,j))
setBitfield(instruction,j,i)
end
#Here is a way to visualize the shuffle order
#delete ".reverse.map{|s|s.to_i(2)}" to visualize the way it's really stored
p instruction.map{|v|v.to_s(2).rjust(BitsPerIndex*IdsPerWord,'0').scan(
Regexp.new('.'*BitsPerIndex)).reverse.map{|s|s.to_i(2)}}
Here is an example of applying the shuffle to an array of characters:
A=(0...N).map{|v|('A'.ord+v).chr}
puts A*''
#Apply the shuffle to A in linear time
for i in (0...N)
print A[getBitfield(instruction,i)]
end
print "\n"
#example: for N=20, produces
> ABCDEFGHIJKLMNOPQRST
> MSNOLGRQCTHDEPIAJFKB
Hopefully this won't be too hard to convert to javascript, or any other language.
I am sorry if this was already covered in a previous answer,, but for the first time,, these answers are completely foreign to me. I might have mentioned that I know Java and JavaScript and that I know nothing of mathematics... So log2, permutations, factorial, well-ordering are all unknown words to me.
And on top of that I ended up (again) using StackOverflow as a white board to write out my question and answered the question in my head 20 minutes later. I was tied up in non computer life and,, knowing StackOverflow I figured it was too late to save more than 20% of everybody's easily wasted time.
Anyway, having gotten lost in all three existing answers. Here is the answer I know of
(written in Javascript but it should be easy to translate 20 lines of foreign code to your language of choice)
(see it in action here: http://jsfiddle.net/M3vHC)
Edit: Thanks to AShelly for this catch: This will fail (become highly biased) when given a key length of more than 12 assuming your ints are 32 bit (more than 19 if your ints are 64 bit)
var keyLength = 5
var possibilities = 1
for(var i = 0; i < keyLength ; i++)
possibilities *= i+1 // Calculate the number of possibilities to create an unbiased key
var randomKey = parseInt(Math.random()*possibilities) // Your shuffle instruction. Random number with correct number of possibilities starting with zero as the first possibility
var keyArray = new Array(keyLength) // This will contain the new locations of existing indexes. [0,1,2,3,4] means no shuffle [4,3,2,1,0] means reverse order. etcetera
var remainsOfKey = randomKey // Our "working" key. This is disposible / single use.
var taken = new Array(keyLength) // Tells if an index has already been accounted for in the keyArray
for(var i = keyArray.length;i > 0;i--) { // The number of possibilities for the first item in the key array is the number of blanks in key array.
var add = remainsOfKey % i + 1, remainsOfKey = parseInt(randomKey / i) // Grab a number at least zero and less then the number of blanks in the keyArray
for(var j = 0; add; j++) // If we got x from the above line, make sure x is not already taken
if(!taken[j])
add--
taken[keyArray[i-1] = --j] = true // Take what we have because it is right
}
alert('Based on a key length of ' + keyLength + ' and a random key of ' + randomKey + ' the new indexes are ... ' + keyArray.join(',') + ' !')

Resources