From textbook:
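Presumably the quoted example was a function along these lines (a reconstruction from the answer below; the name outbyte and the unsigned int parameter are assumptions):

#include <stdio.h>

/* Print a 32-bit value one byte at a time, most significant byte first. */
void outbyte(unsigned int b)
{
    putchar(b >> 24); /* putchar keeps only the low 8 bits of its argument */
    putchar(b >> 16);
    putchar(b >> 8);
    putchar(b);
}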
So I know a byte has 8 bits and the right bit-shift adds zero bits to the left and pops off bits from the right.
But how is it used in the above example to output a byte?
I would've expected:
putchar(b >> 8)
putchar(b >> 7)
putchar(b >> 6)
etc.
Since I assume putchar outputs the popped off bits?
putchar prints the ASCII character corresponding to the integer given.
putchar(0x41) converts the integer 0x41 into an unsigned char (with a size of one byte) and prints out the ASCII character corresponding to 0x41 (which is "A").
The key thing to realize here is that putchar only looks at the lower 8 bits, i.e. putchar(0x41) and putchar(0xffffff41) do the same thing.
Now let's look at what happens when you pass something to your function above.
outbyte(0x41424344);
first it right-shifts b by 24 bits, and then calls putchar on that value
0x41424344 >> 24; // 0x00000041
putchar(0x00000041); // A
then it right-shifts b by 16 bits, and then calls putchar on that value
0x41424344 >> 16; // 0x00004142
putchar(0x00004142); // B
etc.
Here it is in action: http://ideone.com/3xeFSx
Suppose I have an arbitrary block of bytes. The block is terminated with a CRC remainder computed over the whole block using the CRC-16-CCITT algorithm, where the remainder is arranged in the big-endian byte order. After the block and the remainder, there is an arbitrary number of zero bytes that continue until the end of the byte stream.
This arrangement takes advantage of a certain property of this CRC algorithm which is normally considered undesirable: it does not distinguish between messages with different numbers of trailing zeroes, provided that the message is terminated with its remainder (it is in my case). This allows the receiver to assert the correctness of the data regardless of the number of trailing bytes in the stream.
Here is an example:
>>> hex(crc(b'123456789')) # Computing the remainder
'0x29b1'
>>> hex(crc(b'123456789\x29\xb1')) # Appending the remainder in the big-endian order
'0x0' # If the remainder is correct, the residual value is always zero
>>> hex(crc(b'123456789\x29\xb1\x00\x00')) # ...and it is invariant to the number of trailing zeros
'0x0'
>>> hex(crc(b'123456789\x29\xb1\x00\x00\x00'))
'0x0'
This is the desired behavior in my case. However, in my application the data are exchanged over a medium that utilizes a non-return-to-zero (NRZ) encoding: the medium layer injects a single stuff bit after every five consecutive data bits of the same level, where the polarity of the stuff bit is the opposite of the preceding bits; e.g. the value of 00000000 is transmitted as 000001000. Bit stuffing is highly undesirable because it adds overhead.
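For illustration, here is a quick sketch of that stuffing rule in Python (an assumption on my part: the stuff bit itself counts toward the next run, which is what reproduces the 000001000 example above):

def stuff_bits(bits):
    # Insert a complementary stuff bit after every five consecutive
    # identical bits; the stuff bit itself starts the next run.
    out = []
    prev, run = None, 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == prev else 1
        prev = b
        if run == 5:
            out.append(1 - b)
            prev, run = 1 - b, 1
    return out

print(''.join(map(str, stuff_bits([0] * 8))))  # prints 000001000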
In order to take advantage of the invariance of the CRC algorithm to the trailing data (which is used for padding) and yet avoid bit stuffing, I intend to xor every data byte with 0x55 (although it could be any other bit pattern that avoids stuffing) before updating the CRC remainder, and then xor the final remainder with 0x5555.
For reference, here is the standard CRC-16-CCITT algorithm, naive implementation:
def crc16(b):
    crc = 0xFFFF                 # initial value
    for byte in b:
        crc ^= byte << 8         # move the next byte into the top of the register
        for bit in range(8):
            if crc & 0x8000:     # top bit set: shift and xor with the polynomial
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
And here is my modification which xors inputs and outputs with 0x55:
def crc16_mod(b):
    crc = 0xFFFF
    for byte in b:
        crc ^= (byte ^ 0x55) << 8   # xor each input byte with 0x55 first
        for bit in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc ^ 0x5555             # xor the final remainder with 0x5555
A simple check confirms that the modified algorithm behaves as intended:
>>> print(hex(crc16_mod(b'123456789'))) # New remainder
0x954f
>>> print(hex(crc16_mod(b'123456789\x95\x4f'))) # Appending the remainder; residual is 0x5555
0x5555
>>> print(hex(crc16_mod(b'123456789\x95\x4f\x55\x55\x55'))) # Invariant to the number of trailing 0x55
0x5555
>>> print(hex(crc16_mod(b'123456789\x95\x4f\x55\x55\x55\x55'))) # Invariant to the number of trailing 0x55
0x5555
My question is as follows: am I compromising error-detecting properties of the algorithm by introducing this modification? Are there any other downsides I should be aware of?
Under the standard model of errors (each bit flipped independently with a fixed probability), there is no downside. Beyond that, it is difficult to anticipate practical difficulties.
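To see why, note that crc16_mod is, by construction, the standard CRC applied to a byte-wise relabeled copy of the input: xoring every byte with a fixed constant is a bijection that commutes with channel bit flips, so any error pattern the standard algorithm detects, the modified one detects as well. A quick property check (assuming the crc16 and crc16_mod definitions above are in scope):

import random

for _ in range(1000):
    data = bytes(random.randrange(256) for _ in range(random.randrange(1, 32)))
    xored = bytes(x ^ 0x55 for x in data)
    # The modification is exactly: xor the input bytes, run the standard
    # CRC, then xor the result with 0x5555.
    assert crc16_mod(data) == crc16(xored) ^ 0x5555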
When I say that 0b11111111 is a byte in Java, it says "cannot convert int to byte". As I understand it, that is because 11111111 = 255, and bytes in Java are signed, running from -128 to 127. But if a byte is just 8 bits of data, isn't 11111111 8 bits? I know 11111111 could be an integer, but in my situation it must be represented as a byte, because it must be sent to a file in byte form. So how do I send a byte with the bits 11111111 to a file (by the way, this is my question)? Also, when I try printing the binary value of -1, I get 11111111111111111111111111111111. Why is that? I don't really understand how signed bytes work.
You need to cast the value to a byte:
byte b = (byte) 0b11111111;
The reason you need the cast is that 0b11111111 is an int literal (with a decimal value of 255) and it's outside the range of valid byte values (-128 to +127).
Java allows hex literals, but not binary. You can declare a byte with the binary value of 11111111 using this:
byte myByte = (byte) 0xFF;
You can use hex literals to store binary data in ints and longs as well.
Edit: you actually can have binary literals in Java 7 and up, my bad.
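As for writing such a byte to a file, here is a minimal sketch (the file name out.bin and the class name are arbitrary):

import java.io.FileOutputStream;
import java.io.IOException;

public class WriteByte {
    public static void main(String[] args) throws IOException {
        byte b = (byte) 0b11111111; // stored as -1; the bit pattern is 11111111
        try (FileOutputStream out = new FileOutputStream("out.bin")) {
            out.write(b); // writes the single byte 0xFF
        }
        // -1 prints as 32 ones because the byte is sign-extended to an int
        // before formatting; mask with 0xFF to see just the 8 bits.
        System.out.println(Integer.toBinaryString(b & 0xFF)); // 11111111
    }
}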
I'm trying to convert an int into 4 unsigned bytes. I was reading a post and I'm not sure what the comparison does. I know that 0xFF is a mask, but I'm not sure when to use & and when to use |. Let's use the number 6 as an example, and say the least significant byte holds the 6. Why would we do
6 & 0xFF // I know that we are masking it against 11111111
and when do we use the OR operator? I'm still not sure how to use & or |.
x & 0xFF will set all bits of x to zero except for the last byte (which stays the same). If you had used a bitwise or (|), it would leave the bits of x set, and set all the bits of the last byte to 1.
Typically, the comparison will be something like (x & 0xFF) == x. This is to make sure that the upper three bytes of x are all 0.
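To make that concrete, here is a small sketch in C (the sample value 0x12345606 and the variable names are just for illustration) that pulls all four bytes out of an int with >> and & 0xFF:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int x = 0x12345606; /* the low byte holds the 6 */

    /* >> moves the byte of interest down to the bottom;
       & 0xFF then clears everything above it. */
    uint8_t b0 = x & 0xFF;         /* 0x06 */
    uint8_t b1 = (x >> 8) & 0xFF;  /* 0x56 */
    uint8_t b2 = (x >> 16) & 0xFF; /* 0x34 */
    uint8_t b3 = (x >> 24) & 0xFF; /* 0x12 */

    printf("%02x %02x %02x %02x\n", b3, b2, b1, b0);
    return 0;
}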
I'm making a simple Cocoa program that can encode text to binary and decode it back to text. I tried to make this script and I was not even close to accomplishing this. Can anyone help me? This has to include two textboxes and two buttons or whatever is best, Thanks!
There are two parts to this.
The first is to encode the characters of the string into bytes. You do this by sending the string a dataUsingEncoding: message. Which encoding you choose will determine which bytes it gives you for each character. Start with NSUTF8StringEncoding, and then experiment with other encodings, such as NSUnicodeStringEncoding, once you get it working.
The second part is to convert every bit of every byte into either a '0' character or a '1' character, so that, for example, the letter A, encoded in UTF-8 to a single byte, will be represented as 01000001.
So, converting characters to bytes, and converting bytes to characters representing bits. These two are completely separate tasks; the second part should work correctly for any stream of bytes, including any valid stream of encoded characters, any invalid stream of encoded characters, and indeed anything that isn't text at all.
The first part is easy enough:
- (NSString *) stringOfBitsFromEncoding:(NSStringEncoding)encoding
              ofString:(NSString *)inputString
{
    //Encode the characters to bytes using the given encoding. The bytes are contained in an NSData object, which we receive.
    NSData *data = [inputString dataUsingEncoding:encoding];
    //I did say these were two separate jobs.
    return [self stringOfBitsFromData:data];
}
For the second part, you'll need to loop through the bytes of the data. A C for loop will do the job there, and that will look like this:
//This is the method we're using above. I'll leave out the method signature and let you fill that in.
- …
{
    //Find out how many bytes the data object contains.
    NSUInteger length = [data length];
    //Get the pointer to those bytes. “const” here means that we promise not to change the values of any of the bytes. (The compiler may give a warning if we don't include this, since we're not allowed to change these bytes anyway.)
    const char *bytes = [data bytes];
    //We'll store the output here. There are 8 bits per byte, and we'll be putting in one character per bit, so we'll tell NSMutableString that it should make room for (the number of bytes times 8) characters.
    NSMutableString *outputString = [NSMutableString stringWithCapacity:length * 8];
    //The loop. We start by initializing i to 0, then increment it (add 1 to it) after each pass. We keep looping as long as i < length; when i >= length, the loop ends.
    for (NSUInteger i = 0; i < length; ++i) {
        char thisByte = bytes[i];
        for (NSUInteger bitNum = 0; bitNum < 8; ++bitNum) {
            //Call a function, which I'll show the definition of in a moment, that will get the value of a bit at a given index within a given character.
            bool bit = getBitAtIndex(thisByte, bitNum);
            //If this bit is a 1, append a '1' character; if it is a 0, append a '0' character.
            [outputString appendFormat:@"%c", bit ? '1' : '0'];
        }
    }
    return outputString;
}
Bits 101 (or, 1100101)
Bits are literally just digits in base 2. Humans in the Western world usually write out numbers in base 10, but a number is a number no matter what base it's written in, and every character, and every byte, and indeed every bit, is just a number.
Digits—including bits—are counted up from the lowest place, according to the exponent to which the base is raised to find the magnitude of that place. We want bits, so that base is 2, so our place values are:
2^0 = 1: The ones place (the lowest bit)
2^1 = 2: The twos place (the next higher bit)
2^2 = 4: The fours place
2^3 = 8: The eights place
And so on, up to 2^7. (Note that the highest exponent is exactly one lower than the number of digits we're after; in this case, 7 vs. 8.)
If that all reminds you of reading about “the ones place”, “the tens place”, “the hundreds place”, etc. when you were a kid, it should: it's the exact same principle.
So a byte such as 65, which (in UTF-8) completely represents the character 'A', is the sum of:
2^7 × 0 = 0
+ 2^6 × 1 = 64
+ 2^5 × 0 = 0
+ 2^4 × 0 = 0
+ 2^3 × 0 = 0
+ 2^2 × 0 = 0
+ 2^1 × 0 = 0
+ 2^0 × 1 = 1
= 0 + 64 + 0 + 0 + 0 + 0 + 0 + 1
= 64 + 1
= 65
Back when you learned base 10 numbers as a kid, you probably noticed that ten is “10”, one hundred is “100”, etc. This is true in base 2 as well: as 10^x is “1” followed by x “0”s in base 10, so is 2^x “1” followed by x “0”s in base 2. So, for example, sixty-four in base 2 is “1000000” (count the zeroes and compare to the table above).
We are going to use these exact-power-of-two numbers to test each bit in each input byte.
Finding the bit
C has a pair of “shift” operators that will insert zeroes or remove digits at the low end of a number. The former is called “shift left”, and is written as <<, and you can guess the opposite.
We want shift left. We want to shift 1 left by the number of the bit we're after. That is exactly equivalent to raising 2 (our base) to the power of that number; for example, 1 << 6 = 2^6 = “1000000”.
Testing the bit
C has an operator for bit testing, too; it's &, the bitwise AND operator. (Do not confuse this with &&, which is the logical AND operator. && is for using whole true/false values in making decisions; & is one of your tools for working with bits within values.)
Strictly speaking, & does not test single bits; it goes through the bits of both input values, and returns a new value whose bits are the bitwise AND of each input pair. So, for example,
01100101
& 00101011
----------
00100001
Each bit in the output is 1 if and only if both of the corresponding input bits were also 1.
Putting these two things together
We're going to use the shift left operator to give us a number where one bit, the nth bit, is set—i.e., 2^n—and then use the bitwise AND operator to test whether the same bit is also set in our input byte.
//This is a C function that takes a char and an int, promising not to change either one, and returns a bool.
bool getBitAtIndex(const char byte, const int bitNum)
//It could also be a method, which would look like this:
//- (bool) bitAtIndex:(const int)bitNum inByte:(const char)byte
//but you would have to change the code above. (Feel free to try it both ways.)
{
    //Find 2^bitNum, which will be a number with exactly 1 bit set. For example, when bitNum is 6, this number is “1000000”—a single 1 followed by six 0s—in binary.
    const int powerOfTwo = 1 << bitNum;
    //Test whether the same bit is also set in the input byte.
    bool bitIsSet = byte & powerOfTwo;
    return bitIsSet;
}
A bit of magic I should acknowledge
The bitwise AND operator does not evaluate to a single bit—it does not evaluate to only 1 or 0. Remember the above example, in which the & operator returned 00100001, which is 33.
The bool type is a bit magic: Any time you convert any value to bool, it automatically becomes either 1 or 0. Anything that is not 0 becomes 1; anything that is 0 becomes 0.
The Objective-C BOOL type does not do this, which is why I used bool in the code above. You are free to use whichever you prefer, except that you generally should use BOOL whenever you deal with anything that expects a BOOL, particularly when overriding methods in subclasses or implementing protocols. You can convert back and forth freely, though not losslessly (since bool will change non-zero values as described above).
Oh yeah, you said something about text boxes too
When the user clicks on your button, get the stringValue of your input field, call stringOfBitsFromEncoding:ofString: using a reasonable encoding (such as UTF-8) and that string, and set the resulting string as the new stringValue of your output field.
Extra credit: Add a pop-up button with which the user can choose an encoding.
Extra extra credit: Populate the pop-up button with all of the available encodings, without hard-coding or hard-nibbing a list.
I have the character "ö". If I look in this UTF-8 table I see it has the hex value F6. If I look in the Unicode table I see that "ö" has the indices E0 and 16. If I add both, I get F6, the hex value of the code point. This is the binary value 1111 0110.
1) How do I get from the hex value F6 to the indices E0 and 16?
2) I don't know how to get from F6 to the two bytes C3 B6 ...
Because I didn't get the expected results, I tried to go the other way. "ö" is represented in ISO-8859-1 as "ö". In the UTF-8 table I can see that "Ã" has the decimal value 195 and "¶" has the decimal value 182. Converted to bits, this is 1100 0011 1011 0110.
Process:
1) Look up the Unicode code point for the character "ö" in a table. Calculated from the indices E0 and 16, you get the code point U+00F6.
2) According to the algorithm posted by wildplasser, you can calculate the UTF-8 encoded values C3 and B6.
3) In binary form you get 1100 0011 1011 0110, which corresponds to the decimal values 195 and 182.
4) If these values are interpreted as ISO 8859-1 (one byte each), you get "ö".
PS: I also found this link, which shows the values from step 2.
The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" is giving you the value of the code point in UTF-8. They are both simply listing the Unicode values of the characters.
In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal, and 246 in decimal.
UTF-8 is a representation of Unicode, using a sequence of between one and four bytes per Unicode code point. The transformation from 32-bit Unicode code points to UTF-8 byte sequences is described in that article - it is pretty simple to do, once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.
If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö.
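To make the transformation concrete, here is the arithmetic worked out by hand (my own illustration, following the scheme just described). F6 is 11110110 in binary; code points from U+0080 to U+07FF use the two-byte pattern 110xxxxx 10xxxxxx, with the code point padded to 11 bits filling the x positions:

U+00F6, padded to 11 bits: 00011 110110
first byte:  110 00011 = 1100 0011 = C3
second byte: 10 110110 = 1011 0110 = B6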
The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.
Once you have converted between UTF-8 and standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters.
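Going back the other way is just as mechanical. As an illustration (this helper is my own sketch, not part of the original answer), a two-byte UTF-8 sequence can be decoded to its code point, which is the Latin-1 value whenever it is 0xFF or less:

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) back to its code point. */
unsigned utf8_2byte_to_cp(const unsigned char *src)
{
    return ((src[0] & 0x1f) << 6)  /* 5 payload bits of the lead byte */
         | (src[1] & 0x3f);        /* 6 payload bits of the continuation byte */
}
/* Example: utf8_2byte_to_cp((const unsigned char *)"\xc3\xb6") yields 0xf6. */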
See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.
unsigned cha_latin2utf8(unsigned char *dst, unsigned cha)
{
    if (cha < 0x80) { *dst = cha; return 1; }
    /* All code points up to 11 bits (0x0 -- 0x7ff)
    ** fit within a 2-byte UTF-8 sequence:
    ** first byte  = 110xxxxx := 0xc0 + (cha >> 6)  (top 5 bits)
    ** second byte = 10xxxxxx := 0x80 + (cha & 63)  (low 6 bits)
    */
    *dst++ = 0xc0 | ((cha >> 6) & 0x1f); /* marker 110 + 5 bits */
    *dst++ = 0x80 | (cha & 0x3f);        /* marker 10 + 6 bits */
    return 2; /* number of bytes produced */
}
To test it:
#include <stdio.h>

int main (void)
{
    unsigned char buff[12];
    cha_latin2utf8(buff, 0xf6);
    fprintf(stdout, "%02x %02x\n"
        , (unsigned) buff[0] & 0xff
        , (unsigned) buff[1] & 0xff );
    return 0;
}
The result:
c3 b6