Convert character from UTF-8 to ISO-8859-1 manually

I have the character "ö". If I look in this UTF-8 table I see it has the hex value F6. If I look in the Unicode table I see that "ö" has the indices E0 and 16. If I add both I get F6 as the hex value of the code point. This is the binary value 1111 0110.
1) How do I get from the hex value F6 to the indices E0 and 16?
2) I don't know how to get from F6 to the two bytes C3 B6 ...
Because I didn't get the results I expected, I tried to go the other way. "ö" is represented in ISO-8859-1 as "ö". In the UTF-8 table I can see that "Ã" has the decimal value 195 and "¶" has the decimal value 182. Converted to bits this is 1100 0011 1011 0110.
Process:
Look up the character "ö" in a Unicode table. From the indices E0 and 16 you get the code point U+00F6.
According to the algorithm posted by wildplasser you can calculate the UTF-8 encoded bytes C3 and B6.
In binary form you get 1100 0011 1011 0110, which corresponds to the decimal values 195 and 182.
If these values are interpreted as ISO 8859-1 (one byte per character) you get "ö".
PS: I also found this link, which shows the values from step 2.

The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" is giving you the value of the code point in UTF-8. They are both simply listing the Unicode value of the characters.
In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal, and 246 in decimal.
UTF-8 is a representation of Unicode, using a sequence of between one and four bytes per Unicode code point. The transformation from 32-bit Unicode code points to UTF-8 byte sequences is described in that article - it is pretty simple to do, once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.
If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö.
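Written out with pencil and paper, that transformation looks like this:

    code point 0xF6, padded to 11 bits:   000 1111 0110
    split into 5 + 6 bits:                00011 | 110110
    first byte  110xxxxx -> 110 00011  =  1100 0011  =  C3
    second byte 10xxxxxx -> 10 110110  =  1011 0110  =  B6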
The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.
Once you have converted between UTF-8 and standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters.
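If you want that last step in code as well, here is a minimal sketch of the reverse direction (a two-byte UTF-8 sequence back to a Latin-1 byte). The function name is made up for illustration, and no validation of malformed or out-of-range input is done:

    #include <stdio.h>

    /* Sketch: decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) back to
    ** its code point; if the result fits in one byte it is also the
    ** Latin-1 value. No validation of malformed input here.
    */
    unsigned utf8_2byte_to_latin1(unsigned char b1, unsigned char b2)
    {
        return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
    }

    int main(void)
    {
        printf("%02x\n", utf8_2byte_to_latin1(0xC3, 0xB6)); /* prints f6, i.e. ö in Latin-1 */
        return 0;
    }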
See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.

unsigned cha_latin2utf8(unsigned char *dst, unsigned cha)
{
    if (cha < 0x80) { *dst = cha; return 1; }

    /* all 11-bit code points (0x0 -- 0x7ff)
    ** fit within a 2-byte utf8 char
    ** first byte  = 110xxxxx := 0xc0 + (cha >> 6)  MSB
    ** second byte = 10xxxxxx := 0x80 + (cha & 63)  LSB
    */
    *dst++ = 0xc0 | ((cha >> 6) & 0x1f); /* 2+1+5 bits */
    *dst++ = 0x80 | (cha & 0x3f);        /* 1+1+6 bits */
    return 2; /* number of bytes produced */
}
To test it:
#include <stdio.h>

unsigned cha_latin2utf8(unsigned char *dst, unsigned cha); /* defined above */

int main(void)
{
    unsigned char buff[12];
    cha_latin2utf8(buff, 0xf6);
    fprintf(stdout, "%02x %02x\n"
            , (unsigned) buff[0] & 0xff
            , (unsigned) buff[1] & 0xff);
    return 0;
}
The result:
c3 b6

Related

What is this unpack doing? Can someone help me understand just a few letters?

I'm reading this code and I'm a tad confused as to what is going on. This code is using Ruby's OpenSSL library.
encrypted_message = cipher.update(address_string) + cipher.final
encrypted_message
=> "G\xCB\xE10prs\x1D\xA7\xD0\xB0\xCEmX\xDC#k\xDD\x8B\x8BB\xE1#!v\xF1\xDC\x19\xDD\xD0\xCA\xC9\x8B?B\xD4\xED\xA1\x83\x10\x1F\b\xF0A\xFEMBs'\xF3\xC7\xBC\x87\x9D_n\\z\xB7\xC1\xA5\xDA\xF4s \x99\\\xFD^\x85\x89s\e"
[3] pry(Encoder)> encrypted_message.unpack('H*')
=> ["47cbe1307072731da7d0b0ce6d58dc406bdd8b8b42e1232176f1dc19ddd0cac98b3f42d4eda183101f08f041fe4d427327f3c7bc879d5f6e5c7ab7c1a5daf47320995cfd5e8589731b"]
It seems that the H directive is this:
hex string (high nibble first)
How are the escaped characters in the encrypted_message transformed into letters and numbers?
I think the heart of the issue is that I don't understand this. What is going on?
['A'].pack('H')
=> "\xA0"
Here is a good explanation of Ruby's pack and unpack methods.
According to your question:
> ['A'].pack('H')
=> "\xA0"
A byte consists of 8 bits. A nibble consists of 4 bits, so a byte has two nibbles. The ASCII value of 'h' is 104, and the hex value of 104 is 68. This 68 is stored in two nibbles: the first nibble (4 bits) contains the value 6 and the second nibble contains the value 8. "High nibble first" means that going from left to right we take the 6 first and then the 8.
In the above case the input 'A' is not the ASCII character 'A' but the hex digit A. Why hex? Because the directive 'H' tells pack to treat the input as hex digits. Since 'H' is high nibble first and the input has only one nibble, the second nibble is taken to be zero, so the input effectively changes from ['A'] to ['A0'].
Since the hex value A0 does not translate into a printable ASCII character, the output is left as it is, and hence the result is "\xA0". The leading \x indicates that the value is a hex value.
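As for unpack('H*') on the ciphertext: it simply walks the bytes of the string and writes out the high nibble, then the low nibble, of each byte as a hex digit. Sketched in C rather than Ruby (the helper name is made up for illustration):

    #include <stdio.h>
    #include <stddef.h>

    /* Rough equivalent of unpack('H*'): print every byte as two hex
    ** digits, high nibble first.
    */
    static void hex_dump(const unsigned char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            printf("%x%x", buf[i] >> 4, buf[i] & 0x0f);
        putchar('\n');
    }

    int main(void)
    {
        /* first four bytes of the ciphertext above: 'G', 0xCB, 0xE1, '0' */
        unsigned char msg[] = { 0x47, 0xCB, 0xE1, 0x30 };
        hex_dump(msg, sizeof msg); /* prints 47cbe130 */
        return 0;
    }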

How to represent 11111111 as a byte in java

When I say that 0b11111111 is a byte in Java, it says "cannot convert int to byte," which is because, as I understand it, 11111111 = 255, and bytes in Java are signed and go from -128 to 127. But if a byte is just 8 bits of data, isn't 11111111 8 bits? I know 11111111 could be an integer, but in my situation it must be represented as a byte because it must be sent to a file in byte form. So how do I send a byte with the bits 11111111 to a file (by the way, this is my question)? When I try printing the binary value of -1, I get 11111111111111111111111111111111. Why is that? I don't really understand how signed bytes work.
You need to cast the value to a byte:
byte b = (byte) 0b11111111;
The reason you need the cast is that 0b11111111 is an int literal (with a decimal value of 255) and it's outside the range of valid byte values (-128 to +127).
Java allows hex literals, but not binary. You can declare a byte with the binary value of 11111111 using this:
byte myByte = (byte) 0xFF;
You can use hex literals to store binary data in ints and longs as well.
Edit: you actually can have binary literals in Java 7 and up, my bad.
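As for the 32 ones you see when printing the binary value of -1: that is sign extension. When a signed byte is widened to an int, its sign bit is copied into all of the new high bits, so the 8-bit pattern 11111111 (-1) becomes 32 ones. The same behaviour, sketched in C rather than Java:

    #include <stdio.h>

    int main(void)
    {
        /* The bit pattern 11111111 stored in a signed 8-bit type is -1. */
        signed char b = (signed char) 0xFF;
        printf("%d\n", b);            /* -1: sign-extended when passed as an int */
        printf("%x\n", (unsigned) b); /* ffffffff: all 32 bits set (32-bit int assumed) */
        printf("%x\n", b & 0xFF);     /* ff: mask to see just the original 8 bits */
        return 0;
    }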

How MKI$ and CVI Functions works

I am working in GW-BASIC and want to know how CVI("aa") returns 24929. Does it convert each char to ASCII? But the ASCII codes of "aa" would be 97 97, i.e. 9797.
CVI converts between a GW-BASIC integer and its internal representation in bytes. That internal representation is a 16-bit little-endian signed integer, so that the value you find is the same as ASC("a") + 256*ASC("a"), which is 97 + 256*97, which is 24929.
MKI$ is the opposite operation of CVI, so that MKI$(24929) returns the string "aa".
The 'byte reversal' is a consequence of the little endianness of GW-BASIC's internal representation of integers: the leftmost byte of the representation is the least significant byte, whereas in hexadecimal notation you would write the most significant byte on the left.
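The same round trip, sketched in C (treating the two characters as a little-endian 16-bit signed integer and splitting it back):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* CVI("aa"): reinterpret the two bytes of the string as a
        ** little-endian 16-bit signed integer; 'a' is 97 (0x61).
        */
        unsigned char s[2] = { 'a', 'a' };
        int16_t value = (int16_t)(s[0] | (s[1] << 8));
        printf("%d\n", value); /* 24929 = 97 + 256*97 */

        /* MKI$(24929): split the integer back into its two bytes. */
        printf("%c%c\n", value & 0xff, (value >> 8) & 0xff); /* aa */
        return 0;
    }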

C++ Output a Byte Clarification

From textbook:
So I know a byte has 8 bits and the right bit-shift adds zero bits to the left and pops off bits from the right.
But how is it used in the above example to output a byte?
I would've expected:
putchar(b >> 8)
putchar(b >> 7)
putchar(b >> 6)
etc.
Since I assume putchar outputs the popped-off bits?
putchar prints the ASCII character corresponding to the integer given.
putchar(0x41) converts the integer 0x41 into an unsigned char (with a size of one byte) and prints out the ASCII character corresponding to 0x41 (which is "A").
The key thing to realize here is that putchar only looks at the lower 8 bits, i.e. putchar(0x41) and putchar(0xffffff41) do the same thing.
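The textbook's outbyte function isn't quoted above, but from the walkthrough below it is presumably something along these lines (a sketch, not the book's exact code):

    #include <stdio.h>

    /* Presumed shape of the textbook's outbyte: write a 32-bit value one
    ** byte at a time, most significant byte first. putchar keeps only the
    ** low 8 bits of each argument.
    */
    void outbyte(unsigned int b)
    {
        putchar(b >> 24);
        putchar(b >> 16);
        putchar(b >> 8);
        putchar(b);
    }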
Now let's look at what happens when you pass something to your function above.
outbyte(0x41424344);
first it right-shifts b by 24 bits, and then calls putchar on that value
0x41424344 >> 24; // 0x00000041
putchar(0x00000041); // A
then it right-shifts b by 16 bits, and then calls putchar on that value
0x41424344 >> 16; // 0x00004142
putchar(0x00004142); // B
etc.
Here it is in action: http://ideone.com/3xeFSx

Convert HEX 32 bit from GPS plot on Ruby

I am working with the following hex values representing different readings from a GPS/GPRS plot. All are given as 32-bit integers.
For example:
296767 is the decimal value (unsigned) reported for hex number: 3F870400
Another one:
34.96987500 is the decimal float value (signed), given in radians at a resolution of 10^(-8), reported for hex number: DA4DA303.
What is the process for transforming the hex numbers into their corresponding values in Ruby?
I've already tried unpack/pack with the directives L, H and h. I also tried applying two's complement and converting them to binary and then decimal, with no success.
If you are expecting an Integer value:
input = '3F870400'
output = input.scan(/../).reverse.join.to_i( 16 )
# 296767
If you are expecting degrees:
input = 'DA4DA303'
temp = input.scan(/../).reverse.join.to_i( 16 )
temp = ( (temp & 0x80000000) > 0 ? temp - 0x100000000 : temp ) # Handles negatives
output = temp * 180 / (Math::PI * 10 ** 8)
# 34.9698751282937
Explanation:
The hexadecimal string represents the bytes of an integer stored least-significant-byte first (little-endian). To store it as raw bytes you might use [296767].pack('V'), and if you had the raw bytes in the first place you would simply reverse that with binary_string.unpack('V'). However, you have a hex representation instead. There are a few different approaches you might take (including turning the hex back into bytes and unpacking it), but in the above I have chosen to manipulate the hex string into the most-significant-byte-first form and use Ruby's String#to_i.
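For example, the first value works out like this:

    '3F870400' -> byte pairs 3F 87 04 00 -> reversed 00 04 87 3F -> 0x0004873F = 296767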
