What are the rules that specify the byte sequence for the BOM? - UTF-8

I'm handling some file encoding stuff. When I was learning about the BOM, I read that the UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF, and then I found the Code page layout, which is a table that contains a lot of character encoding information. What I am curious about is whether there are rules for the BOM byte sequence. I mean, why not use 0xEE, 0xFF, 0xBB or any other byte sequence to represent UTF-8? Thanks in advance.

The BOM is specific to the Unicode UTF (Unicode Transformation Format) encodings. It is the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE, encoded to a byte sequence according to the rules of whichever UTF the text is encoded in, the same as for any other Unicode codepoint. What makes the BOM special is that it is the first encoded codepoint at the front of the encoded text, so you can discover which UTF was used to encode the text if that is not specified out-of-band through other means.
The BOM for UTF-8 is EF BB BF, for UTF-16LE is FF FE, for UTF-32LE is FF FE 00 00, etc. They are all just different representations of the same Unicode codepoint U+FEFF.
Other encodings, like Windows-1252, which you link to, do not use a BOM and cannot encode that particular character, so there is no alternative "Windows-1252 encoding" of a BOM.
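To make this concrete, here is a small Scala sketch (Scala is used for all the examples in this collection; it assumes a standard JVM where the UTF-32LE charset is available) that encodes U+FEFF with several UTFs and prints the resulting BOM bytes:

    import java.nio.charset.{Charset, StandardCharsets}

    // U+FEFF ZERO WIDTH NO-BREAK SPACE as a one-character string
    val bom = "\uFEFF"

    def hex(bytes: Array[Byte]): String =
      bytes.map(b => f"${b & 0xFF}%02X").mkString(" ")

    println(hex(bom.getBytes(StandardCharsets.UTF_8)))        // EF BB BF
    println(hex(bom.getBytes(StandardCharsets.UTF_16LE)))     // FF FE
    println(hex(bom.getBytes(Charset.forName("UTF-32LE"))))   // FF FE 00 00

The byte sequences are not chosen arbitrarily for the BOM; they are simply whatever each UTF produces for the codepoint U+FEFF.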

Related

Are there examples of ISO 8859-1 text files which are valid, but different in UTF-8?

I know that UTF-8 supports way more characters than Latin-1 (even with the extensions). But are there examples of files that are valid in both, yet whose characters differ? So essentially, files whose content changes depending on which encoding you assume they have?
I also know that a big chunk of Latin-1 maps 1:1 to the same part of UTF-8. The question is: which code points could change their value if interpreted differently (not become invalid, just different)?
Latin-1 is a single-byte encoding (meaning 1 character = 1 byte) that uses all possible byte values, so any byte maps to something in Latin-1. That means literally any file is "valid" Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.
So yes, take any valid UTF-8 file and interpret it as Latin-1: it's valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, as both are based on ASCII; but if the UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.
bytes            encoding   text
e6bc a2e5 ad97   UTF-8      漢字
e6bc a2e5 ad97   Latin-1    æ¼¢å­ 👈 valid but nonsensical
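The same round trip can be reproduced on the JVM; a minimal Scala sketch (the exact mojibake you see depends on how your terminal renders the soft hyphen and the C1 control character in the output):

    import java.nio.charset.StandardCharsets

    // Encode the text as UTF-8, then (mis)decode the very same bytes as Latin-1.
    val utf8Bytes = "漢字".getBytes(StandardCharsets.UTF_8)      // e6 bc a2 e5 ad 97
    val asLatin1  = new String(utf8Bytes, StandardCharsets.ISO_8859_1)

    println(asLatin1)  // valid Latin-1, but nonsensical: æ¼¢å plus a soft hyphen and a C1 control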
Unicode is, somewhat simplified, a character set, and UTF-8 is one of several encodings for the binary representation of Unicode.
ISO-8859-1 is both a character set and an encoding.
At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode codepoint.
At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the ISO-8859-1 characters up to 127. But for the characters from 128 to 255 they differ, as UTF-8 needs 2 bytes to represent them.
Example:
Word     ISO-8859-1          UTF-8
Zürich   5a fc 72 69 63 68   5a c3 bc 72 69 63 68
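A minimal Scala sketch that reproduces the table above:

    import java.nio.charset.StandardCharsets

    def hex(bytes: Array[Byte]): String =
      bytes.map(b => f"${b & 0xFF}%02x").mkString(" ")

    println(hex("Zürich".getBytes(StandardCharsets.ISO_8859_1)))  // 5a fc 72 69 63 68
    println(hex("Zürich".getBytes(StandardCharsets.UTF_8)))       // 5a c3 bc 72 69 63 68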

How to decode/encode ASCII_8BIT in Scala?

I have an external system written in Ruby which sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are treated as special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
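As a sketch of the second option in Scala (picking ISO-8859-1 as the arbitrary ASCII superset is just an assumption for illustration; any single-byte superset of ASCII behaves the same way):

    import java.nio.charset.StandardCharsets

    // Bytes received from the Ruby side; on the JVM they are just an Array[Byte].
    val raw: Array[Byte] = Array(0x68, 0x65, 0x6C, 0x6C, 0x6F, 0xFF).map(_.toByte)

    // Decoding with an 8-bit ASCII superset never fails, because every byte is a
    // valid character; non-ASCII bytes may simply come out as the "wrong" character.
    val text = new String(raw, StandardCharsets.ISO_8859_1)

    // Optionally replace anything outside ASCII with question marks.
    val asciiOnly = text.map(c => if (c < 128) c else '?')
    println(asciiOnly)  // hello?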
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
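For example, once such a string has crossed the wire as plain bytes, decoding the Base64 payload on the JVM needs nothing beyond java.util.Base64 (a sketch, assuming the payload really is standard Base64 rather than the URL-safe or MIME variants, which have their own decoders):

    import java.nio.charset.StandardCharsets
    import java.util.Base64

    val received = "aGVsbG8="                              // Base64 is always plain ASCII
    val decoded: Array[Byte] = Base64.getDecoder.decode(received)
    println(new String(decoded, StandardCharsets.UTF_8))   // hello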

Encoding Special Characters for ISO-8859-1 API

I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From StackOverflow:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a binary encoding format where each possible value of a single byte maps to a specific character. It's certainly within my power to have my HTTP POST body encoded this way, but it cannot represent any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
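The question as asked is about Go, but the charset behavior itself is language-independent. As a sketch (written in Scala only to match the other examples in this collection), this is what happens when a character outside ISO-8859-1, such as the Euro sign, is encoded: it is either silently replaced or reported as an error, so it has to be escaped or substituted by whatever mechanism the API defines before the body is built:

    import java.nio.CharBuffer
    import java.nio.charset.{Charset, CodingErrorAction}

    val latin1 = Charset.forName("ISO-8859-1")

    // String.getBytes silently replaces unmappable characters with '?' ...
    println("€10".getBytes(latin1).map(b => f"${b & 0xFF}%02x").mkString(" "))  // 3f 31 30

    // ... while an explicit encoder can be told to report them as errors instead.
    val strict = latin1.newEncoder().onUnmappableCharacter(CodingErrorAction.REPORT)
    val ok = strict.encode(CharBuffer.wrap("café"))    // fine: "é" (U+00E9) is in ISO-8859-1
    // strict.encode(CharBuffer.wrap("€10"))           // would throw UnmappableCharacterException

Whether URL-encoding is the right escape mechanism depends on what the vendor's API actually expects, which their documentation would have to spell out.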

Non-UTF-8 HTML encoding for French characters

To use the UTF-8 charset tag I must save the HTML in Unicode. I need a working charset tag that lets me save the HTML file as plain text and still show the French characters, just like ISO-8859-1 does for German.
UTF-8 and ISO-8859-1 are just different byte encodings of Unicode characters (ISO-8859-1 can only represent the first 256 of them). If you want to use UTF-8 encoded byte octets in your HTML, you have to save the HTML in the UTF-8 encoding. Just like if you want to use ISO-8859-1 encoded byte octets, you have to save the HTML in the ISO-8859-1 encoding.
Otherwise, use HTML entities in &<name>; or &#<codepoint>; format for non-ASCII Unicode characters, instead of raw byte octets. Many Unicode codepoints have reserved entity names (see the W3C Character Entity Reference Chart), otherwise you can use the actual Unicode numeric codepoint value instead.
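If you go the entity route, the escaping is mechanical. A minimal sketch in Scala (it handles only characters in the Basic Multilingual Plane, which covers French accented letters; characters outside the BMP would need surrogate-pair handling):

    // Replace every non-ASCII character with a numeric character reference,
    // so the HTML file itself contains only plain ASCII bytes.
    def toNumericEntities(text: String): String =
      text.flatMap(c => if (c < 128) c.toString else s"&#${c.toInt};")

    println(toNumericEntities("Ça va très bien"))  // &#199;a va tr&#232;s bien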

Can UTF-8 encoded data misread as Latin-1 produce ASCII artifacts?

UTF-8 single-byte characters map perfectly to Latin-1 (ISO 8859-1) characters (those below character code 128); basically the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, as in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so basically multibyte characters never contain a byte that could be misinterpreted by a Latin-1 function as a valid ASCII character (because all ASCII characters have the most significant bit unset). But I want to make sure this is true and that I don't screw up on this, because it may have security implications when dealing with data sanitization - which I am apparently currently doing.
You are correct in your understanding that only single-byte characters contain any bytes with the most significant bit unset. There is a nice table showing this at: http://en.wikipedia.org/wiki/UTF-8#Description
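A quick way to convince yourself of this on the JVM is a small Scala check (an illustration, not a proof):

    import java.nio.charset.StandardCharsets

    // 2-, 3- and 4-byte UTF-8 sequences: every byte has the most significant bit set,
    // so none of them can be mistaken for an ASCII byte.
    val multiByte = "é漢🎉".getBytes(StandardCharsets.UTF_8)
    println(multiByte.map(b => f"${b & 0xFF}%02x").mkString(" "))  // c3 a9 e6 bc a2 f0 9f 8e 89
    println(multiByte.forall(b => (b & 0x80) != 0))                // true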
