I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From StackOverflow, emphasis mine:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a single-byte encoding where each possible byte value maps to a specific character. It's certainly within my power to have my HTTP POST body encoded in this way, but it can't represent any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
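For reference, the byte-level conversion itself is straightforward in Go. Below is a minimal sketch assuming the golang.org/x/text/encoding/charmap package; how the vendor wants characters outside Latin-1 "escaped" is a separate question that only their docs can answer.

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Go strings are UTF-8; the "é" here is the two bytes 0xC3 0xA9.
	body := "café"

	// Convert the UTF-8 string to ISO-8859-1 (Latin-1) bytes.
	// A rune with no Latin-1 mapping (e.g. "€") makes the encoder return an error.
	latin1, err := charmap.ISO8859_1.NewEncoder().Bytes([]byte(body))
	if err != nil {
		fmt.Println("not representable in ISO-8859-1:", err)
		return
	}

	fmt.Printf("UTF-8 bytes:      % x\n", body)   // ...c3 a9
	fmt.Printf("ISO-8859-1 bytes: % x\n", latin1) // ...e9
}
```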
Related
I have an external system written in Ruby which sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
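As a quick illustration of that last point (sketched in Go rather than Scala, purely for the sake of example), a hypothetical looksLikeBase64 check can simply try to decode; any string that decodes is necessarily plain ASCII:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// looksLikeBase64 reports whether s decodes as standard Base64.
// Any string that does is, by construction, pure ASCII
// (A-Z, a-z, 0-9, '+', '/' and '=' padding).
func looksLikeBase64(s string) bool {
	_, err := base64.StdEncoding.DecodeString(s)
	return err == nil
}

func main() {
	fmt.Println(looksLikeBase64("aGVsbG8=")) // true: decodes to "hello"
	fmt.Println(looksLikeBase64("grüße"))    // false: contains non-ASCII bytes
}
```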
I'm trying to figure out the term for these types of characters:
\M-C\M-6 (corresponds to german "ö")
\M-C\M-$ (corresponds to german "ä")
\M-C\M^_ (corresponds to german "ß")
I want to know the term for these outputs so that I can easily convert them into the UTF-8 characters they actually are in golang, instead of creating a mapping for each one I come across.
What is the term for these? Unicode? What would be the best way to convert these "characters" to their actual human-readable characters in golang?
It is the vis encoding of UTF-8 encoded text.
Here's an example:
The UTF-8 encoding of the rune ö in bytes is [0303, 0266] (octal; 0xC3 0xB6 in hex).
vis encodes the byte 0303 as the bytes \M-C and the byte 0266 as the bytes \M-6.
Putting the two levels of encoding together, the rune ö is encoded as the bytes \M-C\M-6.
You can either write a decoder using the documentation in the vis(3) man page or search for a decoding package. The Go standard library does not include such a decoder.
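For example, a partial decoder for just the sequences shown above might look like the following sketch (it handles only the \M-, \M^ and \^ forms plus literal backslashes, not the full vis encoding):

```go
package main

import "fmt"

// decodeVis undoes a subset of the vis(1)/vis(3) encoding:
//   \M-X  ->  X | 0x80          (meta)
//   \M^X  -> (X ^ 0x40) | 0x80  (meta + control)
//   \^X   ->  X ^ 0x40          (control)
//   \\    ->  a literal backslash
// Everything else is copied through unchanged.
func decodeVis(s string) []byte {
	var out []byte
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 >= len(s) {
			out = append(out, s[i])
			continue
		}
		switch {
		case s[i+1] == '\\':
			out = append(out, '\\')
			i++
		case s[i+1] == 'M' && i+3 < len(s) && s[i+2] == '-':
			out = append(out, s[i+3]|0x80)
			i += 3
		case s[i+1] == 'M' && i+3 < len(s) && s[i+2] == '^':
			out = append(out, (s[i+3]^0x40)|0x80)
			i += 3
		case s[i+1] == '^' && i+2 < len(s):
			out = append(out, s[i+2]^0x40)
			i += 2
		default:
			out = append(out, s[i])
		}
	}
	return out
}

func main() {
	fmt.Println(string(decodeVis(`\M-C\M-6`))) // ö
	fmt.Println(string(decodeVis(`\M-C\M^_`))) // ß
}
```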
I have a character appearing over the wire with the hex value \xb1 (octal \261).
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table, the character is "±".
What I don't understand:
If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1"?
If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character; I'm assuming that's because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-125X (Windows-125X) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 sequence.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in the header I see it converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming that's because \u00b1 is the Unicode format, but I'd love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
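If it helps, here are the same byte-level facts verified in Go (purely illustrative; the language doesn't matter):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// "±" is code point U+00B1; in UTF-8 it is the two bytes 0xC2 0xB1.
	fmt.Printf("% x\n", "±")                    // c2 b1
	fmt.Println(utf8.Valid([]byte{0xC2, 0xB1})) // true
	// The single byte 0xB1 on its own is not valid UTF-8.
	fmt.Println(utf8.Valid([]byte{0xB1}))       // false
}
```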
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.
I would like to know if:
all characters encoded in ANSI (1252) can be converted to UTF-8 without any problem;
not all characters encoded in UTF-8 can be converted to ANSI (1252) (for example, Ǣ can't be converted to the ANSI encoding).
Could you confirm that this is correct?
Thanks!
Yes, all characters representable in Windows-1252 have Unicode equivalents, and can therefore be converted to UTF-8. See this Wikipedia article for a table showing the mapping to Unicode code points.
And since Windows-1252 is an 8-bit character set, and UTF-8 can represent many thousands of distinct characters, there are obviously plenty of characters representable as UTF-8 and not representable as Windows-1252.
Note that the name "ANSI" for the Windows-1252 encoding is strictly incorrect. When it was first proposed, it was intended to be an ANSI standard, but that never happened. Unfortunately, the name stuck. (Microsoft-related documentation also commonly refers to UTF-16 as "Unicode", another misnomer; UTF-16 is one representation of Unicode, but there are others.)
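To illustrate both directions concretely, here is a small Go sketch using the golang.org/x/text/encoding/charmap package (one of several tools that could do this):

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Windows-1252 bytes for "€é"; both characters have Unicode
	// equivalents, so decoding to UTF-8 succeeds.
	win1252 := []byte{0x80, 0xE9}
	utf8Bytes, err := charmap.Windows1252.NewDecoder().Bytes(win1252)
	fmt.Println(string(utf8Bytes), err) // €é <nil>

	// The reverse direction can fail: Ǣ (U+01E2) has no Windows-1252
	// equivalent, so the encoder reports an error.
	_, err = charmap.Windows1252.NewEncoder().Bytes([]byte("Ǣ"))
	fmt.Println(err != nil) // true
}
```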
UTF-8 single-byte characters map perfectly to Latin-1 (ISO 8859-1) characters (those below character code 128); basically the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, like in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so basically multibyte characters never contain a byte that could be misinterpreted by a Latin-1 function as a valid ASCII character (because all ASCII characters have the most significant bit unset). But I want to make sure this is true and I don't screw up on this, because this may have security implications when dealing with data sanitization - which I am apparently currently doing.
You are correct in your understanding that only single-byte characters contain any bytes with the most significant bit unset. There is a nice table showing this at: http://en.wikipedia.org/wiki/UTF-8#Description
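A quick way to convince yourself of this in Go (used here only as an illustration): every byte of a multi-byte UTF-8 sequence has the high bit set.

```go
package main

import "fmt"

func main() {
	// Every byte of a multi-byte UTF-8 sequence has the most
	// significant bit set, so none of them can be mistaken for ASCII.
	for _, s := range []string{"é", "€", "😀"} {
		for _, b := range []byte(s) {
			fmt.Printf("%s: byte 0x%02X, high bit set: %v\n", s, b, b&0x80 != 0)
		}
	}
}
```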