I have an external system written in Ruby which sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding, or just arbitrarily pick an 8-bit ASCII-superset encoding such as ISO-8859-1. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors, because in such an encoding any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
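For illustration, here is a minimal Scala sketch of both options; the byte values are made up, and ISO-8859-1 is just one possible ASCII-superset choice:

    import java.nio.charset.StandardCharsets

    // Pretend these bytes arrived over the wire from the Ruby system (made-up values).
    val raw = Array[Byte](0x48, 0x69, 0x20, 0xE0.toByte)  // "Hi " plus one non-ASCII byte

    // Option 1: keep it as binary. Just pass the Array[Byte] around (or wrap it in a
    // java.nio.ByteBuffer); never turn it into a String at all.

    // Option 2: treat it as text in an 8-bit ASCII-superset encoding. ISO-8859-1 maps
    // every byte value to a character, so decoding can never fail, although non-ASCII
    // bytes may come out as the "wrong" character.
    val asText: String = new String(raw, StandardCharsets.ISO_8859_1)

    // Re-encoding with the same charset gives back exactly the original bytes.
    val roundTripped = asText.getBytes(StandardCharsets.ISO_8859_1)
    assert(raw.sameElements(roundTripped))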
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character, or any character outside that set, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
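If the string does turn out to carry Base64 data, the JDK's built-in decoder already covers the common variants; a short Scala sketch (the sample payload is made up):

    import java.nio.charset.StandardCharsets
    import java.util.Base64

    // A made-up payload that happens to be valid Base64.
    val received = "aGVsbG8gd29ybGQ="

    // The "basic" alphabet; java.util.Base64 also has getUrlDecoder for the URL-safe
    // variant (- and _ instead of + and /) and getMimeDecoder for MIME line breaks.
    val decoded: Array[Byte] = Base64.getDecoder.decode(received)

    // If the decoded bytes are themselves text, pick the charset the sender actually
    // used; UTF-8 here is only an assumption.
    val text = new String(decoded, StandardCharsets.UTF_8)  // "hello world"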
Related
I'm trying to figure out the term for these types of characters:
\M-C\M-6 (corresponds to German "ö")
\M-C\M-$ (corresponds to German "ä")
\M-C\M^_ (corresponds to German "ß")
I want to know the term for these outputs so that I can easily convert them into the UTF-8 characters they actually are in golang, instead of creating a mapping for each one I come across.
What is the term for these? Unicode? What would be the best way to convert these "characters" to their actual human-readable characters in golang?
It is the vis encoding of UTF-8 encoded text.
Here's an example:
The UTF-8 encoding of the rune ö in bytes is [0303, 0266] (octal, i.e. 0xC3 0xB6).
vis encodes the byte 0303 as the bytes \M-C and the byte 0266 as the bytes \M-6.
Putting the two levels of encoding together, the rune ö is encoded as the bytes \M-C\M-6.
You can either write a decoder using the documentation on the man page or search for a decoding package. The Go standard library does not include such a decoder.
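For illustration, here is a rough sketch of the decoding rule, written in Scala rather than Go (treat it as pseudocode for the algorithm); it handles only the two escape forms shown above, while a full vis(3) decoder also has to deal with other escape forms (control characters, escaped backslashes and so on):

    // \M-X  -> byte (X | 0x80)           (meta)
    // \M^X  -> byte ((X ^ 0x40) | 0x80)  (meta + control)
    def unvis(s: String): Array[Byte] = {
      val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
      var i = 0
      while (i < s.length) {
        if (s.startsWith("\\M-", i) && i + 3 < s.length) {
          out += (s.charAt(i + 3) | 0x80).toByte
          i += 4
        } else if (s.startsWith("\\M^", i) && i + 3 < s.length) {
          out += ((s.charAt(i + 3) ^ 0x40) | 0x80).toByte
          i += 4
        } else {
          out += s.charAt(i).toByte
          i += 1
        }
      }
      out.toArray
    }

    // The recovered bytes are UTF-8, so decode them as such:
    new String(unvis("""\M-C\M-6"""), java.nio.charset.StandardCharsets.UTF_8)  // "ö"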
In almost all examples, a UUID is encoded as UTF-8, for example:
"aa4aaa2c-c6ca-d5f5-b8b2-0b5c78ee2cb7".getBytes(StandardCharsets.UTF_8)
UUID is not ASCII format? Why does everyone encode it as UTF-8?
A UUID is encoded as a 128-bit value (see RFC 4122). Your example is the textual representation, in hexadecimal, of a UUID value.
There is no particular encoding required for UUIDs. I guess UTF-8 is probably used because it is the default encoding for various exchange formats such as JSON.
What do you mean by "UUID is not ASCII format"? A UUID is a 128-bit number, and this is one (ambiguous) way to encode it into a string. Do you mean "why do people use UTF-8 when ASCII is equivalent?" Because it's a good habit to use UTF-8 for most things unless you have a reason not to. When it's equivalent to ASCII, it's the same, so it doesn't matter. When it's not equivalent to ASCII, you usually wanted UTF-8.
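To make the "equivalent to ASCII" point concrete, here is a small Scala/JVM sketch using the UUID string from the question:

    import java.nio.charset.StandardCharsets
    import java.util.UUID

    val uuidText = "aa4aaa2c-c6ca-d5f5-b8b2-0b5c78ee2cb7"

    // The textual form contains only hex digits and '-', all ASCII, so its UTF-8 and
    // US-ASCII byte sequences are identical (36 bytes either way).
    val utf8Bytes  = uuidText.getBytes(StandardCharsets.UTF_8)
    val asciiBytes = uuidText.getBytes(StandardCharsets.US_ASCII)
    utf8Bytes.sameElements(asciiBytes)  // true

    // The non-textual form is just the 128-bit value itself: 16 bytes.
    val uuid  = UUID.fromString(uuidText)
    val bytes = java.nio.ByteBuffer.allocate(16)
      .putLong(uuid.getMostSignificantBits)
      .putLong(uuid.getLeastSignificantBits)
      .array()
    bytes.length  // 16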
I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From StackOverflow, emphasis mine:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a single-byte encoding where each possible byte value maps to a specific character. It's certainly within my power to have my HTTP POST body encoded in this way, but I can't represent any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
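As an aside, a small Scala/JVM sketch of what "can't represent" means in practice for ISO-8859-1 (the sample string is made up); whether the vendor wants such characters URL-encoded or escaped some other way is something only their documentation can settle:

    import java.nio.charset.StandardCharsets

    val text = "Preis: 5€"  // made-up sample; '€' (U+20AC) has no ISO-8859-1 byte

    // String.getBytes silently substitutes '?' (0x3F) for unmappable characters:
    val bytes = text.getBytes(StandardCharsets.ISO_8859_1)
    bytes.last == '?'.toByte  // true: the euro sign is gone

    // A CharsetEncoder lets you detect the problem instead of silently losing data:
    val encoder = StandardCharsets.ISO_8859_1.newEncoder()
    encoder.canEncode(text)  // false
    // encoder.encode(java.nio.CharBuffer.wrap(text)) would throw UnmappableCharacterException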
I would like to know if:
all characters encoded in ANSI (1252) can be converted to UTF-8 without any problem;
not all characters encoded in UTF-8 can be converted to ANSI (1252) without problems (for example, Ǣ cannot be converted to the ANSI encoding).
Could you confirm that this is correct?
Thanks!
Yes, all characters representable in Windows-1252 have Unicode equivalents, and can therefore be converted to UTF-8. See this Wikipedia article for a table showing the mapping to Unicode code points.
And since Windows-1252 is an 8-bit character set, and UTF-8 can represent many thousands of distinct characters, there are obviously plenty of characters representable as UTF-8 and not representable as Windows-1252.
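A quick Scala/JVM illustration of both directions (the two input bytes are chosen arbitrarily; Ǣ is the example character from the question):

    import java.nio.charset.{Charset, StandardCharsets}

    // windows-1252 ships with standard JDKs under this name.
    val win1252 = Charset.forName("windows-1252")

    // Going from windows-1252 to UTF-8 always works: every byte maps to a character.
    val original = Array[Byte](0x80.toByte, 0xE9.toByte)     // "€é" in windows-1252
    val decoded  = new String(original, win1252)             // "€é"
    val utf8     = decoded.getBytes(StandardCharsets.UTF_8)  // 5 bytes: E2 82 AC C3 A9

    // The reverse direction is lossy for anything outside windows-1252:
    "Ǣ".getBytes(win1252)  // Array(0x3F), i.e. the character was replaced by '?'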
Note that the name "ANSI" for the Windows-1252 encoding is strictly incorrect. When it was first proposed, it was intended to be an ANSI standard, but that never happened. Unfortunately, the name stuck. (Microsoft-related documentation also commonly refers to UTF-16 as "Unicode", another misnomer; UTF-16 is one representation of Unicode, but there are others.)
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
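Once the source encoding is known, the conversion itself is mechanical. The question is about Oracle, but the shape of the operation is the same on any platform; here is a Scala/JVM sketch that assumes, purely for illustration, that the file's encoding turned out to be ISO-8859-1:

    import java.nio.charset.StandardCharsets

    // Raw bytes from the file; 0xE0 is 'à' only if the source encoding really is ISO-8859-1.
    val raw = Array[Byte](0x43, 0x48, 0x52, 0x20, 0xE0.toByte)  // "CHR à" under that assumption

    // Step 1: decode with the encoding the bytes were actually written in.
    val text = new String(raw, StandardCharsets.ISO_8859_1)

    // Step 2: re-encode as UTF-8; the single byte 0xE0 becomes the two bytes 0xC3 0xA0.
    val utf8 = text.getBytes(StandardCharsets.UTF_8)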