How can I transcode US-ASCII HTML's e.g. &eacute; to UTF-8's é under Linux?

A quick web search will confirm that US-ASCII is a subset of UTF-8, but what I've not yet found is how to convert &foo; and &#123; to their corresponding native UTF-8 characters.
I know that at least 7-bit US-ASCII is unchanged in UTF-8, but I haven't yet seen a program to filter a file through and convert &foo; to how it would naturally be expressed in UTF-8.

You can use html_entity_decode($s, ENT_QUOTES | ENT_HTML5, "UTF-8") in PHP (the encoding is the third parameter, after the flags) or html.unescape(s) in Python.
https://www.php.net/manual/en/function.html-entity-decode.php
https://docs.python.org/3/library/html.html#html.unescape
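Since the question asks for something to filter files through under Linux, here is a minimal Python sketch built on html.unescape, which handles both named references like &eacute; and numeric ones like &#123;:

```python
import html

# Decode HTML character references -- both named (&eacute;) and numeric
# (&#123; / &#x7B;) -- into the literal characters. Since 7-bit ASCII is a
# subset of UTF-8, writing the result out as UTF-8 is all that's needed.
def decode_entities(text: str) -> str:
    return html.unescape(text)

print(decode_entities("caf&eacute; &#123;brace&#125;"))  # café {brace}
```

To use it as a command-line filter, wrap it with sys.stdin.read() and sys.stdout.write().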

Related

Is there a term for characters that appear in iOS logs like `\M-C\M-6` or `\134`?

I'm trying to figure out the term for these types of characters:
\M-C\M-6 (corresponds to german "ö")
\M-C\M-$ (corresponds to german "ä")
\M-C\M^_ (corresponds to german "ß")
I want to know the term for these outputs so that I can easily convert them into the UTF-8 characters they actually are in Go instead of creating a mapping for each one I come across.
What is the term for these? Unicode? What would be the best way to convert these "characters" to their actual human-readable characters in Go?
It is the vis encoding of UTF-8 encoded text.
Here's an example:
The UTF-8 encoding of the rune ö in bytes is [0303, 0266].
vis encodes the byte 0303 as the bytes \M-C and the byte 0266 as the bytes \M-6.
Putting the two levels of encoding together, the rune ö is encoded as the bytes \M-C\M-6.
You can either write a decoder using the documentation in the man page or search for a decoding package. The Go standard library does not include such a decoder.
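For illustration, here is a minimal sketch of a decoder for the escapes seen above (the \M-c meta, \M^c meta-control, and \ooo octal forms); it is written in Python, but the byte-level logic translates directly to Go, and a full vis(3) decoder handles more cases than this:

```python
import re

# Minimal decoder for vis(3)-style escapes: \M-c (byte with the high bit
# set), \M^c (meta + control character), and \ooo (octal). The decoded
# bytes are then interpreted as UTF-8.
def unvis(s: str) -> str:
    out = bytearray()
    i = 0
    while i < len(s):
        if s.startswith('\\M-', i) and i + 3 < len(s):
            out.append(0x80 | ord(s[i + 3]))          # \M-C -> 0xC3
            i += 4
        elif s.startswith('\\M^', i) and i + 3 < len(s):
            c = s[i + 3]
            out.append(0xFF if c == '?' else 0x80 | (ord(c) & 0x1F))  # \M^_ -> 0x9F
            i += 4
        elif s[i] == '\\' and (m := re.match(r'\\([0-7]{1,3})', s[i:])):
            out.append(int(m.group(1), 8))            # \134 -> 0x5C ('\')
            i += len(m.group(0))
        else:
            out.append(ord(s[i]))
            i += 1
    return out.decode('utf-8')

print(unvis('\\M-C\\M-6'))   # ö
print(unvis('\\M-C\\M^_'))   # ß
```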

Encoding Special Characters for ISO-8859-1 API

I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From StackOverflow, emphasis mine:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a binary encoding format where each possible value of a single byte maps to a specific character. It's certainly within my power to have my HTTP POST body encoded in this way, but not any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
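For what it's worth, "encoding special characters in ISO-8859-1" usually just means encoding the request body with that charset; characters outside its 256-entry repertoire have no byte at all and must be rejected, replaced, or escaped by whatever convention the API itself defines. A small Python illustration (using the Euro symbol from above as the out-of-repertoire example):

```python
import urllib.parse

# Encoding a body as ISO-8859-1 works for any of its 256 characters.
body = "café".encode("iso-8859-1")            # b'caf\xe9'

# Characters outside the repertoire have no ISO-8859-1 byte at all:
try:
    "€100".encode("iso-8859-1")
except UnicodeEncodeError:
    print("€ is not representable in ISO-8859-1")

# URL-encoding can be applied over the ISO-8859-1 bytes, but it cannot
# conjure up bytes for characters the charset lacks -- the API must define
# its own escape for those.
print(urllib.parse.quote("café", encoding="iso-8859-1"))  # caf%E9
```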

ANSI to UTF-8 conversion

I would like to know if:
all characters encoded in ANSI (Windows-1252) can be converted to UTF-8 without any problem.
some characters encoded in UTF-8 cannot be converted to ANSI (Windows-1252) without any problem (example: Ǣ cannot be converted to that encoding).
Could you confirm that this is correct?
Thanks!
Yes, all characters representable in Windows-1252 have Unicode equivalents, and can therefore be converted to UTF-8. See this Wikipedia article for a table showing the mapping to Unicode code points.
And since Windows-1252 is an 8-bit character set, and UTF-8 can represent many thousands of distinct characters, there are obviously plenty of characters representable as UTF-8 and not representable as Windows-1252.
Note that the name "ANSI" for the Windows-1252 encoding is strictly incorrect. When it was first proposed, it was intended to be an ANSI standard, but that never happened. Unfortunately, the name stuck. (Microsoft-related documentation also commonly refers to UTF-16 as "Unicode", another misnomer; UTF-16 is one representation of Unicode, but there are others.)
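Both claims can be checked quickly in Python, where cp1252 is the codec name for Windows-1252 and Ǣ is the example from the question:

```python
# Windows-1252 -> UTF-8 always works: every Windows-1252 character has a
# Unicode code point, so decoding then re-encoding cannot fail.
ansi = "déjà vu – naïve".encode("cp1252")     # the en dash – is byte 0x96
utf8 = ansi.decode("cp1252").encode("utf-8")
print(utf8.decode("utf-8"))                   # déjà vu – naïve

# UTF-8 -> Windows-1252 can fail: Ǣ (U+01E2) has no Windows-1252 byte.
try:
    "Ǣ".encode("cp1252")
except UnicodeEncodeError:
    print("Ǣ is not representable in Windows-1252")
```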

How do pack and unpack guess the character encoding when converting to and from UTF-8?

Suppose I want to convert "\xBD" to UTF-8.
If I use pack & unpack, I'll get ½:
puts "\xBD".unpack('C*').pack('U*') #=> ½
as "\xBD" is ½ in ISO-8859-1.
BUT "\xBD" is œ in ISO-8859-15.
My question is: why did pack use ISO-8859-1 instead of ISO-8859-15 to convert the char to UTF-8? Is there some way to configure that character encoding?
I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.
This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.
The pack receives [189]. The code point 189 is defined in Unicode (which UTF-8 encodes) as ½. This isn't pack "preferring" ISO-8859-1 over other charsets: the first 256 Unicode code points were deliberately made identical to ISO-8859-1, so code point 189 happens to be ½.
Since you're trying to learn more about pack/unpack, let me explain more:
When you unpack with the C directive, Ruby interprets the string as ASCII-8BIT (binary) and extracts the byte values as unsigned integers. In this case \xBD translates to 0xBD, a.k.a. 189. This is a really basic conversion.
When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and encodes it as UTF-8.
pack/unpack have very specific behavior depending on the directives you provide it. I suggest reading up on ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.
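The two steps above are not Ruby-specific; as an illustrative sketch, the equivalent of unpack('C*') followed by pack('U*') in Python is:

```python
raw = b"\xBD"

# unpack('C*'): read each byte as an unsigned integer.
codepoints = list(raw)                        # [189]

# pack('U*'): treat each integer as a Unicode code point. U+00BD is ½,
# because the first 256 code points mirror ISO-8859-1.
text = "".join(chr(c) for c in codepoints)
print(text)                                   # ½
```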

Converting ANSI to UTF8 with Ruby

I have a Ruby script that generates an ANSI file.
I want to convert the file to UTF8.
What's the easiest way to do it?
If your data is within the ASCII range 0 to 0x7F, it's already valid UTF-8, so you don't need to do anything.
If there are characters above 0x7F, you could use Iconv, passing the actual source encoding (the "ANSI" code page, e.g. Windows-1252; note that Iconv.iconv returns an array):
text = Iconv.iconv('UTF-8', 'WINDOWS-1252', text).first
The 8-bit Unicode Transformation Format (UTF-8) was designed to be backwards compatible with the American Standard Code for Information Interchange (ASCII). Therefore, by definition, any valid ASCII sequence is also a valid UTF-8 sequence. For more information, read the UTF FAQ and Unicode FAQ.
Any ASCII file is a valid UTF-8 file, going by your question's title, so no conversion is needed.
