I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84". That's three bytes; however, none of the encodings described on the page I was reading uses three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes.
I'm basically trying to match these three bytes to an actual character. Any suggestion on what encoding it could be?
>>> c='\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的
Though the code point (U+7684) fits in 16 bits, UTF-8 breaks it down into 3 bytes.
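For reference, the same check in Python 3 (where strings are Unicode by default) might look like this:

c = b"\xe7\x9a\x84".decode("utf-8")
print(c)                           # 的
print(hex(ord(c)))                 # 0x7684, i.e. code point U+7684
print(len("的".encode("utf-8")))   # 3 -- UTF-8 spends three bytes on this character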
The header of a Wikipedia page includes this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
So the page is UTF-8.
The example you give is an IRI.
IRIs use the UTF-8 encoding. UTF-8 is an encoding of Unicode, and in Unicode each character has a code point, which for the common Chinese characters lies between U+4E00 and U+9FFF (a 16-bit range).
But UTF-8 doesn't encode characters by just storing their code point (UTF-32 does that). Instead, it uses a more complex, variable-length scheme that makes every Chinese ideogram in that range three bytes long.
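A short Python sketch of both halves of that: the percent-escapes in the URL are just the UTF-8 bytes, and code points in the U+0800..U+FFFF range always follow the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx:

from urllib.parse import quote, unquote

print(quote("的"))            # %E7%9A%84 -- the percent-encoded UTF-8 bytes
print(unquote("%E7%9A%84"))   # 的

# Hand-roll the three-byte pattern for U+7684:
cp = ord("的")                      # 0x7684
b1 = 0xE0 | (cp >> 12)              # leading byte carries the top 4 bits
b2 = 0x80 | ((cp >> 6) & 0x3F)      # continuation byte: middle 6 bits
b3 = 0x80 | (cp & 0x3F)             # continuation byte: low 6 bits
assert bytes([b1, b2, b3]) == "的".encode("utf-8")   # b'\xe7\x9a\x84'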
Related
I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From Stack Overflow:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a single-byte encoding in which each possible byte value maps to a specific character. It's certainly within my power to have my HTTP POST body encoded this way, but not with any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
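To make the limitation concrete, here's a quick sketch (in Python, though my client is in Go; the codec names are Python's):

print("café".encode("iso-8859-1"))   # b'caf\xe9' -- é has an ISO-8859-1 byte

try:
    "€100".encode("iso-8859-1")      # the euro sign (U+20AC) has no ISO-8859-1 byte at all
except UnicodeEncodeError as e:
    print(e)                         # must escape or replace it before encoding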
I would like to know if:
all characters encoded in ANSI (1252) can be converted to UTF-8 without any problem;
some characters encoded in UTF-8 cannot be converted to ANSI (1252) (for example, Ǣ cannot be converted to the ANSI encoding).
Could you confirm that this is correct?
Thanks!
Yes, all characters representable in Windows-1252 have Unicode equivalents, and can therefore be converted to UTF-8. See this Wikipedia article for a table showing the mapping to Unicode code points.
And since Windows-1252 is an 8-bit character set, and UTF-8 can represent many thousands of distinct characters, there are obviously plenty of characters representable as UTF-8 and not representable as Windows-1252.
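A quick Python check of both directions, for instance:

# Forward: every assigned Windows-1252 byte decodes to a Unicode character,
# which can then always be encoded as UTF-8.
for b in range(256):
    try:
        ch = bytes([b]).decode("cp1252")   # bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are unassigned
    except UnicodeDecodeError:
        continue
    ch.encode("utf-8")                     # never raises

# Reverse: plenty of characters have no Windows-1252 byte, e.g. Ǣ (U+01E2).
try:
    "Ǣ".encode("cp1252")
except UnicodeEncodeError as e:
    print(e)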
Note that the name "ANSI" for the Windows-1252 encoding is strictly incorrect. When it was first proposed, it was intended to be an ANSI standard, but that never happened. Unfortunately, the name stuck. (Microsoft-related documentation also commonly refers to UTF-16 as "Unicode", another misnomer; UTF-16 is one representation of Unicode, but there are others.)
I know I need to escape these in all cases:
quot "
amp &
apos '
lt <
gt >
But what about international characters that have accents, or Russian characters, to name a couple? Do I need to escape characters of this type when my encoding declaration is set to UTF-8?
What if I were to set the encoding declaration to ASCII? Would I need to escape all those characters as well?
This is a sample of the XML (from a legacy system) I am trying to reproduce using Nokogiri (libxml2):
<?xml version="1.0" encoding="UTF-8"?>
<DESCRIPTION lang="rus">
<SHORT_DESCRIPTION>&#1052;&#1054;&#1044;&#1059;&#1051;&#1068;- ELECTRONIC OUTPUT 120 V DC 5 mA</SHORT_DESCRIPTION>
<LONG_DESCRIPTION>&#1052;&#1054;&#1044;&#1059;&#1051;&#1068;- &#1058;&#1048;&#1055; ELECTRONIC OUTPUT &#1042;&#1061;&#1054;&#1044; 120 V DC &#1042;&#1067;&#1061;&#1054;&#1044; 5 mA &#1048;&#1057;&#1058;&#1054;&#1063;&#1053;&#1048;&#1050; &#1055;&#1048;&#1058;&#1040;&#1053;&#1048;&#1071; 120 V DC &#1044;&#1054;&#1055;&#1054;&#1051;&#1053;&#1048;&#1058;&#1045;&#1051;&#1068;&#1053;&#1040;&#1071; &#1044;&#1045;&#1058;&#1040;&#1051;&#1068; 1 ANALOG SM322-8S TOR</LONG_DESCRIPTION>
</DESCRIPTION>
You can see that the declaration in the sample says UTF-8, but they have escaped a lot of characters, characters that Nokogiri only escapes when I specify an ASCII encoding. This is what is confusing me.
EDIT 2: If I do not pass an encoding declaration to Nokogiri, the resulting XML leaves all the Russian characters in their native Cyrillic alphabet, BUT that would not be consistent with the XML I need to replicate.
You only need to represent a character with a character reference if either:
It would have special meaning in the current context (so the five characters you listed only need encoding sometimes)
It does not exist in the character encoding the file is encoded in
ASCII doesn't have many characters in it, so if you encoded your XML in ASCII you would have to use character references for many characters.
Don't encode your XML in ASCII. The default encoding for XML is UTF-8, which is very well supported.
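That said, if you do have to replicate the legacy output byte for byte, the escaping itself is mechanical. As an illustration in Python (not Nokogiri), serializing to ASCII with the xmlcharrefreplace error handler yields the same style of decimal character references:

text = "МОДУЛЬ- ELECTRONIC OUTPUT 120 V DC 5 mA"
element = "<SHORT_DESCRIPTION>{}</SHORT_DESCRIPTION>".format(text)

# Anything outside ASCII becomes a numeric character reference:
print(element.encode("ascii", errors="xmlcharrefreplace").decode("ascii"))
# <SHORT_DESCRIPTION>&#1052;&#1054;&#1044;&#1059;&#1051;&#1068;- ELECTRONIC OUTPUT 120 V DC 5 mA</SHORT_DESCRIPTION>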
I've got a file full of codes like &Atilde;&copy;, and I want to "translate" them into normal characters (the whole file, I mean). How can I do it?
Thank you very much in advance.
Looks like you originally had a UTF-8 file which has been interpreted as an 8-bit encoding (e.g. ISO-8859-15) and entity-encoded. I say this because the sequence 0xC3 0xA9 looks like a pretty plausible UTF-8 byte sequence.
You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
&Atilde;&copy; would be decoded as the byte sequence 0xC3 0xA9
0xC3A9 = 11000011 10101001 in binary
the leading 110 in the first octet tells us this could be a UTF-8 two-byte sequence, and since the second octet starts with 10, we're indeed looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet and the last 6 bits of the second octet...
So, interpreted as UTF-8, it's 00011101001 = 0xE9 = é (LATIN SMALL LETTER E WITH ACUTE)
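You can verify the arithmetic directly, for example in Python:

raw = bytes([0xC3, 0xA9])
decoded = raw.decode("utf-8")
print(decoded)             # é
print(hex(ord(decoded)))   # 0xe9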
You mention wanting to handle this with PHP, something like this might do it for you:
//to load from a file, use
//$file=file_get_contents("/path/to/filename.txt");
//example below uses a literal string to demonstrate technique...
$file="&Précédent is a French word";
$utf8=html_entity_decode($file);
$iso8859=utf8_decode($utf8);
//$utf8 contains "Précédent is a French word" in UTF-8
//$iso8859 contains "Précédent is a French word" in ISO-8859-1
UTF-8 single-byte characters map perfectly to Latin-1 (ISO 8859-1) characters (those below character code 128); basically, the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by the Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, like in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so multibyte characters basically never contain a byte that could be misinterpreted by a Latin-1 function as a valid ASCII character (because every byte of a multibyte sequence has the most significant bit set, while ASCII characters have it unset). But I want to make sure this is true and that I don't screw up on this, because it may have security implications when dealing with data sanitization, which I am apparently currently doing.
You are correct in your understanding that only single-byte characters contain any bytes with the most significant bit unset. There is a nice table showing this at http://en.wikipedia.org/wiki/UTF-8#Description
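A quick check, sketched in Python, makes the same point: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none of them can collide with an ASCII byte.

for ch in "é的€𝄞":             # 2-, 3-, 3- and 4-byte UTF-8 examples
    encoded = ch.encode("utf-8")
    assert all(b >= 0x80 for b in encoded), ch
    print("U+{:04X}: {}".format(ord(ch), encoded.hex()))
# U+00E9: c3a9
# U+7684: e79a84
# U+20AC: e282ac
# U+1D11E: f09d849e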