In my text file, I used a character with a value larger than 127, for example 0xDC. Then I loaded that text file onto a device, and when I read the file and that character back, it had changed to the two bytes 0xC3 and 0x9C. How did it change into two characters?
Thanks
Because that's the sequence for the character when encoded in UTF-8:
>>> '\xc3\x9c'.decode('utf-8')
u'\xdc'
From wikipedia:
"UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters."
Related
I know that UTF-8 supports many more characters than Latin-1 (even with the extensions). But are there examples of files that are valid in both, yet whose characters differ? Essentially, files whose content changes depending on which encoding you think the file uses?
I also know that a big chunk of Latin-1 maps 1:1 to the same part of UTF-8. The question is: which code points could change value if interpreted differently (not become invalid, just different)?
Latin-1 is a single-byte encoding (1 character = 1 byte) that uses all 256 possible byte values, so every byte maps to something in Latin-1. That means literally any file is "valid" Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.
So yes: interpret any valid UTF-8 file as Latin-1. It is valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, since both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.
bytes            encoding   text
e6bc a2e5 ad97   UTF-8      漢字
e6bc a2e5 ad97   Latin-1    æ¼¢å 👈 valid but nonsensical
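The same mojibake is easy to reproduce in Python (a quick sketch using the bytes from the table above):
>>> '漢字'.encode('utf-8')
b'\xe6\xbc\xa2\xe5\xad\x97'
>>> '漢字'.encode('utf-8').decode('latin-1')
'æ¼¢å\xad\x97'
The last two bytes decode to a soft hyphen (U+00AD) and a C1 control character (U+0097), which is why only four of the six Latin-1 characters are visible in the table.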
Unicode is - somewhat simplified - a character set, and UTF-8 is one of multiple encodings for the binary representation of Unicode.
ISO-8859-1 is both a character set and an encoding.
At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode codepoint.
At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the ISO-8859-1 characters up to 127. But for the characters between 128 and 255 they differ as UTF-8 needs 2 bytes to represent them.
Example:
Word     ISO-8859-1          UTF-8
Zürich   5a fc 72 69 63 68   5a c3 bc 72 69 63 68
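The same difference is easy to verify in Python (a quick sketch):
>>> 'Zürich'.encode('iso-8859-1')
b'Z\xfcrich'
>>> 'Zürich'.encode('utf-8')
b'Z\xc3\xbcrich'
Only the ü differs: 0xfc in ISO-8859-1 versus the two bytes 0xc3 0xbc in UTF-8.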
I know I need to escape these in all cases:
&quot;   "
&amp;    &
&apos;   '
&lt;     <
&gt;     >
But what about international characters, such as accented letters or Russian (Cyrillic) characters? Do I need to escape characters of this type when my encoding instruction is set to UTF-8?
What if I were to set the encoding instruction to ASCII? Would I need to escape all of those characters as well?
This is a sample of the XML (from a legacy system) that I am trying to reproduce using Nokogiri (libxml2):
<?xml version="1.0" encoding="UTF-8"?>
<DESCRIPTION lang="rus">
<SHORT_DESCRIPTION>МОДУЛЬ- ELECTRONIC OUTPUT 120 V DC 5 mA</SHORT_DESCRIPTION>
<LONG_DESCRIPTION>МОДУЛЬ- ТИП ELECTRONIC OUTPUT ВХОД 120 V DC ВЫХОД 5 mA ИСТОЧНИК ПИТАНИЯ 120 V DC ДОПОЛНИТЕЛЬНАЯ ДЕТАЛЬ 1 ANALOG SM322-8S TOR</LONG_DESCRIPTION>
</DESCRIPTION>
You can see that the instruction in the sample says UTF-8 but they have escaped a lot of characters, characters that Nokogiri only escapes when I specify an ASCII encoding instruction. This is what is confusing me.
EDIT 2: If I do not pass an encoding instruction to Nokogiri, the resulting XML leaves all the Russian characters in their native Cyrillic alphabet, BUT that would not be consistent with the XML I need to replicate.
You only need to represent a character with a character reference if either:
It would have special meaning in the current context (so the five characters you listed only need encoding sometimes)
It does not exist in the character encoding the file is encoded in
ASCII doesn't have many characters in it, so if you encoded your XML in ASCII you would have to use character references for many characters.
Don't encode your XML in ASCII. The default encoding for XML is UTF-8, which is very well supported.
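To illustrate the difference the target encoding makes, here is a Python sketch using the xmlcharrefreplace error handler, which substitutes numeric character references for anything the target encoding cannot represent (the legacy system may well have produced its references with a different tool):
>>> 'МОДУЛЬ'.encode('utf-8')
b'\xd0\x9c\xd0\x9e\xd0\x94\xd0\xa3\xd0\x9b\xd0\xac'
>>> 'МОДУЛЬ'.encode('ascii', 'xmlcharrefreplace')
b'&#1052;&#1054;&#1044;&#1059;&#1051;&#1068;'
Under UTF-8 the Cyrillic letters are simply encoded as bytes; under ASCII every one of them has to become a character reference.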
UTF-8 single-byte characters map exactly to Latin-1 (ISO 8859-1) characters with code points below 128; basically, the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, as in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so basically multibyte characters never contain a byte that could be misinterpreted by a Latin-1 function as a valid ASCII character (because all ASCII characters have the most significant bit unset). But I want to make sure this is true and that I don't screw up on this, because it may have security implications when dealing with data sanitization, which is apparently what I am currently doing.
You are correct in your understanding that only single-byte characters contain any bytes with the most significant bit unset. There is a nice table showing this at: http://en.wikipedia.org/wiki/UTF-8#Description
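A brute-force check in Python supports this (a sketch, not a proof): every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so none of them can collide with an ASCII value.
# Encode a range of non-ASCII code points and confirm that no byte of a
# multi-byte UTF-8 sequence falls in the ASCII range (below 0x80).
for cp in list(range(0x80, 0x800)) + [0x6F22, 0x1F600]:
    encoded = chr(cp).encode('utf-8')
    assert all(byte >= 0x80 for byte in encoded), hex(cp)
print("no multi-byte sequence contains an ASCII-range byte")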
Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error prone!
UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.
If you do want to write a program that detects invalid UTF-8 sequences, it's pretty easy:
Illegal UTF-8 initial sequences
UTF-8 Sequence       Reason for Illegality
10xxxxxx             illegal as initial byte of character (80..BF)
1100000x             illegal, overlong (C0 80..BF)
11100000 100xxxxx    illegal, overlong (E0 80..9F)
11110000 1000xxxx    illegal, overlong (F0 80..8F)
11111000 10000xxx    illegal, overlong (F8 80..87)
11111100 100000xx    illegal, overlong (FC 80..83)
1111111x             illegal; prohibited by spec
Then, provided the first octet is legal, just remember that the number of octets forming a code point can be obtained by counting the number of 1 bits before the first 0 bit.
For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.
The other thing to do is ensure that all continuation octets start with 10.
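A minimal sketch of such a checker in Python (the function name is mine; it follows the table above rather than the full Unicode rules, so it does not reject surrogates or code points above U+10FFFF):
def looks_like_utf8(data: bytes) -> bool:
    """Rough UTF-8 validity check following the table above."""
    overlong = {0xE0: 0x9F, 0xF0: 0x8F, 0xF8: 0x87, 0xFC: 0x83}
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                      # 0xxxxxxx: plain ASCII, one octet
            i += 1
            continue
        if first < 0xC2 or first >= 0xFE:     # stray continuation byte (80..BF),
            return False                      # overlong C0/C1, or prohibited FE/FF
        length = 2                            # octet count = number of leading 1 bits
        while first & (0x80 >> length):
            length += 1
        if i + length > len(data):            # truncated sequence
            return False
        # Every continuation octet must start with the bits 10.
        if any(data[i + k] & 0xC0 != 0x80 for k in range(1, length)):
            return False
        # Reject the remaining overlong forms listed in the table.
        if first in overlong and data[i + 1] <= overlong[first]:
            return False
        i += length
    return True
For example:
>>> looks_like_utf8('Zürich'.encode('utf-8'))
True
>>> looks_like_utf8('Zürich'.encode('latin-1'))
False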
Not sure if this is what you're looking for, but I use a command shell for-loop and dump the first few bytes of each file using my hdump utility, which displays the bytes of the file in hexadecimal form. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.
My hdump utility is available at: http://david.tribble.com/programs.html
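If the BOM check is all you need, a few lines of Python can scan a list of files for it (a sketch; keep in mind that UTF-8 files are not required to carry a BOM, so its absence proves nothing):
import sys

# EF BB BF is the UTF-8 encoding of the byte order mark (U+FEFF).
UTF8_BOM = b'\xef\xbb\xbf'

for path in sys.argv[1:]:
    with open(path, 'rb') as f:
        has_bom = f.read(3) == UTF8_BOM
    print(path, 'starts with a UTF-8 BOM' if has_bom else 'has no UTF-8 BOM')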
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out whether the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
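Once the source encoding is known, the conversion itself is just decode-then-encode. The question is about Oracle, but the idea is easy to show in Python (a sketch, assuming the incoming bytes really are ISO-8859-1):
>>> raw = bytes([0xE0])                  # CHR(224), 'à' in ISO-8859-1
>>> raw.decode('iso-8859-1')
'à'
>>> raw.decode('iso-8859-1').encode('utf-8')
b'\xc3\xa0'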