Can Unicode code points vary between platforms (Windows, Unix, Mac OS)?

I read today in a book about Java the author stating about unicode characters (translated):
Character codes are part of extensions that differ from one country or working environment to another. In those extensions, characters are not always defined at the same position in the Unicode table.
The character "é" is defined at position 234 in the Unix Unicode table, but at position 200 in the Mac OS Unicode table. Special characters, and consequently accented characters, therefore don't always have the same Unicode code from one environment to another.
For instance, characters é, è and ê have respectively the following Unicode codes:
Unix: \u00e9 \u00e8 \u00ea
Dos: \u0082 \u008a \u0088
Windows: \u00e9 \u00e8 \u00ea
MAC OS: \u00c8 \u00cb \u00cd
But from my understanding of Unicode, a same character has always the same code point in the Unicode table and there's no such thing as different Unicode tables for different OS. For instance, the character é is always \u00e9 be it on Windows, Mac OS or Unix.
So either I still don't grasp the concept of Unicode, or the author is wrong. But she couldn't have made it all up, so perhaps this was true in the infancy of Unicode?

The author is wrong. You're right, a given character has the same Unicode code point for any correct implementation of Unicode. I seriously doubt that there were multiple representations even at the infancy of Unicode; that would have defeated the whole purpose.
She may be describing non-Unicode character sets such as the various ISO-8859 standards and the Windows code pages such as 1252. Unicode code points in the range 0x80 to 0x9F (decimal 128 to 159) are control characters; some 8-bit character sets have used those codes for accented letters and other symbols.
The character 'é' has the Unicode code point 233 (0xe9). That is invariant. (Are you sure the book said it's 234 in "the Unix Unicode table"?)
There are alternate ways of representing certain characters; for example, 'é' can also be represented as a combination of e (0x65) with a combining acute accent (0x301), but that's not what the author is talking about.
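A quick way to see both forms, sketched here in Python (the same distinction applies in Java or any other Unicode-aware language):

```python
import unicodedata

# Precomposed form: a single code point, U+00E9
nfc = unicodedata.normalize('NFC', 'é')
print([hex(ord(c)) for c in nfc])   # ['0xe9']

# Decomposed form: 'e' (U+0065) plus a combining acute accent (U+0301)
nfd = unicodedata.normalize('NFD', 'é')
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']
```

Both forms denote the same character, and neither depends on the operating system.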
Copying information from comments, the book is in French, and is titled "Le Livre de Java premiere langage", by Anne Tasso; the cited version is the 3rd edition, published in 2005. It's available in PDF format here. (The web site name matches the name of the publisher and copyright holder on the first page, so it appears to be a legitimate copy.)
In the original French:
Le caractère é est défini en position 234 dans la table Unicode d'Unix, alors qu'il est en position 200 dans la table Unicode du système Mac OS. Les caractères spéciaux et, par conséquent, les caractères accentués ne sont pas traités de la même façon d'un environnement à l'autre : un même code Unicode ne correspond pas au même caractère
which, as far as I can tell from my somewhat limited ability to read French, is simply nonsense.
In the quoted table, the representations shown for Unix and Windows are identical, and are consistent with actual Unicode (which makes me think the "234" in the quoted text above is a typo in the book).
There is an 8-bit extended ASCII representation called Mac OS Roman, but it's inconsistent with what's shown in the table (for example 'é' is 0x8E, not 0xC8), and it's clearly not Unicode.
Windows-1252 is a common 8-bit encoding for Windows, and perhaps also for MS-DOS, but it's also inconsistent with anything shown in that table; 'é' is 0xE9, just as it is in Unicode.
I have no idea where the DOS and MacOS entries came from, or where the author got the idea that Unicode code points vary across operating systems.
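As a sketch of where numbers like those might come from, here is what 'é' looks like in a few legacy encodings (in Python; note that 0x82 in CP437 happens to match the book's "Dos" column, though the "MAC OS" column matches nothing I'm aware of):

```python
# The code point of 'é' never changes; only legacy byte encodings differ.
for enc in ('utf-8', 'latin-1', 'cp1252', 'mac_roman', 'cp437'):
    print(enc, 'é'.encode(enc).hex())
# utf-8 c3a9
# latin-1 e9
# cp1252 e9
# mac_roman 8e
# cp437 82
```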
I wonder if it's possible that some old implementations of Java have implemented Unicode incorrectly (though character display would be handled by the OS, not by Java). Even if that were the case, I'd expect that any modern Java implementation would get this right. Java might have problems with characters outside the Basic Multilingual Plane, but that's not relevant for characters like 'é'.

Related

How do I handle encoding for importing foreign movie titles with diacritics into a database?

There may not be one rule to this, but I am trying to find a way to deal with encoding between a PSQL import and retrieving and displaying records in bash or other programs. I am really new to the idea of encoding, so please bear with me! I'm running into issues with the character 'é'. I've gotten the error ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 on import (I believe the default on it is UTF-8) and was able to temporarily fix it by changing the encoding to Windows-1251 on the import. However, trying to retrieve the data in bash gave me the error ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252", so I assumed bash was using 1252 encoding.
I deleted it all and re-imported with WIN1252 encoding, and it worked for both import and retrieval. My concern is whether I may run into issues down the line with displaying this character or trying to retrieve it on a browser. Currently, if I select the movie by id in bash, I get Les MisΘrables. Honestly, not ideal, but it's okay with me if there won't be errors. It did scare me, though, when the query couldn't be completed because of an encoding mismatch.
From my little understanding of UTF-8, I feel like the character 'é' should have been accepted in the first place. I searched online for a character set and that was on it. Can anyone tell me a little more about any of this? My inclination is to go with UTF-8 as it seems the most ubiquitous, but I don't know why it was giving me trouble. Thanks!
EDIT: My complete lack of knowledge surrounding encoding led me to never save the file specifically encoded as UTF-8. That solved it. Thanks to all who looked into this!
Described three different individual problems in the question. Code examples given in Python (commonly comprehensible, I hope).
1st ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 means exactly what it says: the input is not UTF-8. It does, however, decode as Latin-1:
>>> print(bytearray([0xe9,0x72,0x61]).decode('latin1'))
éra
2nd ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252". The source is valid UTF-8 for a Cyrillic letter, one that exists in WIN1251 but not in WIN1252:
>>> print(bytearray([0xd0,0xb9]).decode('UTF-8'))
й
3rd: I get Les MisΘrables (presumably instead of Les Misérables?). That's a flagrant case of mojibake:
>>> print('Les Misérables'.encode('latin1').decode('cp437'))
Les MisΘrables
Finally, here's a recapitulation of the discussed byte values and corresponding characters. Note that the 'CodePoint' column contains both the Unicode code point (U+hhhh) and the UTF-8 bytes:
Char CodePoint Category Description
---- --------- -------- -----------
é {U+00E9, 0xC3,0xA9} LowercaseLetter Latin Small Letter E With Acute
r {U+0072, 0x72} LowercaseLetter Latin Small Letter R
a {U+0061, 0x61} LowercaseLetter Latin Small Letter A
Ð {U+00D0, 0xC3,0x90} UppercaseLetter Latin Capital Letter Eth
¹ {U+00B9, 0xC2,0xB9} OtherNumber Superscript One
й {U+0439, 0xD0,0xB9} LowercaseLetter Cyrillic Small Letter Short I
Θ {U+0398, 0xCE,0x98} UppercaseLetter Greek Capital Letter Theta
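The mojibake in the third case can even be reversed, by re-encoding with the codec that produced the wrong character and decoding with the right one (a sketch; it only works as long as no byte was lost along the way):

```python
garbled = 'Les MisΘrables'
# CP437 turned the Latin-1 byte 0xE9 into 'Θ'; undo that step,
# then decode the recovered bytes as Latin-1
fixed = garbled.encode('cp437').decode('latin1')
print(fixed)  # Les Misérables
```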

Octal, Hex, Unicode

I have a character appearing over the wire with hex value \xb1 (octal \261).
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the extended ASCII table, the character appears to be "±".
What I don't understand:
1. If I try to test the same by passing "±Central Station" in the header, I see it converted to "\xC2\xB1". Why?
2. How can I have "\xB1" or "\261" appear over the wire instead of "\xC2\xB1"?
3. If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 byte sequence.
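That last claim is easy to check (here in Python, for illustration): the lone byte 0xB1 is a UTF-8 continuation byte, so decoding it by itself fails, while a single-byte encoding accepts it happily:

```python
raw = b'\xb1'

# 0xB1 has the bit pattern 10xxxxxx, which UTF-8 only allows
# as a continuation byte, never at the start of a character
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('not valid UTF-8:', exc.reason)   # not valid UTF-8: invalid start byte

print(raw.decode('iso-8859-1'))  # ±
```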
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I'd love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 byte sequence. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
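The same distinction, sketched in Python for illustration (the Ruby escapes map over one-to-one):

```python
ch = '\u00b1'                  # the Unicode code point U+00B1
print(ch)                      # ±
print(ch.encode('utf-8'))      # b'\xc2\xb1' -- the UTF-8 bytes for that code point
print(b'\xc2\xb1'.decode('utf-8') == ch)  # True
```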
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

Strange characters in text saved Brackets

When I close it and reopen it, I get some strange letters and characters, especially where words have accents.
Look at an example:
Este exto es la creación de mí probia autoría
What you see is a byte representation of your string, which is UTF-8. UTF-8 is a multibyte encoding; some characters (e.g. those with accents) are saved as several bytes, which, when misread as a single-byte encoding, usually show up as pairs starting with Ã.
Your application probably doesn't understand that the string is UTF-8 and prints it as byte sequence. You should use a different text editor which will be able to display your UTF-8 text correctly.

Print French in dmg eula osx

I need to write French along with English in my EULA for the software dmg. How do I display an e with an acute accent (é) in the EULA text file so that the disk images utility interprets it correctly? I have included
data 'STR#' (5004, "French"), etc. as well as
data 'styl' (5004, "French SLA") { etc. in my file. This does not seem to help, though.
Currently, I just have it as is with an accent mark (é), but I have tried to use other encodings. Any help would be greatly appreciated!
I am using OSX Mavericks.
Your example was like this. I had to extract it from a comment because it was missing from the question, and I filled in context that was apparently missing.
data 'TEXT' (5002, "English") {
"IMPORTANT-READ CAREFULLY: THE TERMS OF THIS END USER\n"
"Les parties ont demandé que cette convention\n"
"ainsi que tous les documents qui s'y rattachent soient\n"
"rédigés en anglais.\n"
...
};
While plain text may work here as well (I wasn't aware of it until now), what we have in our own file is encoded binary:
data 'TEXT' (5000, "English SLA") {
$"596f 7572 2075 7361 6765 206f 6620 7468"
$"6973 2073 6f66 7477 6172 6520 6973 2067"
...
My guess is that your accented character went into your file as UTF-8. I would figure out what encoding English text uses on OS X (I think it's MacRoman), encode your text in that encoding, and then encode that as hex in this fashion; that should work.
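A sketch of that conversion in Python (`rez_hex_lines` is a hypothetical helper name; Python's `mac_roman` codec implements Mac OS Roman, which I believe is the encoding expected here):

```python
def rez_hex_lines(text, encoding='mac_roman', width=16):
    """Encode text and print it as the $"..." hex lines used in 'TEXT' resources."""
    data = text.encode(encoding)
    for i in range(0, len(data), width):
        chunk = data[i:i + width].hex()
        # group the hex digits two bytes at a time, as in the resource file above
        pairs = ' '.join(chunk[j:j + 4] for j in range(0, len(chunk), 4))
        print(f'$"{pairs}"')

rez_hex_lines("rédigés en anglais.\n")
```

In Mac OS Roman, 'é' encodes to the single byte 0x8E, so it appears as 8e in the emitted hex.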

what kind of keyboard layout can type ISO 8859-1 Characters?

what kind of keyboard layout can type ISO 8859-1 Characters?
Example of what needs to be typed are:-
Ánam àbìa èbèa Ógbuá
First of all: keyboard layouts and character sets are not directly tied to each other. If I type Ü on my keyboard while in a UTF-8 application, the resulting character will be a UTF-8 character. If I type it in an ISO-8859-1 application, it will be a character from that character set.
That said, there isn't a keyboard layout that covers all ISO-8859-1 characters; every country layout covers a part of them.
Full list of characters
According to Wikipedia, ISO-8859-1 covers the following languages' special characters in full:
Afrikaans, Albanian, Basque, Breton, Catalan, English (UK and US), Faroese, Galician, German, Icelandic, Irish (new orthography), Italian, Kurdish (The Kurdish Unified Alphabet), Latin (basic classical orthography), Leonese, Luxembourgish (basic classical orthography), Norwegian (Bokmål and Nynorsk), Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swahili, Swedish, Walloon
so you can safely assume that the keyboard layouts of those countries cover a part of ISO-8859-1.
This is what I have decided to do. Hope it puts somebody else on the right footing.
With special thanks to @Pekka for the patience, guidance and support.
// Replace each marker sequence with the corresponding special character
$phrase = "`U `are ^here tod`ay.";
$search = array("`U", "`a", "^h");
$replace = array("û", "ñ", "à");
$result = str_replace($search, $replace, $phrase);
This could be cleaner in a function, though.
