I need to write French along with English in my EULA for the software DMG. How do I display an e with an acute accent (é) in the EULA text file so that the disk image utility interprets it correctly? I have included
data 'STR#' (5004, "French"), etc., as well as
data 'styl' (5004, "French SLA") { etc. in my file. This does not seem to help, though.
Currently I just have it as-is, with the accented character (é), but I have also tried other encodings. Any help would be greatly appreciated!
I am using OS X Mavericks.
Your example was like this (I had to extract it from a comment because it was missing from the question, and I filled in some context that was apparently missing):
data 'TEXT' (5002, "English") {
"IMPORTANT-READ CAREFULLY: THE TERMS OF THIS END USER\n"
"Les parties ont demandé que cette convention\n"
"ainsi que tous les documents qui s'y rattachent soient\n"
"rédigés en anglais.\n"
...
};
While plain text may work here as well (I wasn't aware of it until now), what we have in our own project is hex-encoded binary:
data 'TEXT' (5000, "English SLA") {
$"596f 7572 2075 7361 6765 206f 6620 7468"
$"6973 2073 6f66 7477 6172 6520 6973 2067"
...
My guess is that your accented character went into your file as UTF-8. I would find out what encoding English uses on OS X (I think it's MacRoman), encode your text in that encoding, and then encode that as hex in this fashion; that should work.
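As a sketch of that pipeline in Python (the helper name and the 16-bytes-per-line layout are my own choices, and this assumes the 'TEXT' resource really does expect MacRoman):

```python
# Sketch: encode EULA text as MacRoman and emit Rez-style $"..." hex lines.
def to_rez_hex(text):
    raw = text.encode("mac_roman")        # 'é' becomes the single byte 0x8E
    lines = []
    for i in range(0, len(raw), 16):      # 16 bytes per line, as in the example
        chunk = raw[i:i + 16]
        words = [chunk[j:j + 2].hex() for j in range(0, len(chunk), 2)]
        lines.append('$"%s"' % " ".join(words))
    return "\n".join(lines)

print(to_rez_hex("rédigés en anglais.\n"))
```

The output can be pasted directly into a `data 'TEXT' (…) { … };` block in place of literal strings.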
There may not be one rule for this, but I am trying to find a way to deal with encodings between a PostgreSQL import and retrieving and displaying records in bash or other programs. I am really new to the idea of encoding, so please bear with me! I'm running into issues with the character 'é'. I got the error ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 on import (I believe the default is UTF-8) and was able to fix it temporarily by changing the encoding to Windows-1251 on the import. However, trying to retrieve the data in bash gave me the error ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252", so I assumed bash was using the 1252 encoding.
I deleted it all and re-imported with WIN1252 encoding, and it worked for both import and retrieval. My concern is whether I may run into issues down the line with displaying this character or trying to retrieve it on a browser. Currently, if I select the movie by id in bash, I get Les MisΘrables. Honestly, not ideal, but it's okay with me if there won't be errors. It did scare me, though, when the query couldn't be completed because of an encoding mismatch.
From my little understanding of UTF-8, I feel like the character 'é' should have been accepted in the first place. I searched online for a character set and that was on it. Can anyone tell me a little more about any of this? My inclination is to go with UTF-8 as it seems the most ubiquitous, but I don't know why it was giving me trouble. Thanks!
EDIT: My complete lack of knowledge surrounding encoding led me to never save the file specifically encoded as UTF-8. That solved it. Thanks to all who looked into this!
The question describes three distinct problems; I address each below. Code examples are given in Python (commonly comprehensible, I hope).
1st: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 means exactly what it says: the input is not UTF-8. Those bytes are Latin-1:
>>> print(bytearray([0xe9,0x72,0x61]).decode('latin1'))
éra
2nd: ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252". The source is valid UTF-8, but it decodes to a Cyrillic character (a leftover from the WIN1251 import) that cannot be represented in WIN1252:
>>> print(bytearray([0xd0,0xb9]).decode('UTF-8'))
й
3rd: I get Les MisΘrables (presumably instead of Les Misérables?). That's a flagrant case of mojibake:
>>> print('Les Misérables'.encode('latin1').decode('cp437'))
Les MisΘrables
Finally, here's a recap of the byte values discussed and their corresponding characters. Note that the 'CodePoint' column contains both the Unicode code point (U+hhhh) and the UTF-8 bytes:
Char CodePoint Category Description
---- --------- -------- -----------
é {U+00E9, 0xC3,0xA9} LowercaseLetter Latin Small Letter E With Acute
r {U+0072, 0x72} LowercaseLetter Latin Small Letter R
a {U+0061, 0x61} LowercaseLetter Latin Small Letter A
Ð {U+00D0, 0xC3,0x90} UppercaseLetter Latin Capital Letter Eth
¹ {U+00B9, 0xC2,0xB9} OtherNumber Superscript One
й {U+0439, 0xD0,0xB9} LowercaseLetter Cyrillic Small Letter Short I
Θ {U+0398, 0xCE,0x98} UppercaseLetter Greek Capital Letter Theta
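For what it's worth, the table can be reproduced and checked with Python's standard unicodedata module (a sketch; the output layout is my own):

```python
# Sketch: verify the table above using the standard unicodedata module.
import unicodedata

for ch in "éraÐ¹йΘ":
    utf8 = ",".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    print(f"{ch}  U+{ord(ch):04X}  {{{utf8}}}  {unicodedata.name(ch)}")
```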
A quick web search will confirm that US-ASCII is a subset of UTF-8, but what I've not yet found is how to convert entities like &foo; to their corresponding native UTF-8 characters.
I know that at least 7-bit US-ASCII is unchanged in UTF-8, but I haven't yet seen a program that filters text and converts &foo; entities to how they would naturally be expressed in UTF-8.
You can use html_entity_decode(s, "UTF-8") in PHP or html.unescape(s) in Python.
https://www.php.net/manual/en/function.html-entity-decode.php
https://docs.python.org/3/library/html.html#html.unescape
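For instance, in Python (the sample string is my own):

```python
# Sketch: decode HTML entities (named and numeric) to native characters.
import html

s = "caf&eacute; &amp; cr&egrave;me &#233;clair"
print(html.unescape(s))   # café & crème éclair
```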
When I close it and reopen it, I get some strange characters, especially where words have accents.
Look at an example:
Este exto es la creación de mà probia autorÃa
What you see is the byte representation of your string, which is UTF-8. UTF-8 is a multibyte encoding, which means that some characters (e.g. those with accents) are saved as several bytes; misread one byte at a time, those sequences typically show up starting with Ã.
Your application probably doesn't understand that the string is UTF-8 and prints it as a byte sequence. You should use a text editor that can display your UTF-8 text correctly.
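A minimal sketch of the mechanism (the sample sentence approximates the one in the question):

```python
# Sketch: UTF-8 bytes misread as Latin-1 produce the Ã-style garbage above.
s = "Este texto es la creación de mi propia autoría"
mojibake = s.encode("utf-8").decode("latin1")
print(mojibake)
```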
I have to print Ruby strings, UTF-8 encoded and containing Italian sentences, to an ESC/POS thermal printer (a printer that accepts only a single-byte ASCII-8BIT charset: http://maxdb.sap.com/doc/7_6/ca/bd35406ee32e34e10000000a155106/content.htm).
BTW, I use Ruby 2.x (on Windows or Linux). I'm confused about how to transcode.
For example, say the string is
contained in a UTF-8-encoded JSON document on a remote server,
or contained in a template file such as:
#!/bin/env ruby
# encoding: utf-8
string = "Però non è la città di Barnabù"
I have to translate the string (with accented/internationalized multibyte characters) into single bytes (ASCII-8BIT encoded).
Any suggestion on how to do the translation FROM UTF-8 TO ASCII-8BIT?
I got lost with methods like .force_encoding('ASCII-8BIT') or .encode(...).
EDIT:
many thanks
giorgio
I found the solution:
string.encode "IBM437"
As said in a comment, I re-read the Epson TM-T20 printer manual; it is set to the default "code table" number 0, meaning "PC437: USA, Standard Europe".
In fact, I understand PC437 refers to 'Code Page 437', see:
en.wikipedia.org/wiki/Code_page_437
So I found the interesting gem:
github.com/ConradIrwin/encoding-codepage
showing that Code Page 437 corresponds to the charset "IBM437".
Now the printer prints correctly!
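For illustration, the same transcoding can be checked in Python, where Code Page 437 is the codec "cp437" (the equivalent of Ruby's "IBM437"):

```python
# Sketch: transcode a UTF-8 string to single-byte Code Page 437, as the
# printer expects. Python calls the codec "cp437"; Ruby calls it "IBM437".
s = "Però non è la città di Barnabù"
raw = s.encode("cp437")      # one byte per character, printer-ready
print(len(s), len(raw))      # equal: every character maps to a single byte
```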
I'm answering myself here in the hope it helps others.
giorgio
I read today, in a book about Java, the author stating the following about Unicode characters (translated):
Codes of characters are part of extensions that differ from one country or working environment to another. In those extensions, the characters are not always defined at the same position in the Unicode table.
The character "é" is defined at position 234 in the Unix Unicode table, but at position 200 in the Mac OS Unicode table. The special characters, and consequently the accented characters, don't always have the same Unicode code from one environment to another.
For instance, characters é, è and ê have respectively the following Unicode codes:
Unix: \u00e9 \u00e8 \u00ea
Dos: \u0082 \u008a \u0088
Windows: \u00e9 \u00e8 \u00ea
MAC OS: \u00c8 \u00cb \u00cd
But from my understanding of Unicode, a same character has always the same code point in the Unicode table and there's no such thing as different Unicode tables for different OS. For instance, the character é is always \u00e9 be it on Windows, Mac OS or Unix.
So either I still don't grasp the concept of Unicode, or the author is wrong. But still, he couldn't have made it all up, so perhaps this was true in the infancy of Unicode?
The author is wrong. You're right, a given character has the same Unicode code point for any correct implementation of Unicode. I seriously doubt that there were multiple representations even at the infancy of Unicode; that would have defeated the whole purpose.
She may be describing non-Unicode character sets such as the various ISO-8859 standards and the Windows code pages such as 1252. Unicode code points in the range 0x80 to 0x9F (decimal 128 to 159) are control characters; some 8-bit character sets have used those codes for accented letters and other symbols.
The character 'é' has the Unicode code point 233 (0xE9). That is invariant. (Are you sure the book said it's 234 in "the Unix Unicode table"?)
There are alternate ways of representing certain characters; for example, 'é' can also be represented as a combination of e (0x65) with a combining acute accent (0x301), but that's not what the author is talking about.
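That distinction between the precomposed and combining forms can be seen directly in Python (a sketch using the standard unicodedata module):

```python
# Sketch: precomposed vs. decomposed forms of 'é', and NFC normalization.
import unicodedata

precomposed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```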
Copying information from comments, the book is in French, and is titled "Le Livre de Java premiere langage", by Anne Tasso; the cited version is the 3rd edition, published in 2005. It's available in PDF format here. (The web site name matches the name of the publisher and copyright holder on the first page, so it appears to be a legitimate copy.)
In the original French:
Le caractère é est défini en position 234 dans la table Unicode d’Unix,
alors qu’il est en position 200 dans la table Unicode du système Mac
OS. Les caractères spéciaux et, par conséquent, les caractères
accentués ne sont pas traités de la même façon d’un environnement à
l’autre : un même code Unicode ne correspond pas au même caractère
which, as far as I can tell from my somewhat limited ability to read French, is simply nonsense.
In the quoted table, the representations shown for Unix and Windows are identical and consistent with actual Unicode (which makes me think the "234" in the text above is a typo in the book).
There is an 8-bit extended ASCII representation called Mac OS Roman, but it's inconsistent with what's shown in the table (for example 'é' is 0x8E, not 0xC8), and it's clearly not Unicode.
Windows-1252 is a common 8-bit encoding for Windows, and perhaps also for MS-DOS, but it's also inconsistent with anything shown in that table; 'é' is 0xE9, just as it is in Unicode.
I have no idea where the DOS and MacOS entries came from, or where the author got the idea that Unicode code points vary across operating systems.
I wonder if it's possible that some old implementations of Java have implemented Unicode incorrectly (though character display would be handled by the OS, not by Java). Even if that were the case, I'd expect that any modern Java implementation would get this right. Java might have problems with characters outside the Basic Multilingual Plane, but that's not relevant for characters like 'é'.