Enhancing an ASCII protcol with multilingual fields - internationalization

I am enhancing a piece of software that implements a simple ASCII based protocol.
The protocol is simple... here is an example of what the messages look a little bit like (not the same though, I can't show you the real protocol):
AUTH 1 1 200<CR><LF>
To which we get a response looking similar to
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>
The name "Photo Black" comes from a database sqlite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. Just slip in some kind of encoding for clients to recognize some Spanish or Swedish names.
I don't want the field to be always interpreted as UTF-8 either, long story there. You know how in C++ I can type 0xFF and the compiler knows that this is a hex string... is there an equivalent for UTF-8? Sorry I may be jumping the gun but I'm not that familiar with UTF-8 encoding and internationalization in general.

Do you have control over both the server and the client? If not, you can't change the protocol so you won't be able to do it. When you say you're "not wiling to rewrite the protocol" - you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.
I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
You could introduce a prefix for UTF-8-encoded strings, e.g. U:
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>
would that help?
Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.

Read up on the concept of Ascii Compatible Encoding, or ACE. iDNS is an example. So is/was UTF-7.
Here's the master speaking.
You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.

Why don't you want the field to be "always interpreted as UTF-8"? You don't say.
If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.

Related

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know, that the:
"‚" character (0x82) should be originally "é" (é acute accent)
"‹" character (0x8B) should be originally "ő" (o double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple decoding encoding happened, but I was not able to reproduce the original character.
I put an expanded version of my comment as answer:
Your viewer uses CP1252 (English and Western Europe, also called ANSI in Windows) or CP1250 (Eastern Europe) or an other similar code page. Most of characters are coded in the same manner, just few language specific changes. Your example do not includes character that are different on the two encoding, so I cannot say precisely.
That code pages are used on Microsoft Windows, and they are based (but not 100% compatible) with Latin-1, so it is common to see text interpreted with such encoding. MacOs and Linux are heavily (now) UTF-8 encoded. Windows uses Unicode internally (but UTF-16)
The old encoding is probably CP437: the standard code page in DOS, so it was used frequently also for CSV files. Other frequent old encoding are CP850 (Western Europe) and CP852 (Central Europe).
For the other answers you put in the comments, I think you should go to Superuser (if you are requesting tools (some editors allow you to specify the encoding. You may use the browser (opening a local file): browsers also allow you to choose the local encoding, and I think you may copy as Unicode [not sure], other tools sometime has hidden option to import files, but possibly not with all options), or as new question in this site, if you want to do it programmatically. But so you are required to specify the language. Python is well suited for such conversions (most scripting languages are created to handle texts): python has built in many encoding, you should just specify when reading and when writing the files. R also can be instructed on the input encoding.
I wrote my own utility that helped me to diagnose and fix many thorny encoding issues. It is available as part of an Open source library. The utility converts any String to unicode sequence and vise-versa. All you will have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
And it will return String "\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064"
The same would work for any String in any language including special characters. Here is the link to the article Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison that explains about the library and where to get it (available both on Maven central and github. In the article search for paragraph: "String Unicode converter". So when you read your String convert it and see what comes up. This way you will see what symbols are there and if the info is correct and only distorted by some wrong encoding or the info itself is lost. You can easily find info on internet that provides tables of mapping of any symbol to a unicode

Windows encoding clarification

I would like to translate a game, this game loads the strings from a text file.
The destination language uses non-ascii characters, so I naïvely saved my file in utf8, but it does not work as letters with diacritics are not shown correctly.
Studying better in the configuration file where the string text filename is stored, I found a CHARSET option that can assume any of those values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
That as far as I understood are fairly standard values in WinAPIs and charset and character encoding are synonymous.
So my question is, is there a correspondence between this names and standard names like utf8 or iso-8859-2? If it is the case what is it?
Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.
The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.
You can get the the standard encoding names (as registered by IANA) using the table under the remarks section of this MSDN page.
Just find the Character set row and read the Code page number, the standard name is windows-[code page number].

tcl utf-8 characters not displaying properly in ui

Objective : To have multi language characters in the user id in Enovia v6
I am using utf-8 encoding in tcl script and it seems it saves multi language characters properly in the database (after some conversion). But, in ui i literally see the saved information from the database.
While doing the same excercise throuhg Power Web, saved data somehow gets converted back into proper multi language character and displays properly.
Am i missing something while taking tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)

How to verify browser support UTF-8 characters properly?

Is there a way to identify whether the browser encoding is set to/supports "UTF-8" from Javascript?
I want to send "UTF-8" or "English" letters based on browser setting transparently (i.e. without asking the User)
Edit: Sorry I was not very clear on the question. In a Browser the encoding is normally specified as Auto-Detect (or) Western (Windows/ISO-9959-1) (or) Unicode (UTF-8). If the user has set the default to Western then the characters I send are not readable. In this situation I want to inform the user to either set the encoding to "Auto Detect" (or) "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
I assume the length of the output (that you read back after outputting it) can tell you what happened (or, without JavaScript, use the Accept-Charset HTTP header, and assume the UTF-8 encoding is supported when Unicode is accepted).
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi" which surprised me in having this post listed in #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that used by Firefox, and is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.
Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.
My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8. Thus - UTF-16. Just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness if that fits. If either fit (or don't), check which one has larger percentage of alphanumeric codes. Also you might try searchung for line breaks and determine endianness from them. Other than that, I have no ideas how to check for endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. Same is probably possible for UTF-16. But when it's no Unicode set, then it'll be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking validity of UTF chars.

Resources