Windows encoding clarification

Windows encoding clarification - winapi

I would like to translate a game, this game loads the strings from a text file.
The destination language uses non-ascii characters, so I naïvely saved my file in utf8, but it does not work as letters with diacritics are not shown correctly.
Studying better in the configuration file where the string text filename is stored, I found a CHARSET option that can assume any of those values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
That as far as I understood are fairly standard values in WinAPIs and charset and character encoding are synonymous.
So my question is, is there a correspondence between this names and standard names like utf8 or iso-8859-2? If it is the case what is it?

Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.

The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.

You can get the the standard encoding names (as registered by IANA) using the table under the remarks section of this MSDN page.
Just find the Character set row and read the Code page number, the standard name is windows-[code page number].

Related

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know, that the:
"‚" character (0x82) should be originally "é" (é acute accent)
"‹" character (0x8B) should be originally "ő" (o double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple decoding encoding happened, but I was not able to reproduce the original character.

I put an expanded version of my comment as answer:
Your viewer uses CP1252 (English and Western Europe, also called ANSI in Windows) or CP1250 (Eastern Europe) or an other similar code page. Most of characters are coded in the same manner, just few language specific changes. Your example do not includes character that are different on the two encoding, so I cannot say precisely.
That code pages are used on Microsoft Windows, and they are based (but not 100% compatible) with Latin-1, so it is common to see text interpreted with such encoding. MacOs and Linux are heavily (now) UTF-8 encoded. Windows uses Unicode internally (but UTF-16)
The old encoding is probably CP437: the standard code page in DOS, so it was used frequently also for CSV files. Other frequent old encoding are CP850 (Western Europe) and CP852 (Central Europe).
For the other answers you put in the comments, I think you should go to Superuser (if you are requesting tools (some editors allow you to specify the encoding. You may use the browser (opening a local file): browsers also allow you to choose the local encoding, and I think you may copy as Unicode [not sure], other tools sometime has hidden option to import files, but possibly not with all options), or as new question in this site, if you want to do it programmatically. But so you are required to specify the language. Python is well suited for such conversions (most scripting languages are created to handle texts): python has built in many encoding, you should just specify when reading and when writing the files. R also can be instructed on the input encoding.

I wrote my own utility that helped me to diagnose and fix many thorny encoding issues. It is available as part of an Open source library. The utility converts any String to unicode sequence and vise-versa. All you will have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
And it will return String "\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064"
The same would work for any String in any language including special characters. Here is the link to the article Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison that explains about the library and where to get it (available both on Maven central and github. In the article search for paragraph: "String Unicode converter". So when you read your String convert it and see what comes up. This way you will see what symbols are there and if the info is correct and only distorted by some wrong encoding or the info itself is lost. You can easily find info on internet that provides tables of mapping of any symbol to a unicode

Ensuring consistency when encoding to UTF8 from extended ASCII

Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.
We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF8. However, some platforms unfortunately use extended ASCII (typically some form of Windows codepage), We're concerned that when encoding something with say, an umlaut, from a Windows codepage to UTF8, there are multiple possible choices within UTF8 for the character.
On a different platform (Linux, Mac OS), how do we ensure that the UTF8 character chosen there is consistent?
As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.
Thank you all in advance.

As long as you properly convert original text to Unicode first and than use Utf8 to store/transfer data there should be no problems.

The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.
Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its encoding in each Unicode encoding, such as UTF-8, has been defined unambiguously.
So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.
P.S. There is no extended ASCII. There are various character encodings, some of which coincide with ASCII in the range from 0 to 0x7F, some don’t.

tcl utf-8 characters not displaying properly in ui

Objective : To have multi language characters in the user id in Enovia v6
I am using utf-8 encoding in tcl script and it seems it saves multi language characters properly in the database (after some conversion). But, in ui i literally see the saved information from the database.
While doing the same excercise throuhg Power Web, saved data somehow gets converted back into proper multi language character and displays properly.
Am i missing something while taking tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: KÃ¡tai-PÃ¡l
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: KÃÂ¡tai-PÃÂ¡l
In UI I see name as “KÃÂ¡tai-PÃÂ¡l”

The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)

Windows API: ANSI and Wide-Character Strings -- Is it UTF8 or ASCII? UTF-16 or UCS-2 LE?

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):
ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
UTF8 is an 8-bit, variable-length encoding. All characters can be written in UTF8.
UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
UTF-16 is a 16-bit, variable-length encoding. All characters can be written in UTF16.
Are those above all correct?
Now, for the questions:
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
(I had more questions, but this is enough... I forgot some of them anyway...)
These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.
Thank you!

Are those above all correct?
Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical terms for these encodings, but not correct. For Western Windows systems, this encoding is usually Windows-1252.
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
LPWSTR is a pointer to wchar_t which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Don't really know, but I don't think they differ too much. I suppose you just try to convert some non-BMP character to UTF-8 and look whether the result is correct.
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
File paths are indeed opaque arrays of UTF-16 characters, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird mostly-undefined case insensitive behavior which causes much trouble because file names that are treated equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.

Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.

*A functions used the active ANSI codepage.
*W function use UTF-16.
Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.
LPWSTR is a UTF-16 string which may or may not be null-terminated (see MSDN)
I don't know anything about wcstombs, I always use WideCharToMultiByte.
File paths are in UTF-16. In fact all text is UTF-16 internally in Windows.
For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.
I hope that helps and that if I've got anything wrong, anyone who knows more please do edit this to correct any errors!

First of all you'll find plenty of information in this SO topic.
ASCII is a charset, not encoding. Now, there's a number of 8-bit charsets, one of them being set as default in the system (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but encoding of Unicode charset. *W functions, as I understand, use UTF-16 rather than UCS-2.

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi" which surprised me in having this post listed in #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that used by Firefox, and is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.

Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.

My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8. Thus - UTF-16. Just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness if that fits. If either fit (or don't), check which one has larger percentage of alphanumeric codes. Also you might try searchung for line breaks and determine endianness from them. Other than that, I have no ideas how to check for endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.

ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. Same is probably possible for UTF-16. But when it's no Unicode set, then it'll be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking validity of UTF chars.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio