Ensuring consistency when encoding to UTF8 from extended ASCII - utf-8

Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.
We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF8. However, some platforms unfortunately use extended ASCII (typically some form of Windows codepage). We're concerned that when encoding something with, say, an umlaut from a Windows codepage to UTF8, there are multiple possible choices within UTF8 for the character.
On a different platform (Linux, Mac OS), how do we ensure that the UTF8 character chosen there is consistent?
As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.
Thank you all in advance.

As long as you properly convert the original text to Unicode first and then use UTF-8 to store/transfer the data, there should be no problems.

The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.
Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its encoding in each Unicode encoding, such as UTF-8, has been defined unambiguously.
So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.
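As a small sketch of that (assuming ICU4C is installed and linked as icuuc; "windows-1252" is the converter alias ICU commonly uses for that code page), a round trip from a legacy code page to UTF-8 might look like this:

    #include <unicode/unistr.h>   // icu::UnicodeString
    #include <iostream>
    #include <string>

    int main() {
        // "f\xFCr" is "für" in Windows-1252 (0xFC = u with umlaut).
        const char cp1252[] = "f\xFCr";

        // Decode the legacy bytes to Unicode using ICU's mapping tables...
        icu::UnicodeString text(cp1252, "windows-1252");

        // ...then encode to UTF-8. U+00FC always becomes the bytes C3 BC.
        std::string utf8;
        text.toUTF8String(utf8);

        std::cout << utf8 << "\n";   // prints "für" on a UTF-8 terminal
    }

The UTF-8 bytes for a given code point are fixed by the definition of UTF-8, so any remaining variation could only come from which code point a legacy byte is mapped to, which is exactly what the mapping tables pin down.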
P.S. There is no such thing as extended ASCII. There are various character encodings, some of which coincide with ASCII in the range 0 to 0x7F, and some of which don't.

Related

Windows encoding clarification

I would like to translate a game, this game loads the strings from a text file.
The destination language uses non-ASCII characters, so I naïvely saved my file in UTF-8, but it does not work: letters with diacritics are not shown correctly.
Looking more closely at the configuration file where the text filename is stored, I found a CHARSET option that can take any of these values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
As far as I understand, these are fairly standard values in the Windows APIs, where "charset" and "character encoding" are used as synonyms.
So my question is: is there a correspondence between these names and standard names like UTF-8 or ISO-8859-2? If so, what is it?
Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.
The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.
You can get the standard encoding names (as registered by IANA) using the table under the Remarks section of this MSDN page.
Just find the Character set row and read the Code page number; the standard name is windows-[code page number].
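If you want to resolve those symbolic CHARSET values to code page numbers programmatically, one option (a sketch of my own using the Win32 TranslateCharsetInfo function; double-check the cast convention against MSDN) is:

    #include <windows.h>
    #include <stdio.h>
    #pragma comment(lib, "gdi32.lib")   // TranslateCharsetInfo lives in GDI

    int main() {
        CHARSETINFO ci = {};
        // TCI_SRCCHARSET: interpret the first argument as a *_CHARSET value.
        if (TranslateCharsetInfo((DWORD*)(UINT_PTR)EASTEUROPE_CHARSET,
                                 &ci, TCI_SRCCHARSET)) {
            // ciACP is the matching ANSI code page; the IANA-registered
            // name is then "windows-<code page>", e.g. windows-1250.
            printf("EASTEUROPE_CHARSET -> code page %u (windows-%u)\n",
                   ci.ciACP, ci.ciACP);
        }
        return 0;
    }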

How many valid utf8 characters are there?

I know that this is a little vague, so for context, think of it as "a character you could tweet," or something like that. My question is how many valid Unicode characters there are that a browser or a service that supports UTF-8 could resolve, in such a way that a UTF-8 browser could copy and paste them around without any issues.
I guess what I don't want is the full character space, because I know a lot of it is reserved for command characters or reserved characters that wouldn't be shown (unless I'm super wrong!).
UTF-8 isn't the important factor, since all of the standard Unicode encodings (UTF-8, UTF-16, UTF-32) encode the same character space, just in different ways.
From your explanation I take it you don't just want the 1,112,064 valid Unicode code points?
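(For reference, that figure is just the Unicode code space minus the surrogate range; a quick check:)

    // 17 planes x 65,536 code points, minus the 2,048 surrogate code
    // points U+D800..U+DFFF, which can never be characters:
    static_assert(17 * 65536 - 2048 == 1112064, "valid Unicode code points");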
Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, but a handful of those are what you're calling "control characters". Which ones do or don't fall into that category depends on how you're counting. Copying and pasting may result in some characters being treated as identical to one another, or ignored altogether, depending on the OS and the programs doing the copying and pasting.
However, because Unicode is forward-compatible, some systems will correctly preserve characters which haven't yet been assigned. After all, just because you're running Windows XP and you copy and paste a document with characters that weren't standardised until 2009 doesn't mean you expect them to vanish. There could be a million or so extra possible characters by this way of thinking, although their visual appearance may be indistinguishable in some places.

What's the best way to identify unicode encoded text files in Windows?

I am working on a codebase which has some unicode encoded files scattered throughout as a result of multiple team members developing with different editors (and default settings). I would like to clean up our code base by finding all the unicode encoded files and converting them back to ANSI encoding.
Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.
See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”
UTF-8 can be detected with validation. You can also look for the BOM EF BB BF, but don't rely on it.
UTF-16 can be detected by looking for the BOM.
UTF-32 can be detected by validation, or by the BOM.
Otherwise assume the ANSI code page.
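A rough sketch of those checks (my own illustration, not a library routine; it only looks at the leading bytes, and the no-BOM case still needs a UTF-8/UTF-32 validation pass):

    #include <cstddef>

    // Guess an encoding from the first bytes of a file, using the BOM if any.
    // Note: the UTF-32 LE BOM (FF FE 00 00) must be tested before the
    // UTF-16 LE BOM (FF FE), since the latter is a prefix of the former.
    const char* guess_by_bom(const unsigned char* b, std::size_t n) {
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32 LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32 BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16 LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16 BE";
        return "no BOM: validate as UTF-8, otherwise assume the ANSI code page";
    }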
Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.
Well, that makes things a lot simpler: UTF-8 without non-ASCII chars is just ASCII.
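Since the files then contain only ASCII, a trivial byte scan is enough to confirm it (my own sketch):

    #include <cstddef>

    // True if every byte is < 0x80, i.e. the data is plain ASCII and
    // therefore simultaneously valid UTF-8 and valid "ANSI".
    bool is_pure_ascii(const unsigned char* data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i)
            if (data[i] >= 0x80)
                return false;
        return true;
    }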
Unicode is a standard, it is not an encoding. There are many encodings that implement Unicode, including UTF-8, UTF-16, UCS-2, and others. The translation of any of these encodings to ASCII depends entirely on what encoding your "different editors" use.
Some editors insert byte-order marks, or BOMs, at the start of Unicode files. If your editors do that, you can use them to detect the encoding.
ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding used by Windows (CP-1252 on Western systems) is not actually an ANSI standard.
Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.
Actually, if you want to find out in Windows whether a file is Unicode, simply run findstr on the file for a string you know is in there.
findstr /I /C:"SomeKnownString" file.txt
If the file is Unicode (UTF-16), it will come back empty, because findstr searches raw bytes and the interleaved null bytes break the match. Then, to be sure, run findstr on a letter or digit you know is in the file:
FindStr /I /C:"P" file.txt
You will probably get many occurrences, and the key is that they will be spaced apart (each matched character is separated by a null byte). This is a sign the file is Unicode and not ASCII.
Hope this helps.
If you're looking for a programmatic solution, IsTextUnicode() might be an option.
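A minimal sketch of calling it (note that IsTextUnicode is only a heuristic and is known to guess wrong on short or unusual input, so treat its answer as advisory):

    #include <windows.h>

    // Ask Windows' built-in heuristic whether a buffer looks like UTF-16
    // text. Passing NULL as the last argument applies all available tests.
    // (Declared in <windows.h>; link with Advapi32.lib.)
    bool looks_like_utf16(const void* data, int size) {
        return IsTextUnicode(data, size, NULL) != 0;
    }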
It's kind of hard to say, but I'd start by looking for a BOM. Most Windows programs that write Unicode files emit BOMs.
If these files exist in your codebase, presumably they compile. You might ask yourself whether you really need to do this "tidying up". If you do need to do it, then I would ask how the tool chain that processes these files discovers their encoding. If you know that, you'll be able to use the same diagnostic.

Windows API: ANSI and Wide-Character Strings -- Is it UTF8 or ASCII? UTF-16 or UCS-2 LE?

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):
ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
UTF8 is an 8-bit, variable-length encoding. All characters can be written in UTF8.
UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
UTF-16 is a 16-bit, variable-length encoding. All characters can be written in UTF16.
Are those above all correct?
Now, for the questions:
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
(I had more questions, but this is enough... I forgot some of them anyway...)
These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.
Thank you!
Are those above all correct?
Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is not really correct. For Western Windows systems, this encoding is usually Windows-1252.
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
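For example, here is a sketch that converts a UTF-16 string, including one non-BMP character (stored as a surrogate pair), to UTF-8 with WideCharToMultiByte; note that for CP_UTF8 the last two parameters must be NULL:

    #include <windows.h>
    #include <stdio.h>

    int main() {
        // "A" followed by U+1F600; the compiler stores the latter as the
        // surrogate pair D83D DE00, since wchar_t is 16 bits on Windows.
        const wchar_t wide[] = L"A\U0001F600";

        // First call: ask for the required buffer size in bytes
        // (cchWideChar == -1 means "null-terminated", NUL included).
        int size = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);

        char utf8[16] = {0};
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, size, NULL, NULL);

        // Expect 41 for 'A' and the 4-byte sequence F0 9F 98 80 for U+1F600.
        for (int i = 0; utf8[i] != 0; ++i)
            printf("%02X ", (unsigned char)utf8[i]);
        printf("\n");
        return 0;
    }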
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
LPWSTR is a pointer to wchar_t which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
I don't really know, but I don't think they differ much. I suppose you could just try converting some non-BMP character to UTF-8 with each and check whether the result is correct.
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
File paths are indeed opaque arrays of UTF-16 code units, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird, mostly undefined, case-insensitive behavior, which causes much trouble because file names that are treated as equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux, without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.
*A functions use the active ANSI codepage.
*W functions use UTF-16.
Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.
LPWSTR is a UTF-16 string which may or may not be null-terminated (see MSDN)
I don't know anything about wcstombs, I always use WideCharToMultiByte.
File paths are in UTF-16. In fact all text is UTF-16 internally in Windows.
For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.
I hope that helps and that if I've got anything wrong, anyone who knows more please do edit this to correct any errors!
First of all you'll find plenty of information in this SO topic.
ASCII is a charset, not an encoding. Now, there are a number of 8-bit charsets, one of which is set as the default in the system (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but an encoding of the Unicode charset. *W functions, as I understand it, use UTF-16 rather than UCS-2.

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi" which surprised me in having this post listed in #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection used by Firefox, and is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation, a detailed explanation.
Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.
My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8, so it must be UTF-16; you just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes: if you encounter invalid codes, try the other endianness and see if that fits. If both fit (or neither does), check which one has the larger percentage of alphanumeric codes. You might also try searching for line breaks and determining the endianness from them. Other than that, I have no ideas on how to check for endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
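As an illustration of that second step, here is a bare-bones structural UTF-8 check (a sketch only: it verifies lead/continuation byte patterns but does not reject overlong sequences or encoded surrogates, which a strict validator should):

    #include <cstddef>

    // Returns true if 'data' is structurally valid UTF-8: every lead byte
    // is followed by the right number of 10xxxxxx continuation bytes.
    bool is_valid_utf8(const unsigned char* data, std::size_t len) {
        std::size_t i = 0;
        while (i < len) {
            unsigned char b = data[i++];
            std::size_t extra;
            if      (b < 0x80)           extra = 0;      // ASCII
            else if ((b & 0xE0) == 0xC0) extra = 1;      // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) extra = 2;      // 3-byte sequence
            else if ((b & 0xF8) == 0xF0) extra = 3;      // 4-byte sequence
            else                         return false;   // stray continuation / invalid lead
            if (i + extra > len)
                return false;                            // truncated sequence
            for (std::size_t k = 0; k < extra; ++k)
                if ((data[i + k] & 0xC0) != 0x80)
                    return false;                        // bad continuation byte
            i += extra;
        }
        return true;
    }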
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. The same is probably possible for UTF-16. But when it's not a Unicode encoding, it'll be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking the validity of UTF-8 sequences.

Resources