Encodings: OS, keyboard input, etc. - Windows

I'm trying to understand how operating systems deal with different encodings. I have read that Windows uses UTF-16 internally. If I type text into a text editor, is it going to be saved in UTF-16 on the hard disk on Windows? Is the text typed in (and temporarily stored in RAM until saved) encoded in the OS's internal encoding unless explicitly converted by a text editor with such capability? If I create a UTF-8 SQL database, but fill it with text using my keyboard on Windows, is the OS pushing UTF-16 encoded text inside or does the system at some point realise that it should be UTF-8? When I make webpages I'm told that it's best to use UTF-8. So I make sure that my text editors are set to that, but how do I know that the input from the keyboard/OS is UTF-8?

The Unicode character set can be represented with many encodings, chief among them UTF-8, UTF-16 and UTF-32. The software you use will translate between different charsets and encodings internally as appropriate, though you might have to select the input and output encodings for permanent storage (files) yourself.
Because Windows introduced its whole suite of wide APIs back when people thought Unicode would stay UCS-2 forever, most Windows components use UTF-16 internally. Also, the narrow APIs are generally more restricted, where they exist at all, and generally use some obsolete ANSI codepage, not UTF-8. Thus, if your editor was originally developed on Windows, or uses a standard Windows control for text display and editing, your text will (mostly transparently to you) be UTF-16 in memory.
Most text editors (including Windows Notepad) still save text by default as UTF-8. Microsoft editors (and a lot of other Windows software), though, tend to prefix any Unicode text file with a BOM, which can make non-BOM-aware software choke.
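If you have to deal with such files in your own code, handling the BOM is straightforward. Below is a minimal sketch (the function name and buffering strategy are my own, not anything standard): it reads a file as raw bytes and strips a leading UTF-8 BOM (0xEF 0xBB 0xBF) if present, so that code further down the line doesn't choke on it.

    #include <cstdio>
    #include <string>

    // Read a whole file into a std::string, stripping a UTF-8 BOM if present.
    // Sketch only: error handling is reduced to returning an empty string.
    std::string ReadTextFile(const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return {};

        std::string data;
        char buf[4096];
        size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
            data.append(buf, n);
        std::fclose(f);

        // The UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF.
        if (data.size() >= 3 &&
            (unsigned char)data[0] == 0xEF &&
            (unsigned char)data[1] == 0xBB &&
            (unsigned char)data[2] == 0xBF) {
            data.erase(0, 3);
        }
        return data;
    }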

Related

Cross-platform unicode in C/C++: Which encoding to use?

I'm currently working on a hobby project (C/C++) which is supposed to work on both Windows and Linux, with full support for Unicode. Sadly, Windows and Linux use different encodings making our lives more difficult.
In my code I'm trying to keep the data as universal as possible, making things easy for both Windows and Linux. On Windows, wchar_t is encoded as UTF-16 by default, and as UCS-4 on Linux (correct me if I'm wrong).
My software opens files with _wfopen (UTF-16) on Windows and fopen (UTF-8) on Linux, and writes data to files in UTF-8. So far it's all doable. Until I decided to use SQLite.
SQLite's C/C++ interface allows for one- or two-byte encoded strings (UTF-8 or UTF-16).
Of course this does not work with wchar_t on Linux, as wchar_t on Linux is 4 bytes by default. Therefore, writing to and reading from SQLite requires a conversion on Linux.
Currently the code is cluttered with special cases for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:
wchar_t on Windows: file paths without a problem, reading/writing to SQLite without a problem. Writing data to a file should be done in UTF-8 anyway.
wchar_t on Linux: a special case for the file paths due to the UTF-8 encoding, a conversion before reading from/writing to SQLite, and the same for Windows when writing data to a file.
After reading up on the subject I was convinced I should stick to wchar_t on Windows. But after getting all that to work, the trouble began when porting to Linux.
Currently I'm thinking of redoing it all to stick with plain char (UTF-8), because it works on both Windows and Linux, keeping in mind that I'll need to run every string through WideCharToMultiByte on Windows to get UTF-8. Using plain char*-based strings will greatly reduce the number of special cases for Linux/Windows.
Do you have any experience with unicode for cross-platform? Any thoughts about the idea of simply storing data in UTF-8 instead of using wchar_t?
UTF-8 on all platforms, with just-in-time conversion to UTF-16 for Windows (at the API boundary), is a common tactic for cross-platform Unicode.
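To make that concrete, here is a minimal sketch of the tactic, assuming the program stores everything in UTF-8 std::strings; the helper name Widen and the wrapper OpenForReading are made up for illustration, while MultiByteToWideChar and CreateFileW are the real Win32 calls.

    #include <string>
    #include <windows.h>

    // Convert a UTF-8 std::string to UTF-16 just before calling a wide Win32 API.
    std::wstring Widen(const std::string& utf8) {
        if (utf8.empty()) return {};
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      (int)utf8.size(), nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            (int)utf8.size(), &wide[0], len);
        return wide;
    }

    // The rest of the program keeps passing UTF-8 around; only this thin
    // wrapper knows that Windows wants UTF-16.
    HANDLE OpenForReading(const std::string& utf8Path) {
        return CreateFileW(Widen(utf8Path).c_str(), GENERIC_READ,
                           FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, nullptr);
    }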
Our software is cross-platform as well, and we faced similar problems. We decided that our goal is to have the least amount of conversions possible. This means that we use wchar_t on Windows and char on Unix/Mac.
We do this by supporting _T and LPCTSTR and similar on Unix and by having generic functions that easily convert between std::string and std::wstring. We also have a generic std::basic_string<TCHAR> (tstring) which we use in most cases.
So far this works quite well. Basically, most functions take a tstring or an LPCTSTR, and those that don't have their parameters converted from a tstring. That means that most of the time we don't convert our strings at all and simply pass parameters through.
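A rough sketch of what such a setup can look like; note that the Unix-side definitions of TCHAR and _T below are an assumption on my part (those macros only exist out of the box in the Windows headers), and LogMessage is purely illustrative.

    // On Windows, <tchar.h> provides TCHAR and _T; on Unix we define narrow
    // equivalents ourselves so the same code compiles on both platforms.
    #ifdef _WIN32
      #include <tchar.h>
    #else
      typedef char TCHAR;
      #define _T(x) x
    #endif

    #include <string>

    typedef std::basic_string<TCHAR> tstring;

    // Most functions take a tstring: on Windows (built as Unicode) it holds
    // UTF-16, on Unix it holds UTF-8, with no conversion in between.
    void LogMessage(const tstring& msg) {
        (void)msg;  // placeholder body for the sketch
    }

    int main() {
        tstring greeting = _T("hello");
        LogMessage(greeting);
    }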

Specifying the encoding when saving a file in a Windows app

I am writing a program that handles mostly Unicode text. The C standard library function 'fopen' provides for writing the characters to a file in UTF-8 format by including "..., ccs=UTF-8" in the mode string argument. It seems that the Windows API 'CreateFile' does not give such a provision. Must I use 'fopen' then?
This is specific to programming under Windows, using Visual Studio and Microsoft tools. My personal advice is not to use fopen with the extended syntax; otherwise there will be compatibility issues later when porting your application to other operating systems. When under Windows, do it the Windows way: use CreateFile.
The contents of the file are defined not by the file-opening function, but by the actual data you write. After you get the file handle (either from fopen or CreateFile), you can write UTF-8, or ANSI, or whatever you like.
Note that some encodings are conventionally marked with a special byte sequence (a byte order mark) at the beginning of the file.
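To illustrate that point, here is a minimal Windows-only sketch (the file name and sample text are made up): the in-memory UTF-16 string is encoded to UTF-8 with WideCharToMultiByte and written through a handle obtained from CreateFile; the optional marker mentioned above is just three extra bytes written first.

    #include <windows.h>
    #include <string>

    int main() {
        const std::wstring text = L"Gr\u00fc\u00dfe";  // UTF-16 in memory

        // Encode to UTF-8 ourselves; the file API doesn't care what we write.
        int len = WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                                      nullptr, 0, nullptr, nullptr);
        std::string utf8(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                            &utf8[0], len, nullptr, nullptr);

        HANDLE h = CreateFileW(L"out.txt", GENERIC_WRITE, 0, nullptr,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        DWORD written;
        const unsigned char bom[] = {0xEF, 0xBB, 0xBF};  // optional UTF-8 BOM
        WriteFile(h, bom, sizeof bom, &written, nullptr);
        WriteFile(h, utf8.data(), (DWORD)utf8.size(), &written, nullptr);
        CloseHandle(h);
    }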

Why isn't UTF-8 allowed as the "ANSI" code page?

The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the Ansi code page stuff was originally designed, UTF-8 hadn't even been invented, so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single-byte, with the exception of some East Asian code pages which are one or two bytes per character). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
The reason is exactly what was said in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters that are longer than 2 bytes.
Microsoft said that a UTF-8 locale might break some functions, as they were written to assume that multibyte encodings used no more than 2 bytes per character; thus code pages that use more bytes per character, such as UTF-8 (and also GB 18030, cp54936), could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write, but not for use as the locale.
However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS has even started recommending the ANSI APIs (-A) again, instead of the Unicode (-W) versions as before. There are some new options in MSVC: /execution-charset:utf-8 and /utf-8 to set the charset, and you can also set the ActiveCodePage property in the appxmanifest of a UWP app.
Since Windows 10 insider build 17035 (before those options were introduced), a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox has also been available for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it you can call setlocale() to switch to a UTF-8 locale:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
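A minimal sketch of that in code (the file name and the message are made up; this assumes the UCRT version requirements quoted above are met, and that the source is compiled with /utf-8 so the narrow string literals really are UTF-8 bytes):

    #include <clocale>
    #include <cstdio>

    int main() {
        // Ask the UCRT for the default locale settings but with UTF-8 as the
        // code page, as described in the documentation quoted above.
        if (!std::setlocale(LC_ALL, ".utf8"))
            std::fprintf(stderr, "UTF-8 locale not available on this runtime\n");

        // With the UTF-8 locale active, char strings handed to the C runtime
        // are interpreted as UTF-8, so this non-ASCII file name round-trips.
        std::FILE* f = std::fopen("gr\u00fc\u00dfe.txt", "w");
        if (f) {
            std::fputs("Hello, w\u00f6rld\n", f);
            std::fclose(f);
        }
    }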
You can also use this on older Windows versions:
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
See also
Is it possible to set “locale” of a Windows application to UTF-8?

Character Encoding, UTF or ANSI?

I'm using Eclipse in Ubuntu to edit PHP files.
But, unfortunately, some of these PHP files were created in Notepad++ in Windows XP, with ANSI encoding defined.
Also, these files generate HTML code with charset=ISO-8859-1.
When I configured Eclipse to use ISO-8859-1, many special characters were lost and changed to '???', and when I try to save a file with the ISO encoding, Eclipse displays an error saying it was not possible to save the file because some characters aren't compatible with the charset.
How can I save these files without changing the encoding, or how can I change the encoding without losing characters?
To the point: you need to read those files using the ANSI encoding and then write them using the ISO-8859-1 encoding. In Notepad++ you can change the encoding via the Format menu. Unfortunately there's no ISO-8859-1 option, but UTF-8 should suffice and is nowadays also the preferred choice for world domination, since ISO-8859-1 only covers Latin characters, not, for example, Cyrillic, Greek, Chinese, Arabic, etcetera.
By "ANSI" do you mean "Windows code page 1252"?
In either case, once you figure out the source encoding you can use iconv to convert from that encoding to UTF-8.
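If you want to do that conversion from your own code rather than from the command line (on the command line it is simply: iconv -f CP1252 -t UTF-8 in.php > out.php), the POSIX iconv API covers it. A minimal sketch follows, using glibc conventions; the function name is made up, and note that some platforms declare iconv()'s input pointer as const char** and may require linking with -liconv.

    #include <iconv.h>
    #include <string>

    // Convert one CP1252 ("ANSI") buffer to UTF-8 using the POSIX iconv API.
    // Sketch only: errors are reported by returning an empty string.
    std::string Cp1252ToUtf8(const std::string& input) {
        iconv_t cd = iconv_open("UTF-8", "CP1252");
        if (cd == (iconv_t)-1) return {};

        std::string out(input.size() * 4, '\0');  // generous worst-case size
        char* in_ptr = const_cast<char*>(input.data());
        size_t in_left = input.size();
        char* out_ptr = &out[0];
        size_t out_left = out.size();

        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        iconv_close(cd);
        if (rc == (size_t)-1) return {};

        out.resize(out.size() - out_left);
        return out;
    }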
The latest version can convert between ISO-8859-1 and UTF-8 without losing information; version 5.6.8 is able to do so.

Can I avoid using CP1252 on Windows?

I would like all my toolkit to use UTF-8 but find that some tools on Windows seem to use CP1252 (which appears to be Windows-specific). Does this create output which is incompatible and if so at which codepoints? If so, can I do anything about it?
(I don't completely understand the issues so I'd be grateful for basic education on these encodings).
Tools hard-coding code page 1252 on Windows are very unlikely. Much more likely is that it happens to be the default code page on your machine. 1252 is used in Western Europe and the Americas. It is configured in Control Panel, Regional and Language Options. They've been using different names for it; on Win7 it is in the Administrative tab, under Change System Locale.
Yes, many tools use the default code page unless they have a good reason to choose another encoding. The BOM is such a good reason. Notable examples are Notepad (unless you change the Encoding in the File + Open dialog to something other than ANSI) and C/C++ compilers. There typically isn't anything special you need to do to use the default code page. Guessing the correct code page for a text file when you don't have a BOM is impossible to do accurately. Google "bush hid the facts" for a very amusing war story.
Six years old and still relevant: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, about your question: Yes, there are still tools out there that choke on UTF-8 files. But more and more tools are "getting it". If you're developing your own stuff, you might want to look into Python 3 where all strings are Unicode. The philosophy is to convert all your inputs into Unicode (if necessary) as early as possible, and reconvert them to a target encoding as late as possible. There are toolkits out there that will do a good job of guessing the encoding of a particular file (for example, Mark Pilgrim's chardet, a port of Mozilla's encoding detector). This is nice if you're working with files that don't specify an encoding.
CP1252 and UTF-8 are the same for all characters below 128; they differ above that. For example, 'A' is the byte 0x41 in both, while 'é' is the single byte 0xE9 in CP1252 but the two bytes 0xC3 0xA9 in UTF-8. So if you stick to English and stay away from diacritical marks, the two will be identical.
Most of the Windows tools will use whatever is set as the current user's current codepage, which will default to 1252 for US Windows. You can change that to another codepage pretty easily. But UTF-8 is NOT one of the available codepage options for Windows. (I wish it was).
Some utilities under Windows will understand the UTF-8 byte-order mark at the start of a file. Unfortunately I don't know how to determine if this will work except to try it.
UTF-8 is supported on Windows, but not as the current codepage. You can convert to and from UTF-8, but you cannot set it as the current codepage.
First, do not waste time trying to set the codepage; that approach will remind you of the myth of Sisyphus. You can't really solve the problem using codepages; you have to use Unicode.
The only real solution for you is to build your application as Unicode, so that it uses UTF-16 internally, and to convert to/from UTF-8 on input/output operations. This is quite simple to do because fopen supports reading and writing UTF-8 (via the ccs mode flag mentioned earlier).
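A minimal sketch of that in/out conversion using the Microsoft-specific ccs flag (the file name and text are made up): the program keeps UTF-16 wide strings in memory and lets the C runtime produce UTF-8 on disk. Note that the CRT typically also writes a BOM to a newly created file opened this way, which ties back to the BOM caveat earlier in this document.

    #include <cstdio>
    #include <cwchar>

    int main() {
        // MSVC extension: open a text stream whose on-disk encoding is UTF-8.
        // Wide strings written to it are converted from UTF-16 automatically.
        std::FILE* f = std::fopen("notes.txt", "w, ccs=UTF-8");
        if (!f) return 1;

        std::fputws(L"Gr\u00fc\u00dfe from a UTF-16 program\n", f);
        std::fclose(f);
    }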
Regarding the use of other Windows tools with UTF-8 files, you should not worry: if a tool is able to work with ASCII, it will generally work with UTF-8 (even if it cannot distinguish individual Unicode characters, it will at least be able to load/parse the files).
BTW, you forgot to specify what programming language you are using and which Windows tools you are considering.
Also, if you are interested in more internationalization stuff, please visit my blog.i18n.ro

Resources