Cross-platform unicode in C/C++: Which encoding to use?

Cross-platform unicode in C/C++: Which encoding to use? - windows

I'm currently working on a hobby project (C/C++) which is supposed to work on both Windows and Linux, with full support for Unicode. Sadly, Windows and Linux use different encodings making our lives more difficult.
In my code I'm trying to use the data as universal as possible, making it easy for both Windows and Linux. In Windows, wchar_t is encoded as UTF-16 by default, and as UCS-4 in Linux (correct me if I'm wrong).
My software opens ({_wfopen, UTF-16, Windows},{fopen, UTF-8, Linux}) and writes data to files in UTF-8. So far it's all doable. Until I decided to use SQLite.
SQLite's C/C++ interface allows for one or two-byte encoded strings (click).
Ofcourse this does not work with wchar_t in Linux, as the wchar_t in Linux is 4 bytes by default. Therefore, writing and reading from sqlite requires conversion for Linux.
Currently the code is cluttering up with exceptions for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:
wchar_t in Windows: Filepaths without a problem, reading/writing to sqlite without a problem. Writing data to a file should be done in UTF-8 anyway.
wchar_t in Linux: Exception for the filepaths due to UTF-8 encoding, conversion before reading/writing to sqlite (wchar_t), and the same for windows when writing data to a file.
After reading (here) I was convinced I should stick to wchar_t in Windows. But after getting all that to work, the trouble began with porting to Linux.
Currently I'm thinking of redoing it all to stick with simple char(UTF-8) because it works with both Windows and Linux, keeping the fact in mind that I need to 'WideCharToMultiByte' every string in Windows to achieve UTF-8. Using simple char* based strings will greatly reduce the number of exceptions for Linux/Windows.
Do you have any experience with unicode for cross-platform? Any thoughts about the idea of simply storing data in UTF-8 instead of using wchar_t?

UTF-8 on all platforms, with just-in-time conversion to UTF-16 for Windows is a common tactic for cross-platform Unicode.

Our software is cross-platform as well, and we faced similar problems. We decided that our goal is to have the least amount of conversions possible. This means that we use wchar_t on Windows and char on Unix/Mac.
We do this by supporting _T and LPCTSTR and similar on Unix and by having generic functions that easily convert between std::string and std::wstring. We also have a generic std::basic_string<TCHAR> (tstring) which we use in most cases.
So far this works quite well. Basicly most functions take a tstring or a LPCTSTR and those which don't will get their parameters converted from a tstring. That means that most of the time we don't convert our strings and pass through most parameters.

Related

Implications of using non-unicode library with a unicode-built application

I would like to use a non-unicode library from my unicode-built MFC application. However, I'm not sure whether there is a possibility of occurring events such as unintended memory allocations, string handling inside the non-unicode library.
Please explain any implications or provide a resource page.

Whether or not an application is Unicode is a compile-time, not a run-time, distinction - there is no inherent reason why a Unicode executable can't load an ANSI DLL. If the application and DLL both used MFC they would be linked to different MFC runtimes, which could cause problems, but as that isn't the case you should be fine.
Where you need to take care is to ensure that any string data transferred between the DLL and the application are interpreted consistently. Mostly this just means converting between ANSI and Unicode as necessary, and Windows provides API functions which allow you to do this easily enough.
You should, however, check the header files for any data types that are interpreted differently when compiling for Unicode than when compiling for ANSI. For example, if one of the DLL functions was declared as
DWORD process_string(TCHAR * string)
then the non-unicode library would interpret TCHAR as char, but your application would interpret it as wchar_t, hiding the fact that you need to convert the string to ANSI before calling the function.

Script for correcting linux filenames on windows and vice versa?

As far as I know, Unix-like systems use UTF-8 for encoding filenames, while Windows system use their own Windows single-byte encodings.
I am working with archives with japanese filenames in them quite oftenly. When I open such archive created in Windows, japanese letters are dead, because filename encoding is incorrect.
Same thing happens, when I create archive in my Linux and then someone opens it under Windows.
So, I have thought that this should be quite common problem, and, because filenames are recoverable there must already exist correcting .sh script for linux and .bat script for Windows.
But after googling for quite a long time I still have not found anything.
Is there such scripts at all? If not, what difficultuies may have stopped people from creating them?
Update
I would be happy with a solution that works for most Linux systems and most Windows systems.

Windows uses the two byte encoding UTF-16. Your problem is most likely that you are using single byte ANSI versions of whatever archive tool you are using.
Until you give more details of the code and tools you are using it's hard to give specific advice. However, the are no limitations on using the full range of Unicode characters in modern Windows file systems.

Thank you for your input. Case indeed looks quite complex for simple bash script, I'll need to use programming language.
I don't see anything like "close question" button, so I'll use this answer to do so.

Take a look to convmv tool available for Unix systems

Specifying the encoding when saving a file in a Windows app

I am writing a program that handles mostly Unicode text. The C standard library function 'fopen' provides for writing the characters to file in utf-8 format by including in the mode string argument "..., ccs=utf-8". It seems that the Windows API 'CreateFile' does give such provision. Must I use 'fopen' then?

This is Specific to programming under Windows, using Visual Studio, and Microsoft tools. My personal advice is to not to use fopen with the extended syntax, otherwise later there will be compatibility issues when porting your application to other operating systems. When under Windows, do the Windows way, use CreateFile.

The contents of the file are defined not by the file-opening function, but the actual data you write. After you get the file handle (either by fopen or CreateFile), you can write in UTF-8, or ANSI, or whatever you like.
Note that some encodings require a special bit at the beginning of the file.

Why isn't UTF-8 allowed as the "ANSI" code page?

The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?

The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the Ansi code page stuff was originally designed, UTF-8 wasn't even invented and so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single byte, with the exception of some East Asian code pages which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.

_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.

Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.

The reason is exactly like what was said in jamesdlin's answers and the comments below it: MBCS is the same as DBCS in Windows and some functions don't work with characters that are longer than 2 bytes
Microsoft said that a UTF-8 locale might break some functions as they were written to assume multibyte encodings used no more than 2 bytes per character, thus code pages with more bytes such as UTF-8 (and also GB 18030, cp54936) could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write but not when using as a locale
However Microsoft has finally fixed that so now we can use UTF-8 as a locale. In fact MS even started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions like before. There are some new options in MSVC: /execution-charset:utf-8 and /utf-8 to set the charset, or you can also set the ActiveCodePage property in appxmanifest of the UWP app
Since Windows 10 insider build 17035, before those options were introduced, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox had also been added for setting the locale code page to UTF-8
To open that dialog box open start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative
After enabling it you can call setlocale() to change to UTF-8 locale:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
You can also use this in older Windows versions
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
See also
Is it possible to set “locale” of a Windows application to UTF-8?

Can I avoid using CP1252 on Windows?

I would like all my toolkit to use UTF-8 but find that some tools on Windows seem to use CP1252 (which appears to be Windows-specific). Does this create output which is incompatible and if so at which codepoints? If so, can I do anything about it?
(I don't completely understand the issues so I'd be grateful for basic education on these encodings).

Tools hard-coding for code page 1252 on Windows is very unlikely. Much more likely is that it happens to be the default code page on your machine. 1252 is used in Western Europe and the Americas. It is configured in Control Panel, Regional and Language options. They've been using different names for it, on Win7 it is in the Administrative tab, Change System Locale.
Yes, many tools use the default code page unless they have a good reason to chose another encoding. The BOM is such a good reason. Notable examples are Notepad (unless you change the Encoding in the File + Open dialog to something else than Ansi) and C/C++ compilers. There typically isn't anything special you need to do to use the default code page. Guessing the correct code page for a text file when you don't have a BOM is impossible to do accurately. Google "bush hid the facts" for a very amusing war story.

Six years old and still relevant: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, about your question: Yes, there are still tools out there that choke on UTF-8 files. But more and more tools are "getting it". If you're developing your own stuff, you might want to look into Python 3 where all strings are Unicode. The philosophy is to convert all your inputs into Unicode (if necessary) as early as possible, and reconvert them to a target encoding as late as possible. There are toolkits out there that will do a good job of guessing the encoding of a particular file (for example, Mark Pilgrim's chardet, a port of Mozilla's encoding detector). This is nice if you're working with files that don't specify an encoding.

CP1252 and UTF-8 are the same for all characters < 128. They differ above that. So if you stick to English and stay away from diacritical marks these will be the same.
Most of the Windows tools will use whatever is set as the current user's current codepage, which will default to 1252 for US Windows. You can change that to another codepage pretty easily. But UTF-8 is NOT one of the available codepage options for Windows. (I wish it was).

Some utilities under Windows will understand the UTF-8 byte-order mark at the start of a file. Unfortunately I don't know how to determine if this will work except to try it.

UTF-8 is supported on Windows but not as a current codepage. You can use UTF-8 for converting to/from it but you cannot set is as current codepage.
First do not try to waste time by setting the codepage - this approach will remind you of Sisyphus myth - you can't really solve the problem using codepages, you have to use Unicode.
The only real solution for you is to build your application as Unicode so it will use UTF-16 and to convert to/from UTF-8 on in/out operations. This is done quite simple because fopen supports reading or writing UTF-8.
Regarding the usage of other Windows tools with UTF-8 file, you should not be aware because if the tool is able to work with ASCII it will work with UTF-8 (even so it may not be able to distinguish between Unicode chars but at least it will be able to load/parse the files).
BTW, You forgot to specify what programming language are you using and what Windows tools are you considering for usage.
Also, if you ware interested about more internationalization stuff please visit my blog.i18n.ro

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio