Under Windows the only way to get Unicode support is to use wchar_t (UTF-16 under Windows) instead of char.
The problem is that I found that at least one of the boost libraries (boost::program_options) doesn't support Unicode at all: you are not able to compile the examples as Unicode.
Shouldn't boost be able to be compiled with wide strings? I would even expect this to be the default behavior under Windows.
I am diving into the world of COBOL and have written a simple program that compiles and runs as intended from my KDE Plasma command line using open-cobol (cobc). I have seen a few sites mention that COBOL is quite portable and does not require multiple compilations, but when I try to run the same output program on Windows 10 (i.e. 32-bit), the system states that the program is a 16-bit application and thus cannot run.
Are there parameters that I can use with cobc to compile in such a way that my programs will run on Windows 10, or am I fundamentally misunderstanding the portability of this language?
Compilation command: cobc -x -o program program.cob
Your program is likely already a 64-bit executable (depending on your actual OS; otherwise it's 32-bit), but it is definitely not a Windows binary (and because Windows doesn't recognize it, it just guesses that it is a 16-bit executable).
COBOL itself is portable, even between different compilers (if you restrict yourself to "standard" COBOL or use only the extensions shared by the compilers you use), but you need "some" native parts in any case.
As a well-known example, take Java or .NET: the "runtime" is a native binary which executes the Java (or MSIL) byte code.
There are some COBOL compilers generating intermediate code which is actually portable and can be used with the "native runtime" you have to install beforehand.
The easiest option for your case: take a compatible compiler and recompile your COBOL source for this platform on this platform.
I'd suggest the successor of OpenCOBOL, GnuCOBOL, using the official Windows binaries.
I understand that the sqlite3 command (as well as the lib version) can be built with a statically linked ICU lib to fully support unicode operations. It should also be possible to dynamically load an ICU extension.
But neither of these seems to be the case with OS X's sqlite3 command (as of 10.10.5).
Here's the test I'm using to determine ICU presence:
SELECT upper('ä');
This should result in "Ä" if ICU is used by the engine. "ä" indicates missing ICU support. Is that a valid test?
Apart from compiling a new command (and probably a dylib, too), is there another way to enable Unicode support so that upper, lower, and comparisons know about Unicode letters and not just ASCII?
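For the dynamic-load route, the sketch below shows roughly what loading an ICU extension through the C API could look like. It is only an illustration: the library name libSqliteIcu.dylib is a placeholder for an extension you would have to build yourself, and the stock OS X sqlite3 library may have extension loading disabled entirely.

// Hypothetical sketch: open an in-memory database, try to load a self-built
// ICU extension (placeholder file name), then run the upper('ä') test.
#include <sqlite3.h>
#include <cstdio>

static int print_row(void*, int argc, char** argv, char**) {
    for (int i = 0; i < argc; ++i)
        std::printf("%s\n", argv[i] ? argv[i] : "NULL");
    return 0;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;

    sqlite3_enable_load_extension(db, 1);  // extension loading is off by default
    char* err = nullptr;
    // Passing nullptr as the entry point lets SQLite derive it from the file name.
    if (sqlite3_load_extension(db, "libSqliteIcu.dylib", nullptr, &err) != SQLITE_OK) {
        std::printf("ICU extension not loaded: %s\n", err ? err : "unknown error");
        sqlite3_free(err);
        err = nullptr;
    }

    // Prints "Ä" if ICU is active, "ä" otherwise ('ä' written as UTF-8 bytes).
    sqlite3_exec(db, "SELECT upper('\xc3\xa4');", print_row, nullptr, &err);
    sqlite3_free(err);
    sqlite3_close(db);
    return 0;
}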
I want to test that my software works on Windows regardless of language. Is there a difference between a localised install of Windows 7 and an install with the language changed by a language pack? Is it enough to just change language packs to confirm my software works, or do I need to install a localised version? In particular, I want to be sure that the code page used by the API changes.
It's equivalent; you are perfectly safe using a language pack. I'm a bit concerned about your comment regarding code pages, though: Windows 7 natively runs in Unicode (Windows 95/98/ME did not and relied on code pages, but this changed with Windows XP), and it seems unthinkable nowadays not to develop in Unicode. Or are you doing something very specific that requires you to use non-Unicode encodings?
If that clarifies things: on Windows XP and beyond, all the ANSI versions of the APIs (ending with an A) simply wrap the Unicode APIs (ending with a W), converting to Unicode on the way in and back to ANSI on the way out.
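As a rough illustration of that wrapping (the real code lives inside Windows; the function name here is made up for the sketch), an "A" entry point essentially does this:

#include <windows.h>
#include <vector>

// Illustrative only: what a hypothetical SetWindowTextA-style wrapper does
// internally before handing off to the native "W" API.
BOOL MySetWindowTextA(HWND hwnd, const char* ansiText) {
    // Convert from the system ANSI code page (CP_ACP) to UTF-16.
    int len = MultiByteToWideChar(CP_ACP, 0, ansiText, -1, nullptr, 0);
    if (len == 0)
        return FALSE;
    std::vector<wchar_t> wide(len);
    MultiByteToWideChar(CP_ACP, 0, ansiText, -1, wide.data(), len);
    // Call the "W" version, which is the real implementation.
    return SetWindowTextW(hwnd, wide.data());
}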
I'm currently working on a hobby project (C/C++) which is supposed to work on both Windows and Linux, with full support for Unicode. Sadly, Windows and Linux use different encodings making our lives more difficult.
In my code I'm trying to keep the data as universal as possible, making it easy for both Windows and Linux. On Windows, wchar_t is encoded as UTF-16 by default, and as UCS-4 on Linux (correct me if I'm wrong).
My software opens files (_wfopen with UTF-16 on Windows, fopen with UTF-8 on Linux) and writes data to them in UTF-8. So far it's all doable. Until I decided to use SQLite.
SQLite's C/C++ interface allows for one- or two-byte encoded strings (UTF-8 or UTF-16).
Of course this does not work with wchar_t on Linux, as wchar_t on Linux is 4 bytes by default. Therefore, writing to and reading from SQLite requires conversion on Linux.
Currently the code is getting cluttered with special cases for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:
wchar_t on Windows: file paths without a problem, reading/writing to SQLite without a problem. Writing data to a file should be done in UTF-8 anyway.
wchar_t on Linux: a special case for file paths due to UTF-8 encoding, conversion before reading from/writing to SQLite (wchar_t), and, as on Windows, conversion when writing data to a file.
After reading up on this I was convinced I should stick to wchar_t on Windows. But after getting all that to work, the trouble began when porting to Linux.
Currently I'm thinking of redoing it all to stick with plain char (UTF-8), because that works on both Windows and Linux, keeping in mind that on Windows I'd need to run every string through WideCharToMultiByte to get UTF-8. Using plain char*-based strings would greatly reduce the number of special cases for Linux/Windows.
Do you have any experience with Unicode for cross-platform development? Any thoughts about the idea of simply storing data in UTF-8 instead of using wchar_t?
UTF-8 on all platforms, with just-in-time conversion to UTF-16 for Windows, is a common tactic for cross-platform Unicode.
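A minimal sketch of that "convert at the boundary" idea, assuming you keep std::string in UTF-8 everywhere and widen only when calling a Windows API such as _wfopen (the helper names are made up):

#include <string>
#include <cstdio>
#ifdef _WIN32
#include <windows.h>

// Widen a UTF-8 std::string to UTF-16 for Windows APIs.
std::wstring utf8_to_wide(const std::string& utf8) {
    // First call returns the required length, including the terminating null.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(len ? len - 1 : 0);  // drop the embedded terminator
    return wide;
}
#endif

// Open a file from a UTF-8 path on either platform.
FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    return _wfopen(utf8_to_wide(path).c_str(), utf8_to_wide(mode).c_str());
#else
    return std::fopen(path.c_str(), mode);  // Linux paths are plain UTF-8 bytes
#endif
}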
Our software is cross-platform as well, and we faced similar problems. We decided that our goal is to have the least amount of conversions possible. This means that we use wchar_t on Windows and char on Unix/Mac.
We do this by supporting _T and LPCTSTR and similar on Unix, and by having generic functions that easily convert between std::string and std::wstring. We also have a generic std::basic_string<TCHAR> (tstring) which we use in most cases.
So far this works quite well. Basically, most functions take a tstring or an LPCTSTR, and those which don't will get their parameters converted from a tstring. That means that most of the time we don't convert our strings and just pass parameters through.
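A minimal sketch of how such a generic layer might look (the emulated Unix definitions are an assumption for illustration, not the poster's actual code):

#include <cstddef>
#include <string>
#ifdef _WIN32
  #include <windows.h>  // TCHAR, LPCTSTR
  #include <tchar.h>    // _T()
#else
  // Emulate the Windows generic-text names on Unix, where "generic" means char.
  typedef char TCHAR;
  typedef const TCHAR* LPCTSTR;
  #define _T(x) x
#endif

typedef std::basic_string<TCHAR> tstring;

// A function written against the generic types compiles unchanged on both
// platforms: wide on Windows (with UNICODE defined), narrow on Unix.
std::size_t label_length(LPCTSTR label) {
    return tstring(label).size();
}

// Usage: label_length(_T("hello"));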
The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the Ansi code page stuff was originally designed, UTF-8 wasn't even invented and so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single byte, with the exception of some East Asian code pages which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
The reason is exactly what was said in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters that are longer than 2 bytes.
Microsoft said that a UTF-8 locale might break some functions, as they were written to assume multi-byte encodings use no more than 2 bytes per character; thus code pages with more bytes per character, such as UTF-8 (and also GB 18030, cp54936), could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write, but not when used as the locale.
However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS has even started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions, unlike before. There are some new options in MSVC: /execution-charset:utf-8 and /utf-8 to set the character set, and you can also set the ActiveCodePage property in the appxmanifest of a UWP app.
Before those options were introduced, starting with Windows 10 insider build 17035, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox had also been added for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region", and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it, you can call setlocale() to switch to a UTF-8 locale:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
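A small sketch of what that looks like in practice (assuming the UCRT from build 17134 or later described above; console display may additionally depend on the console code page):

#include <clocale>
#include <cstdio>

int main() {
    // ".utf8" keeps the default ANSI code page for the locale but switches the
    // code page used for char strings to UTF-8, as the quote above describes.
    if (!std::setlocale(LC_ALL, ".utf8")) {
        std::puts("UTF-8 locale not available on this runtime");
        return 1;
    }
    // char strings passed to the C runtime are now interpreted as UTF-8.
    std::printf("gr\xc3\xbc\xc3\x9f gott\n");  // "grüß gott" written as UTF-8 bytes
    return 0;
}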
You can also use this on older Windows versions:
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
See also
Is it possible to set “locale” of a Windows application to UTF-8?