Are BSTR UTF-16 Encoded? - windows

I'm in the process of trying to learn Unicode, and for me the most difficult part is the encoding. Can BSTRs (Basic Strings) contain code points U+10000 or higher? If not, what's the encoding for BSTRs?

In Microsoft-speak, Unicode is generally synonymous with UTF-16 (little endian if memory serves). In the case of BSTR, the answer seems to be it depends:
On Microsoft Windows, consists of a string of Unicode characters (wide or
double-byte characters).
On Apple Power Macintosh, consists of a single-byte string.
May contain multiple embedded null characters.
So, on Windows, yes, it can contain characters outside the basic multilingual plane but these will require two 'wide' chars to store.

BSTRs on Windows originally contained UCS-2, but can in principle contain the entire Unicode repertoire using surrogate pairs. Actual UTF-16 support is up to the API that receives the string - the BSTR itself has no say in how it gets treated. Most APIs support UTF-16 by now. (Michael Kaplan sorts out the details.)
The Windows headers still contain another definition for BSTR; it's basically:
#if defined(_WIN32) && !defined(OLE2ANSI)
typedef wchar_t OLECHAR;
#else
typedef char OLECHAR;
#endif
typedef OLECHAR * BSTR;
There's no real reason to consider the char variant, however, unless you desperately want to be compatible with whatever it was for. (IIRC it was active - or could be activated - for early MFC builds, and might even have been used in Office for Mac or something like that.)

Related

Why does the Tool Help Library offer 2 versions of same functions/structures?

I've noticed that the Tool Help Library offers some functions and structures with 2 versions: normal and ending with W. For example: Process32First and Process32FirstW. Since their documentation is identical, I wonder what are the differences between those two?
The W and A suffixes stand for "wide" and "ANSI". In the past, Microsoft provided separate functions, structures and types for ANSI and Unicode strings. For the purposes of this answer, Unicode means wide characters at 2 bytes per character and ANSI means 1 byte per character (though it's actually more complicated than that). By supplying both types, developers can use whichever they want, but the standard today is to use Unicode.
If you look at the ToolHelp32 header file, it does include both A and W versions of the structures and functions. If you're not finding them, do an explicit search for the identifiers; if you're just doing "view definition" you will only find the #ifdef macros. If you still can't find them, change the character set in your Visual Studio project and check again.
Because wide char arrays are twice the size, structure sizes and layouts will be incorrect if you do not use the correct types. Let the macros resolve them for you by setting the correct character set and using PROCESSENTRY32 without an explicit A or W suffix; this is the preferred method. For some APIs you are, to be honest, better off using the ANSI version, but that is something you will learn with experience and have to decide for yourself.
Here is an excellent article on the topic of character sets / encoding

can mvprintw(), curses function work with usual ascii codes?

I've developed a little console C++ game, that uses ASCII graphics, using cout for the moment. But because I want to make things work better, I have to use pdcurses. The thing is curses functions like printw(), or mvprintw() don't use the regular ascii codes, and for this game I really need to use the smiley characters, heart, spades and so on.
Is there a way to make curses work with the regular ascii codes ?
You shouldn't think of characters like the smiley face as "regular ASCII codes", because they really aren't ASCII at all. (ASCII only covers characters 32-127, plus a handful of control codes under 32.) They're a special case, and the only reason you're able to see them in (I assume?) your Windows CMD shell is that it's maintaining backwards compatibility with IBM Code Page 437 (or similar) from ancient DOS systems. Meanwhile, outside of the DOS box, Windows uses a completely different mapping, Windows-1252 (a modified version of ISO-8859-1), or similar, for its 8-bit, so-called "ANSI" character set. But both of these types of character sets are obsolete, compared to Unicode. Confused yet? :)
With curses, your best bet is to use pure ASCII, plus the defined ACS_* macros, wherever possible. That will be portable. But it won't get you a smiley face. With PDCurses, there are a couple of ways to get that smiley face: If you can safely assume that your console is using an appropriate code page, then you can pass the A_ALTCHARSET attribute, or'ed with the character, to addch(); or you can use addrawch(); or you can call raw_output(TRUE) before printing the character. (Those are all roughly equivalent.) Alternatively, you can use the "wide" build of PDCurses, figure out the Unicode equivalents of the CP437 characters, and print those, instead. (That approach is also portable, although it's questionable whether the characters will be present on non-PCs.)

How to print UTF-8 strings without using platform specific functions?

Is it possible to print UTF-8 strings without using platform specific functions?
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main()
{
ios_base::sync_with_stdio(false);
wcout.imbue(locale("en_US.UTF-8")); // broken on Windows (?)
wstring ws1 = L"Wide string.";
wstring ws2 = L"Wide string with special chars \u20AC"; // Euro character
wcout << ws1 << endl;
wcout << ws2 << endl;
wcout << ws1 << endl;
}
I get this runtime error:
terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
If I remove the line wcout.imbue(locale("en_US.UTF-8"));, I get only ws1 printed, and just once.
In another question ("How can I cin and cout some unicode text?"), Philipp writes:
"wcin and wcout don't work on Windows, just like the equivalent C functions. Only the native API works." Is it true for MinGW, too?
Thank you for any hint!
Platform:
MinGW/GCC
Windows 7
I haven't used gcc in a mingw environment on Windows, but from what I gather it doesn't support C++ locales.
Since it doesn't support C++ locales this isn't really relevant, but FYI, Windows doesn't use the same locale naming scheme as most other platforms. It uses a similar language_country.encoding pattern, but the language and country are not codes, and the encoding is a Windows code page number. So the locale would be "English_United States.65001"; however, this is not a supported combination (code page 65001, UTF-8, isn't supported as part of any locale).
The reason that only ws1 prints, and only once is that when the character \u20AC is printed, the stream fails and the fail bit is set. You have to clear the error before anything further will be printed.
C++11 introduced some things that will portably deal with UTF-8, but not everything is supported yet, and the additions don't completely solve the problem. But here's the way things currently stand:
When char16_t and char32_t are supported in VS as native types rather than typedefs, you will be able to use the standard codecvt facet specializations codecvt<char16_t,char,mbstate_t> and codecvt<char32_t,char,mbstate_t>, which are required to convert between UTF-16 or UTF-32, respectively, and UTF-8 (rather than the execution charset or system encoding). This doesn't work yet because in the current VS (and in the VS11 developer preview) these types are only typedefs, and template specializations don't work on typedefs; but the code is already in the headers in VS 2010, just protected behind an #ifdef.
The standard also defines some special purpose codecvt facet templates which are supported, codecvt_utf8, and codecvt_utf8_utf16. The former converts between UTF-8 and either UCS-2 or UCS-4 depending on the size of the wide char type you use, and the latter converts between UTF-8 and UTF-16 code units independent of the size of the wide char type.
std::wcout.imbue(std::locale(std::locale::classic(),new std::codecvt_utf8_utf16<wchar_t>()));
std::wcout << L"ØÀéîðüýþ\n";
This will output UTF-8 code units through whatever is attached to wcout. If output has been redirected to file then opening it will show a UTF-8 encoded file. However, because of the console model on Windows, and the way the standard streams are implemented, you will not get correct display of Unicode characters in the command prompt this way (even if you set the console output code page to UTF-8 with SetConsoleOutputCP(CP_UTF8)). The UTF-8 code units are output one at a time, and the console will look at each individual chunk passed to it expecting each chunk (i.e. single byte in this case) passed to be complete and valid encodings. Incomplete or invalid sequences in the chunk (every byte of all multibyte character representations in this case) will be replaced with U+FFFD when the string is displayed.
If instead of using iostreams you use the C function puts to write out an entire UTF-8 encoded string (and if the console output code page is correctly set), then you can print a UTF-8 string and have it displayed in the console. The same codecvt facets can be used with some other C++11 convenience classes to do this:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
puts(convert.to_bytes(L"ØÀéîðüýþ\n").c_str());
The above is still not quite portable, because it assumes that wchar_t is UTF-16, which is the case on Windows but not on most other platforms, and it is not required by the standard. (In fact my understanding is that it's not technically conforming because UTF-16 needs multiple code units to represent some characters and the standard requires that all characters in the chosen encoding must be representable in a single wchar_t).
std::wstring_convert<std::codecvt_utf8<wchar_t>,wchar_t> convert;
The above will portably handle UCS-4 and USC-2, but won't work outside the Basic Multilingual Plane on platforms using UTF-16.
You could use the conditional type trait to select between these two facets based on the size of wchar_t and get something that mostly works:
std::wstring_convert<
    std::conditional<
        sizeof(wchar_t) == 2,
        std::codecvt_utf8_utf16<wchar_t>,
        std::codecvt_utf8<wchar_t>
    >::type,
    wchar_t
> convert;
Or just use preprocessor macros to define an appropriate typedef, if your coding standards allow macros.
Windows support for UTF-8 is pretty poor, and whilst it's possible to do it using the Windows API, it's not at all fun. Also, your question specifies that you DON'T want to use platform-specific functions...
As for doing it in 'standard C++', I'm not sure if it's possible under Windows without platform specific code. HOWEVER, there are numerous third party libraries available which will abstract away these platform details and allow you to write portable code.
I have recently updated my applications to use UTF-8 internally with the help of the Boost.Locale library.
http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/index.html
Its locale generation class will allow you to generate a UTF-8 based locale object which you can then imbue into all the standard streams etc.
I am using this right now under both MSVC and GCC via MinGW-w64, successfully! I highly suggest you check it out. Yes, unfortunately it's not technically 'standard C++'; however, Boost is available pretty much everywhere and is practically a de facto standard, so I don't think that's a huge concern.

About the "Character set" option in Visual Studio

I have an inquiry about the "Character set" option in Visual Studio. The Character Set options are:
Not Set
Use Unicode Character Set
Use Multi-Byte Character Set
I want to know what the difference between three options in Character Set?
Also, if I choose one of them, will it affect the support for languages other than English (like RTL languages)?
It is a compatibility setting, intended for legacy code that was written for old versions of Windows that were not Unicode-enabled: the Windows 9x family, of which Windows ME was the last (and widely ignored) release. With "Not Set" or "Use Multi-Byte Character Set" selected, all Windows API functions that take a string argument are redefined to a little compatibility helper function that translates char* strings to wchar_t* strings, the API's native string type.
Such code critically depends on the default system code page setting. The code page maps 8-bit characters to Unicode which selects the font glyph. Your program will only produce correct text when the machine that runs your code has the correct code page. Characters whose value >= 128 will get rendered wrong if the code page doesn't match.
Always select "Use Unicode Character Set" for modern code. Especially when you want to support languages with a right-to-left layout and you don't have an Arabic or Hebrew code page selected on your dev machine. Use std::wstring or wchar_t[] in your code. Getting actual RTL layout requires turning on the WS_EX_RTLREADING style flag in the CreateWindowEx() call.
Hans has already answered the question, but I found these settings to have curious names. (What exactly is not being set, and why do the other two options sound so similar?) Regarding that:
"Unicode" here is Microsoft-speak for the UTF-16 encoding in particular (originally UCS-2). This is the recommended, non-codepage-dependent option described by Hans. There is a corresponding C++ #define flag called _UNICODE.
"Multi-Byte Character Set" (aka MBCS) is the official Microsoft phrase for its former international text-encoding scheme. As Hans described, there are different MBCS code pages for different languages. The encodings are "multi-byte" in that some or all characters may be represented by multiple bytes. (Some code pages use a variable-length encoding akin to UTF-8.) A typical code page will still represent all the ASCII characters as one byte each. There is a corresponding C++ #define flag called _MBCS.
"Not Set" refers to compiling with neither _UNICODE nor _MBCS #defined. In this case Windows works with a strict one-byte-per-character encoding. (Once again, there are several different code pages available in this case.)
Difference between MBCS and UTF-8 on Windows goes into these issues in a lot more detail.

how does windows wchar_t handle unicode characters outside the basic multilingual plane?

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does the Windows wchar_t handle Unicode characters outside the basic multilingual plane?
That is:
many programmers seem to feel that UTF-16 is harmful because it is a variable-length code.
wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/MacOS.
The Windows APIs use wide-characters, not Unicode.
So what does Windows do when you want to code something like 𠂊 (U+2008A) Han Character on Windows?
The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.
So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file 𐐀.txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).
But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called 𐐀.txt in the same folder as 𐐨.txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.
Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
Not all third party programs handle this correctly and so may be buggy with data outside the BMP.
Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: for example, standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
