I'd like to preview WCHAR strings in the variable display of the Xcode 3.2 debugger.
Basically, if I have
WCHAR wtext[128];
wcscpy(wtext, L"Hello World");
I'd like to see "Hello World" for wtext when tracing into the function.
You'll have to write a custom data formatter for your wchar_t strings, complete with your assumptions about the byte width and the character encoding. The C++ standard specifies neither of these, which is why wchar_t and wstring are not very portable and not well supported on Mac OS X.
One example, with caveats about how you have to customize it for your particular mode of use, is here.
I have done some research on getting UTF-8/16 to work properly in cmd.exe. I've found these articles:
https://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
http://www.siao2.com/2008/03/18/8306597.aspx
and also this SO question: Output unicode strings in Windows console app
The life-saving function is _setmode, which causes cmd.exe to Just Work™. But what does it actually do? The first article states that
The Visual C++ runtime library can convert automatically between internal UTF-16 and external UTF-8, if you just ask it to do so by calling the _setmode function with the appropriate file descriptor number and mode flag. E.g., mode _O_U8TEXT causes conversion to/from UTF-8.
That's all nice, but the following (to me) sort of contradicts it.
Let's take this simple program:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>   // for getchar
#include <iostream>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"привет śążź Ειρήνη";
    // yes, wcout; I can use both wprintf and wcout, they both seem to have the same effect
    getchar();
    return 0;
}
This prints to the console properly (provided we select the right font, of course); without the _setmode call I get garbage. But what is actually being translated here? What does the function really do? Does it convert FROM UTF-16 to whatever code page the console is using? Windows uses UTF-16 internally, so why is a conversion needed in the first place?
Furthermore, if I change the second parameter to _O_U8TEXT, the program works just as well as with _O_U16TEXT, which confuses me further; the UTF-16 representation of и is very different from the UTF-8 one, so how come this still works?
I should mention that I'm using Visual Studio 2015 (MSVC 14.0) and the source file is encoded as UTF-8 with BOM.
Is it possible to print UTF-8 strings without using platform-specific functions?
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main()
{
    ios_base::sync_with_stdio(false);
    wcout.imbue(locale("en_US.UTF-8")); // broken on Windows (?)
    wstring ws1 = L"Wide string.";
    wstring ws2 = L"Wide string with special chars \u20AC"; // Euro character
    wcout << ws1 << endl;
    wcout << ws2 << endl;
    wcout << ws1 << endl;
}
I get this runtime error:
terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
If I remove the line wcout.imbue(locale("en_US.UTF-8"));, I get only ws1 printed, and just once.
In another question ("How can I cin and cout some unicode text?"), Philipp writes:
"wcin and wcout don't work on Windows, just like the equivalent C functions. Only the native API works." Is it true for MinGW, too?
Thank you for any hint!
Platform:
MinGW/GCC
Windows 7
I haven't used GCC in a MinGW environment on Windows, but from what I gather it doesn't support C++ locales.
Since it doesn't support C++ locales this isn't really relevant, but FYI: Windows doesn't use the same locale naming scheme as most other platforms. It uses a similar language_country.encoding format, but the language and country are full names rather than ISO codes, and the encoding is a Windows code page number. So the locale would be "English_United States.65001"; however, this is not a supported combination (code page 65001, i.e. UTF-8, isn't supported as part of any locale).
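For illustration, a minimal sketch of that naming scheme, assuming MSVC's CRT (which accepts these Windows-style names; MinGW's libstdc++ typically accepts only "C" and ""):

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    try {
        // Windows-style name: full language/country names plus a code page number.
        std::locale win("English_United States.1252");
        std::wcout.imbue(win);
        std::wcout << L"locale imbued\n";
    } catch (const std::runtime_error& e) {
        // "English_United States.65001" (UTF-8) would end up here:
        // code page 65001 is not a supported locale code page.
        std::cerr << "locale not supported: " << e.what() << "\n";
    }
    return 0;
}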
The reason that only ws1 prints, and only once, is that when the character \u20AC is printed the stream fails and the failbit is set. You have to clear the error before anything further will be printed.
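To illustrate that last point, a small self-contained sketch of the failure mode and the clear() call (behaviour as described above, on a platform where \u20AC cannot be converted):

#include <iostream>
#include <string>

int main()
{
    std::wstring ws1 = L"Wide string.";
    std::wstring ws2 = L"Wide string with special chars \u20AC";

    std::wcout << ws1 << std::endl;  // prints
    std::wcout << ws2 << std::endl;  // conversion of \u20AC fails; failbit is set
    std::wcout << ws1 << std::endl;  // silently dropped while the stream is in a failed state

    std::wcout.clear();              // clear the error flags...
    std::wcout << ws1 << std::endl;  // ...and output works again (ws2 still can't be converted)
    return 0;
}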
C++11 introduced some things that will portably deal with UTF-8, but not everything is supported yet, and the additions don't completely solve the problem. But here's the way things currently stand:
When char16_t and char32_t are supported in VS as native types rather than typedefs you will be able to use the standard codecvt facet specializations codecvt<char16_t,char,mbstate_t> and codecvt<char32_t,char,mbstate_t> which are required to convert between UTF-16 or UTF-32 respectively, and UTF-8 (rather than the execution charset or system encoding). This doesn't work yet because in the current VS (and in VS11DP) these types are only typedefs and template specializations don't work on typedefs, but the code is already in the headers in VS 2010, just protected behind an #ifdef.
The standard also defines some special purpose codecvt facet templates which are supported, codecvt_utf8, and codecvt_utf8_utf16. The former converts between UTF-8 and either UCS-2 or UCS-4 depending on the size of the wide char type you use, and the latter converts between UTF-8 and UTF-16 code units independent of the size of the wide char type.
std::wcout.imbue(std::locale(std::locale::classic(),new std::codecvt_utf8_utf16<wchar_t>()));
std::wcout << L"ØÀéîðüýþ\n";
This will output UTF-8 code units through whatever is attached to wcout. If output has been redirected to a file, then opening it will show a UTF-8 encoded file. However, because of the console model on Windows and the way the standard streams are implemented, you will not get correct display of Unicode characters in the command prompt this way (even if you set the console output code page to UTF-8 with SetConsoleOutputCP(CP_UTF8)). The UTF-8 code units are output one at a time, and the console looks at each individual chunk passed to it, expecting each chunk (i.e., each single byte in this case) to be a complete and valid encoding. Incomplete or invalid sequences in a chunk (which here means every byte of every multibyte character representation) will be replaced with U+FFFD when the string is displayed.
If instead of using iostreams you use the C function puts to write out an entire UTF-8 encoded string (and if the console output code page is correctly set), then you can print a UTF-8 string and have it displayed in the console. The same codecvt facets can be used with some other C++11 convenience classes to do this:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
puts(convert.to_bytes(L"ØÀéîðüýþ\n").c_str());
The above is still not quite portable, because it assumes that wchar_t is UTF-16, which is the case on Windows but not on most other platforms, and it is not required by the standard. (In fact my understanding is that it's not technically conforming because UTF-16 needs multiple code units to represent some characters and the standard requires that all characters in the chosen encoding must be representable in a single wchar_t).
std::wstring_convert<std::codecvt_utf8<wchar_t>,wchar_t> convert;
The above will portably handle UCS-4 and UCS-2, but won't work outside the Basic Multilingual Plane on platforms using UTF-16.
You could use the conditional type trait to select between these two facets based on the size of wchar_t and get something that mostly works:
std::wstring_convert<
    std::conditional<
        sizeof(wchar_t) == 2,
        std::codecvt_utf8_utf16<wchar_t>,
        std::codecvt_utf8<wchar_t>
    >::type,
    wchar_t
> convert;
Or just use preprocessor macros to define an appropriate typedef, if your coding standards allow macros.
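For instance, a rough sketch of the macro route; the utf8_facet name is invented for illustration, and it keys off WCHAR_MAX (from <cstdint>) rather than sizeof, since sizeof isn't available to the preprocessor:

#include <codecvt>
#include <cstdint>
#include <locale>
#include <string>

// Made-up typedef name: pick the facet based on the width of wchar_t.
#if WCHAR_MAX <= 0xFFFF
typedef std::codecvt_utf8_utf16<wchar_t> utf8_facet;  // 16-bit wchar_t: treat it as UTF-16
#else
typedef std::codecvt_utf8<wchar_t> utf8_facet;        // 32-bit wchar_t: treat it as UCS-4
#endif

int main()
{
    std::wstring_convert<utf8_facet, wchar_t> convert;
    std::string utf8 = convert.to_bytes(L"ØÀéîðüýþ\n");
    return 0;
}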
Windows support for UTF-8 is pretty poor, and whilst it's possible to do it using the Windows API it's not at all fun; also, your question specifies that you DON'T want to use platform-specific functions...
As for doing it in 'standard C++', I'm not sure if it's possible under Windows without platform-specific code. HOWEVER, there are numerous third-party libraries available which will abstract away these platform details and allow you to write portable code.
I have recently updated my applications to use UTF-8 internally with the help of the Boost.Locale library.
http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/index.html
Its locale generation class will allow you to generate a UTF-8 based locale object which you can then imbue into all the standard streams etc.
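A minimal sketch of that approach (assuming Boost.Locale is installed and you link against boost_locale; the names follow the Boost documentation linked above):

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::locale::generator gen;
    std::locale utf8_locale = gen("en_US.UTF-8");  // UTF-8 based locale object

    std::locale::global(utf8_locale);  // make it the global default
    std::cout.imbue(utf8_locale);      // imbue the streams you actually use

    std::string text = "Gr\xC3\xBC\xC3\x9F Gott";  // "Grüß Gott", UTF-8 encoded
    std::cout << text << "\n";
    return 0;
}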
I am using this right now under both MSVC and GCC via MinGW-w64, successfully! I highly suggest you check it out. Yes, unfortunately it's not technically 'standard C++'; however, Boost is available pretty much everywhere and is practically a de facto standard, so I don't think that's a huge concern.
I am using gSoap to generate ANSI C source code that I would like to build within the LabWindows/CVI environment, on a Windows 7, 64-bit OS. The gSoap file stdsoap2.c includes several instances of the _setmode() function, with the following prototype:
int _setmode (int fd, int mode);
Where fd is a file descriptor, and mode is set to either _O_TEXT or _O_BINARY.
Oddly enough, even though LW/CVI contains an interface to Microsoft's SDK, this SDK does not contain a prototype for _setmode in any of its included header files, even though the SDK's help link contains information on the function.
Is anyone aware of the method in LabWindows/CVI used to set the file (or stream) translation mode to text or binary?
Thanks,
Ryyker
Closing the loop on this question.
I could not use the only offered answer for the reason listed in my comment above.
Although I did use the SDK, it was not to select a different version of the OpenFile function; rather, it was to support the use of a function that an auto-code generator used, _setmode(), which was not supported by my primary development environment (LabWindows/CVI).
So, in summary, my solution WAS to include the SDK to give me the definition for _setmode, as well as including the following in my non-auto-generated code:
#define _O_TEXT 0x4000 /* file mode is text (translated) */
#define _O_BINARY 0x8000 /* file mode is binary (untranslated) */
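Put together, a rough sketch of how the pieces fit (assuming the SDK headers pulled into the project declare _setmode() and _fileno(); the flag values above match Microsoft's <fcntl.h>):

#include <stdio.h>
#include <io.h>   /* SDK header that declares _setmode() and _fileno() */

#define _O_TEXT   0x4000  /* file mode is text (translated) */
#define _O_BINARY 0x8000  /* file mode is binary (untranslated) */

int main(void)
{
    /* switch stdout to binary (untranslated) mode, as stdsoap2.c does for its streams */
    int previous_mode = _setmode(_fileno(stdout), _O_BINARY);
    if (previous_mode == -1)
        perror("_setmode");
    return 0;
}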
So, with the caveat that this post describes what I actually did, I am going to mark the answer @gary offered as the answer, as it was in the ballpark. Thanks @gary.
It sounds like you just want to open a file as either ASCII or binary. So you should be able to replace the instances of _setmode() with the LW/CVI OpenFile() function as described here. Here's a short example reading a file as binary.
char filename[] = "path//to//file.ext";
int result;

result = OpenFile(filename, VAL_READ_ONLY, VAL_OPEN_AS_IS, VAL_BINARY);
if (result < 0)
{
    // Error, notify user.
}
else
{
    // No error.
}
Also note this warning from the page:
Caution: The Windows SDK also contains an OpenFile function. If you include windows.h and do not include formatio.h, you will get compile errors if you call OpenFile.
I'm in the process of trying to learn Unicode. For me, the most difficult part is the encoding. Can BSTRs (basic strings) contain code points U+10000 or higher? If not, what's the encoding for BSTRs?
In Microsoft-speak, Unicode is generally synonymous with UTF-16 (little-endian, if memory serves). In the case of BSTR, the answer seems to be that it depends:
On Microsoft Windows, consists of a string of Unicode characters (wide or double-byte characters).
On Apple Power Macintosh, consists of a single-byte string.
May contain multiple embedded null characters.
So, on Windows, yes, it can contain characters outside the Basic Multilingual Plane, but these will require two 'wide' chars to store.
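A small sketch of that last point (Windows-only; link with oleaut32):

#include <windows.h>
#include <oleauto.h>
#include <cstdio>

int main()
{
    // U+1D11E (musical symbol G clef) lies outside the BMP, so UTF-16 stores it
    // as a surrogate pair: one code point, two 16-bit code units.
    BSTR s = SysAllocString(L"\U0001D11E");
    std::printf("WCHARs used: %u\n", SysStringLen(s));  // prints 2
    SysFreeString(s);
    return 0;
}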
BSTRs on Windows originally contained UCS-2, but can in principle contain the entire Unicode set, using surrogate pairs. UTF-16 support is actually up to the API that receives the string; the BSTR itself has no say in how it gets treated. Most APIs support UTF-16 by now. (Michael Kaplan sorts out the details.)
The Windows headers still contain another definition for BSTR; it's basically:
#if defined(_WIN32) && !defined(OLE2ANSI)
typedef wchar_t OLECHAR;
#else
typedef char OLECHAR;
#endif
typedef OLECHAR * BSTR;
There's no real reason to consider the char, however, unless you desperately want to be compatible with whatever this was for. (IIRC it was active - or could be activated - for early MFC builds, and might even have been used in Office for Mac or something like that.)
What's the meaning of BSTR, LPCOLESTR, LPCWSTR, LPTSTR, LPCWCHAR, and many others if they're all just a bunch of defines that resolve to wchar_t anyway?
LPTSTR indicates the string buffer can be ANSI or UNICODE depending on the definition of the macro: UNICODE.
LPCOLESTR was invented by the OLE team; it switches its behaviour between char and wchar_t based on the definition of OLE2ANSI
LPCWSTR is a wchar_t string
BSTR is an LPOLESTR that's been allocated with SysAllocString.
LPCWCHAR is a pointer to a single constant wide character.
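For reference, a simplified sketch of how these names boil down in the Windows headers (condensed; the real definitions carry additional annotations and live across several headers):

// TCHAR / LPTSTR switch on the UNICODE macro:
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
typedef TCHAR* LPTSTR;

// The wide-only names do not switch:
typedef wchar_t WCHAR;
typedef const WCHAR* LPCWSTR;    // pointer to a constant wide string
typedef const WCHAR* LPCWCHAR;   // pointer to a constant wide character
typedef WCHAR OLECHAR;           // when OLE2ANSI is not defined
typedef OLECHAR* BSTR;           // plus the SysAllocString allocation contract
typedef const OLECHAR* LPCOLESTR;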
They're actually all rather different, or at least they were at some time. OLE was developed with, and needed, wide strings while the Windows API was still Win16 and didn't support wide strings natively at all.
Also, early versions of the Windows SDK didn't use wchar_t for WCHAR, but unsigned short. The Windows SDK on GCC gets interesting: I'm led to believe that 32-bit GCC has a 32-bit wchar_t, and on compilers with a 32-bit wchar_t, WCHAR would be defined as unsigned short or some other type that's 16 bits on that compiler.
LPTSTR, LPWSTR, and similar defines are really just defines. BSTR and LPOLESTR have special meanings: they indicate that the string pointed to is allocated in a special way.
A string pointed to by a BSTR must be allocated with the SysAllocString() family of functions. A string pointed to by an LPOLESTR is usually allocated with CoTaskMemAlloc() (this should be looked up in the documentation for the COM call accepting/returning it).
If the allocation/deallocation requirements for strings pointed to by BSTR and LPOLESTR are violated, the program can run into undefined behaviour.
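A minimal sketch of the correct pairings (Windows-only; check each COM method's documentation for which allocator it actually expects):

#include <windows.h>
#include <oleauto.h>
#include <objbase.h>
#include <cwchar>

int main()
{
    // BSTR: allocate with SysAllocString (or friends), free with SysFreeString.
    BSTR b = SysAllocString(L"hello");
    SysFreeString(b);

    // LPOLESTR crossing a COM boundary: typically CoTaskMemAlloc / CoTaskMemFree.
    LPOLESTR s = static_cast<LPOLESTR>(CoTaskMemAlloc(6 * sizeof(OLECHAR)));
    if (s != NULL)
    {
        std::wcscpy(s, L"hello");  // 5 characters plus the terminating null fit in 6 OLECHARs
        CoTaskMemFree(s);
    }
    return 0;
}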
MSDN's page on Windows Data Types might provide clarification as to the differences between some of these data types.
LPCWSTR - Pointer to a constant null-terminated string of 16-bit
Unicode characters.
LPTSTR - An LPWSTR if UNICODE is defined, an LPSTR otherwise.