C++/CLI Changing encoding - utf-8

Good day,
I've been writing a simple program using the Windows API; it's written in C++/CLI.
The problem I've encountered is this: I'm loading a library (.dll) and then calling its functions. One of the functions returns a char*, so I add the returned value to my textbox:
output->Text = System::Runtime::InteropServices::Marshal::PtrToStringAnsi
(IntPtr(Function()));
Now, as you can see, this is decoded as ANSI, and the char* returned is, I presume, also ANSI (or Windows-1252, w/e you guys call it :>). The original data that the library function reads is encoded in UTF-8: a variable-length byte field terminated by 0x00. There are a lot of non-Latin characters in my program, so this is troubling. I've also tried this:
USES_CONVERSION;
wchar_t* pUnicodeString = 0;
pUnicodeString = A2W( Function());
output->Text = System::Runtime::InteropServices::Marshal::PtrToStringUni
(IntPtr(pUnicodeString));
using atlconv.h. It still prints malformed/wrong characters. So my question is: can I convert it to something like UTF-8 so that I see correct output, or does the char* lose the information required to do so? Maybe changing the .dll source code would help, but it's quite old and written in C, so I don't want to mess with it :/
I hope the information I provided was sufficient, if you need anything more, just ask.

As far as I know, there is no standard way to handle UTF-8. Try searching for appropriate converters, e.g. http://www.nuclex.org/articles/cxx/10-marshaling-strings-in-cxx-cli and the question "Convert from C++/CLI pointer to native C++ pointer".
Also, your second code snippet doesn't use pUnicodeString, it doesn't look right.
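On the question of whether the char* loses information: as far as I can tell it does not. The bytes are still UTF-8; both attempts above simply decode them with the ANSI code page. A minimal sketch of decoding such a buffer into UTF-16 in portable C++ follows. The function name utf8_to_utf16 is hypothetical, and the decoder is deliberately simple: malformed input becomes U+FFFD and overlong encodings are not rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: decode a NUL-terminated UTF-8 string into UTF-16 code units.
// Sketch only: malformed or truncated sequences become U+FFFD; overlong forms are not rejected.
std::u16string utf8_to_utf16(const char* s) {
    std::u16string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p) {
        std::uint32_t cp;
        int extra;  // number of continuation bytes expected after the lead byte
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }  // stray byte
        ++p;
        for (; extra > 0; --extra) {
            if ((*p & 0xC0) != 0x80) { cp = 0xFFFD; break; }  // missing continuation byte
            cp = (cp << 6) | (*p & 0x3F);
            ++p;
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

On Windows, MultiByteToWideChar with CP_UTF8 performs the same conversion, and the resulting UTF-16 buffer is the kind of input PtrToStringUni expects.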

Related

Getting the default RTL codepage in Lazarus

The Lazarus wiki states:
"Lazarus (actually its LazUtils package) takes advantage of that API and changes it to UTF-8 (CP_UTF8). It means also Windows users now use UTF-8 strings in the RTL"
In our cross-platform and cross-compiler code, we'd like to detect this specific situation. GetACP() Windows API function still returns "1252", and so does GetDefaultTextEncoding() function in Lazarus. But the text (specifically, the filename returned by FindFirst() function) contains the string with UTF8-encoded filename, and the codepage of the string (variable) is 65001 too.
So, how do we figure out that the RTL operates with UTF8 strings by default? I've spent several hours trying to figure this out from Lazarus source code, but probably I am missing something ...
I understand that in many scenarios, we need to inspect the codepage of each specific string, but I am interested in the way to find out the default RTL codepage which is UTF8 in Lazarus, yet Windows-defined one in FPC/Windows without Lazarus.
It turns out that there is no single code page variable or function. Results of filesystem API calls are converted to the code page defined in the DefaultRTLFileSystemCodePage variable. The only problem is that this variable is present in the source code and is supposed to be in the system unit, but the compiler doesn't see it.

How to decode a single UTF-8 character and step onto the next using only the Rust standard library?

Does Rust provide a way to decode a single character (unicode-scalar-value to be exact) from a &[u8], which may be multiple bytes, returning a single USV?
Something like GLib's g_utf8_get_char & g_utf8_next_char:
// Example of what GLib's functions might look like once ported to Rust.
let mut i = 0;
while i < slice.len() {
    let unicode_char = g_utf8_get_char(&slice[i..]);
    // do something with the unicode character
    function(unicode_char);
    // move onto the next.
    i += g_utf8_next_char(&slice[i..]);
}
Short of porting parts of the GLib API to Rust, does Rust provide a way to do this, besides some trial & error calls to from_utf8 which stop once the second character is reached?
See GLib's code.
No, there is no such functionality publicly exposed in the Rust standard library as of Rust 1.14.
And neither should there be. Rust doesn't believe in a gigantic standard library. Crates are trivial to use and prevent people from rewriting code. Many people have an incorrect opinion (yeah, that's right: an opinion is incorrect) that using dependencies makes their program weaker.
Anything put in the standard library has to be maintained forever. There are zero plans for a Rust 2.0 that would break backwards compatibility. Python is the normal example here, with a multitude of "get data from a URL" parts of the standard library that are all redundant and deprecated now. The Python maintainers have to waste time keeping those working, instead of advancing the language.
Third-party crates allow things to be created, evolve, and die without burdening the entire language.
You can convert a byte slice (&[u8]) into a string slice (&str) by using str::from_utf8 (note that this validates that the whole byte slice is valid UTF-8). You can then use the chars() iterator on the string slice to iterate on each character (char) in the string.

How to print UTF-8 strings without using platform specific functions?

Is it possible to print UTF-8 strings without using platform specific functions?
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main()
{
    ios_base::sync_with_stdio(false);
    wcout.imbue(locale("en_US.UTF-8")); // broken on Windows (?)
    wstring ws1 = L"Wide string.";
    wstring ws2 = L"Wide string with special chars \u20AC"; // Euro character
    wcout << ws1 << endl;
    wcout << ws2 << endl;
    wcout << ws1 << endl;
}
I get this runtime error:
terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
If I remove the line wcout.imbue(locale("en_US.UTF-8"));, I get only ws1 printed, and just once.
In another question ("How can I cin and cout some unicode text?"), Philipp writes:
"wcin and wcout don't work on Windows, just like the equivalent C functions. Only the native API works." Is that true for MinGW, too?
Thank you for any hint!
Platform:
MinGW/GCC
Windows 7
I haven't used gcc in a mingw environment on Windows, but from what I gather it doesn't support C++ locales.
Since it doesn't support C++ locales this isn't really relevant, but FYI, Windows doesn't use the same locale naming scheme as most other platforms. They use a similar language_country.encoding, but the language and country are not codes, and the encoding is a Windows code page number. So the locale would be "English_United States.65001", however this is not a supported combination (code page 65001 (UTF-8) isn't supported as part of any locale).
The reason that only ws1 prints, and only once is that when the character \u20AC is printed, the stream fails and the fail bit is set. You have to clear the error before anything further will be printed.
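The effect of the fail bit can be reproduced without a console. The sketch below uses a std::ostringstream and sets failbit by hand to stand in for the failed character conversion; the function name write_with_failure is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical demo: output sent to a stream in a failed state is silently dropped
// until the error flags are cleared.
std::string write_with_failure(bool clear_error) {
    std::ostringstream out;
    out << "first ";
    out.setstate(std::ios::failbit);  // stands in for the failed conversion of \u20AC
    out << "lost ";                   // ignored: the stream is in a failed state
    if (clear_error) {
        out.clear();                  // reset the error flags
    }
    out << "last";                    // written only if the error was cleared
    return out.str();
}
```

With clear_error set to false this mirrors the behaviour in the question: everything after the failure is dropped. Calling wcout.clear() after the failing write has the same effect as the out.clear() call here.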
C++11 introduced some things that will portably deal with UTF-8, but not everything is supported yet, and the additions don't completely solve the problem. But here's the way things currently stand:
When char16_t and char32_t are supported in VS as native types rather than typedefs you will be able to use the standard codecvt facet specializations codecvt<char16_t,char,mbstate_t> and codecvt<char32_t,char,mbstate_t> which are required to convert between UTF-16 or UTF-32 respectively, and UTF-8 (rather than the execution charset or system encoding). This doesn't work yet because in the current VS (and in VS11DP) these types are only typedefs and template specializations don't work on typedefs, but the code is already in the headers in VS 2010, just protected behind an #ifdef.
The standard also defines some special purpose codecvt facet templates which are supported, codecvt_utf8, and codecvt_utf8_utf16. The former converts between UTF-8 and either UCS-2 or UCS-4 depending on the size of the wide char type you use, and the latter converts between UTF-8 and UTF-16 code units independent of the size of the wide char type.
std::wcout.imbue(std::locale(std::locale::classic(),new std::codecvt_utf8_utf16<wchar_t>()));
std::wcout << L"ØÀéîðüýþ\n";
This will output UTF-8 code units through whatever is attached to wcout. If output has been redirected to file then opening it will show a UTF-8 encoded file. However, because of the console model on Windows, and the way the standard streams are implemented, you will not get correct display of Unicode characters in the command prompt this way (even if you set the console output code page to UTF-8 with SetConsoleOutputCP(CP_UTF8)). The UTF-8 code units are output one at a time, and the console will look at each individual chunk passed to it expecting each chunk (i.e. single byte in this case) passed to be complete and valid encodings. Incomplete or invalid sequences in the chunk (every byte of all multibyte character representations in this case) will be replaced with U+FFFD when the string is displayed.
If instead of using iostreams you use the C function puts to write out an entire UTF-8 encoded string (and if the console output code page is correctly set) then you can print a UTF-8 string and have it displayed in the console. The same codecvt facets can be used with some other C++11 convenience classes to do this:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
puts(convert.to_bytes(L"ØÀéîðüýþ\n").c_str());
The above is still not quite portable, because it assumes that wchar_t is UTF-16, which is the case on Windows but not on most other platforms, and it is not required by the standard. (In fact my understanding is that it's not technically conforming because UTF-16 needs multiple code units to represent some characters and the standard requires that all characters in the chosen encoding must be representable in a single wchar_t).
std::wstring_convert<std::codecvt_utf8<wchar_t>,wchar_t> convert;
The above will portably handle UCS-4 and UCS-2, but won't work outside the Basic Multilingual Plane on platforms using UTF-16.
You could use the conditional type trait to select between these two facets based on the size of wchar_t and get something that mostly works:
std::wstring_convert<
    std::conditional<sizeof(wchar_t)==2,
        std::codecvt_utf8_utf16<wchar_t>,
        std::codecvt_utf8<wchar_t>
    >::type,
    wchar_t
> convert;
Or just use preprocessor macros to define an appropriate typedef, if your coding standards allow macros.
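The codecvt_utf8 family and wstring_convert were later deprecated in C++17, so a hand-written conversion is another option. The sketch below is not the facet-based approach described above; it is a substitute that makes the same sizeof(wchar_t) decision by hand. The name wide_to_utf8 is hypothetical, and invalid input is mapped to U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8, whether wchar_t holds
// UTF-16 code units (2 bytes) or full code points (4 bytes).
std::string wide_to_utf8(const std::wstring& ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        // With a 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting std::string can be handed to puts in the same way as the to_bytes() result above.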
Windows support for UTF-8 is pretty poor. Whilst it's possible to do it using the Windows API, it's not at all fun. Also, your question specifies that you DON'T want to use platform-specific functions...
As for doing it in 'standard C++', I'm not sure if it's possible under Windows without platform specific code. HOWEVER, there are numerous third party libraries available which will abstract away these platform details and allow you to write portable code.
I have recently updated my applications to use UTF-8 internally with the help of the Boost.Locale library.
http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/index.html
Its locale generation class will allow you to generate a UTF-8 based locale object which you can then imbue into all the standard streams etc.
I am using this right now under both MSVC and GCC via MinGW-w64 successfully! I highly suggest you check it out. Yes, unfortunately it's not technically 'standard C++', however Boost is available pretty much everywhere, and is practically a de-facto standard, so I don't think that's a huge concern.

How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?
That is:
- Many programmers seem to feel that UTF-16 is harmful because it is a variable-length code.
- wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/MacOS.
- The Windows APIs use wide characters, not Unicode.
So what does Windows do when you want to code something like 𠂊 (U+2008A) Han Character on Windows?
The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.
So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
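For the character in the question, the surrogate sequence can be worked out by hand. A minimal sketch of the standard UTF-16 arithmetic follows; the function name to_surrogates is hypothetical, and the input is assumed to lie in the range U+10000 to U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above the BMP into a UTF-16 surrogate pair.
// Precondition (not checked): 0x10000 <= cp <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    std::uint32_t v = cp - 0x10000;                                   // 20 bits remain
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}
```

For U+2008A this gives the two code units 0xD840 and 0xDC8A, which is what ends up in a 16-bit wchar_t string on Windows.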
Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file 𐐀.txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).
But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called 𐐀.txt in the same folder as 𐐨.txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.
Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
Not all third party programs handle this correctly and so may be buggy with data outside the BMP.
Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).

Parsing CArchive (MFC classes) files in Ruby

I have a legacy app that seems to be exporting/saving files with CArchive (legacy MFC application).
We're currently refactoring the tool for the web. Is there a library I can look at in Ruby for parsing and loading these legacy files?
What possible libraries could I look into?
Problems with the file format according to XML serialization for MFC include:
- Non-robustness: your program will probably crash if you read an archive produced by another version of your program. This can be avoided by complex and unwieldy version management. By using XML, this can be largely avoided.
- Heavy dependencies between your program object model and the archived data. Change the program model and it is almost impossible to read data from a previous version.
- Archived data cannot be edited, understood, and changed, except with the associated application.
Also, four versions of the legacy software exist. How would I be able to overcome this object model / archived data problem for the different versions? Total backward (import) capabilities are required.
CArchive doesn't have a format that you can parse. It's just a binary file. You have to know what is in it to know how to read it. A library could make it easier to read some data types (CString, CArray, etc.) but I'm not sure you'll find anything like this.
CArchive works like this (storing part):
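CFile file(_T("data.bin"), CFile::modeCreate | CFile::modeWrite); // hypothetical file name
CArchive ar(&file, CArchive::store); // a CArchive needs a CFile to write to
int i = 5;
float f = 5.42f;
CString str("string");
ar << i << f << str; // values are written back to back, in this order
ar.Close();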
Then all of this is dumped into a binary file. You would have to read the binary data and somehow interpret it. This is easy in C++ because MFC knows how to serialize types, including complex types like CString and CArray. But you'll have to do this on your own using Ruby.
For example you might read 4 bytes (because you know that int is that big) and interpret it as integer. Next four bytes for float. And then you have to see how to load CString, it stores the length first and then data, but you'll have to take a look at the exact format it uses. You could create utility functions for each type to make your life easier but don't expect this to be simple.
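The reading steps just described can be sketched as follows. This is written in C++ only to keep the example self-contained; the same logic would carry over to Ruby. Everything about the layout is an assumption: a 4-byte int and a 4-byte float in the host's byte order, and a string stored as a one-byte length followed by single-byte characters. The real CString layout needs to be checked against MFC, and the names Record, parse_record and sample_archive are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

struct Record {
    std::int32_t i;
    float f;
    std::string str;
};

// Read the fields back in the order they were written.
// ASSUMED layout: int32, float, then a 1-byte length and that many characters.
Record parse_record(const std::vector<std::uint8_t>& buf) {
    std::size_t pos = 0;
    auto take = [&](void* dst, std::size_t n) {
        if (n > buf.size() - pos) throw std::runtime_error("truncated archive");
        std::memcpy(dst, buf.data() + pos, n);
        pos += n;
    };
    Record r;
    take(&r.i, sizeof r.i);          // 4 bytes, interpreted as an integer
    take(&r.f, sizeof r.f);          // next 4 bytes, interpreted as a float
    std::uint8_t len = 0;
    take(&len, 1);                   // assumed one-byte length prefix
    r.str.resize(len);
    if (len > 0) take(&r.str[0], len);
    return r;
}

// Build a buffer in the assumed layout, mirroring "ar << i << f << str".
std::vector<std::uint8_t> sample_archive() {
    std::vector<std::uint8_t> buf;
    auto put = [&](const void* src, std::size_t n) {
        const std::uint8_t* p = static_cast<const std::uint8_t*>(src);
        buf.insert(buf.end(), p, p + n);
    };
    std::int32_t i = 5;
    float f = 5.42f;
    std::string s = "string";
    std::uint8_t len = static_cast<std::uint8_t>(s.size());
    put(&i, sizeof i);
    put(&f, sizeof f);
    put(&len, 1);
    put(s.data(), s.size());
    return buf;
}
```

One small reader per data type, like the take lambda here, is the kind of utility function mentioned above; each additional MFC type would need its own reader once its on-disk format is known.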
You could write an exporter in C++ using the old functionality that reads in the CArchive and then outputs an XML file (or whatever) of the contents. Reading CArchives directly from Ruby (or any language other than C++/MFC) is going to be a major project. Maybe you can get away with it if the data that is written is just a struct with a few ints or longs, but as soon as your CArchive contains UDTs you're in for a world of pain. For example, I don't even think CArchive makes promises on alignment.

Resources