How to widen a standard string while maintaining the characters? - c++11

I have a function that takes a std::string and turns it into a wchar_t*. My current widen function looks like this:
wchar_t* widen(const std::string& str) {
    wchar_t* dest = new wchar_t[str.size() + 1];
    for (size_t i = 0; i < str.size(); i++)
        dest[i] = str[i];
    dest[str.size()] = L'\0';
    return dest;
}
This works just fine for plain ASCII characters; however (and I cannot believe this hasn't been an issue before now), when I have characters like á, é, í, ó, ú, ñ, or ü, it breaks and the results are vastly different.
Ex: my str comes in as "Database Function: áFákéFúnctíóñü"
But dest ends up as: "Database Function: £F£k←Fnct■￳￱"
How can I change from a std::string to a wchar_t* while maintaining international characters?

Short answer: You can't.
Longer answer: std::string contains char elements, which typically hold ASCII in the first 127 values, while everything else ("international characters") lives in the values above that (or the negative ones, if char is signed). In order to determine the corresponding representation in a wchar_t string, you first need to know the encoding of the source string (could be ISO-8859-15 or even UTF-8) and the encoding of the target string (often UTF-16, UCS-2 or UTF-32), and then transcode accordingly.

It depends on whether the source uses a legacy ANSI code page or UTF-8. For an ANSI code page, you have to know the locale and use mbstowcs. For UTF-8 you can convert to UTF-16 using codecvt_utf8_utf16. However, codecvt_utf8_utf16 is deprecated (since C++17) and has no replacement as of yet. On Windows you can use WinAPI functions such as MultiByteToWideChar to make the conversions more reliably.
#include <iostream>
#include <string>
#include <clocale>   // setlocale
#include <cstdlib>   // mbstowcs
#include <locale>
#include <codecvt>

std::wstring widen(const std::string& src)
{
    std::wstring dst(src.size() + 1, 0);
    size_t len = mbstowcs(&dst[0], src.c_str(), src.size());
    if (len == (size_t)-1) len = 0; // invalid multibyte sequence
    dst.resize(len);                // trim padding and terminator
    return dst;
}

int main()
{
    // ANSI code page?
    std::string src = "áFákéFúnctíóñü";
    setlocale(LC_ALL, "en"); // English locale assumed
    std::wstring dst = widen(src);
    std::wcout << dst << "\n";

    // UTF-8?
    src = u8"áFákéFúnctíóñü";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    dst = convert.from_bytes(src);
    std::wcout << dst << "\n";
    return 0;
}

For a Windows solution, here are some utility functions I use based on the wisdom of http://utf8everywhere.org/
/// Convert a Windows UTF-16 string to a UTF-8 string
///
/// \param wstr the UTF-16 string
/// \return the UTF-8 string
inline std::string Narrow(std::wstring_view wstr) {
    if (wstr.empty()) return {};
    int len = ::WideCharToMultiByte(CP_UTF8, 0, wstr.data(), (int)wstr.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string out(len, 0);
    ::WideCharToMultiByte(CP_UTF8, 0, wstr.data(), (int)wstr.size(), &out[0], len,
                          nullptr, nullptr);
    return out;
}

/// Convert a UTF-8 string to a Windows UTF-16 string
///
/// \param str the UTF-8 string
/// \return the UTF-16 string
inline std::wstring Widen(std::string_view str) {
    if (str.empty()) return {};
    int len = ::MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), nullptr, 0);
    std::wstring out(len, 0);
    ::MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), &out[0], len);
    return out;
}
Usually used inline in Windows API calls like:
std::string message = "Hello world!";
::MessageBoxW(NULL, Widen(message).c_str(), L"Title", MB_OK);
A cross-platform and possibly faster solution could be found by exploring Boost.Nowide's conversion functions: https://github.com/boostorg/nowide/blob/develop/include/boost/nowide/utf/convert.hpp

Related

winAPI GetAdaptersAddresses unprintable friendly name

(https://learn.microsoft.com/en-us/windows/win32/api/iphlpapi/nf-iphlpapi-getadaptersaddresses)
Why are some of the user-friendly names in PIP_ADAPTER_ADDRESSES unprintable? (As well as a few other attributes, such as the DNS suffix.)
By unprintable, I mean containing non-printable characters. For example, the first character in one of the friendly names I tested had a Unicode value of 8207 (decimal).
A minimal, complete, verifiable example:
#include <winsock2.h>
#include <iphlpapi.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    PIP_ADAPTER_ADDRESSES adapterAddresses;
    DWORD dwReqSize;
    DWORD retVal;

    // First call just to learn the required buffer size
    retVal = GetAdaptersAddresses(AF_INET, GAA_FLAG_INCLUDE_PREFIX, NULL, NULL, &dwReqSize);
    if (retVal != ERROR_BUFFER_OVERFLOW) {
        return -1;
    }
    adapterAddresses = (PIP_ADAPTER_ADDRESSES)malloc(dwReqSize);

    // This time actually get the desired content
    retVal = GetAdaptersAddresses(AF_INET, GAA_FLAG_INCLUDE_PREFIX, NULL, adapterAddresses, &dwReqSize);
    if (retVal != ERROR_SUCCESS) {
        free(adapterAddresses);
        return -1;
    }
    for (PIP_ADAPTER_ADDRESSES adapter = adapterAddresses; adapter != NULL; adapter = adapter->Next)
    {
        printf("\tFriendly name: %ls\n", adapter->FriendlyName);
    }
    free(adapterAddresses);
    return 0;
}
I finally found a solution!
Meet _setmode(_fileno(stdout), _O_U16TEXT);
The problem was that the output stream wasn't accepting these characters because its mode was incorrect. At last, our desired output!
In order to use this you MUST: A) switch all occurrences of cout to wcout; B) switch all occurrences of printf to wprintf; C) include <io.h> and <fcntl.h>.

Concatenating TCHAR variables

I work with the WinAPI and I have a function get_disk_drives() that retrieves the available disk drives, and a helper function get_current_disk_drive() that retrieves the full path and file name of the specified file.
void get_current_disk_drive(TCHAR dirname[]) {
    TCHAR *fileExt = NULL;
    TCHAR szDir[MAX_PATH];
    GetFullPathName(dirname, MAX_PATH, szDir, &fileExt);
    _tprintf(_T("Full path: %s \nFilename: %s\n"), szDir, fileExt);
}

void get_disk_drives() {
    DWORD drives_bitmask = GetLogicalDrives();
    for (int i = 0; i < 26; i++) {
        if ((drives_bitmask >> i) & 1) {
            TCHAR drive_name = (char)(65 + i);
            TCHAR drive_path[] = drive_name + "\\"; // does not compile
            get_current_disk_drive(drive_path);
        }
    }
}

int _tmain(int argc, _TCHAR* argv[]) {
    get_disk_drives();
    return 0;
}
Here I can't make concatenation:
TCHAR drive_name = (char)(65 + i);
TCHAR drive_path[] = drive_name + "\\";
get_current_disk_drive(drive_path);
Why? Where is my mistake?
operator+ cannot be used for C-strings, string literals, or characters. The effect (for legal expressions anyway) is pointer arithmetic. For concatenation you have to either explicitly call one of the strcat functions, or use std::basic_string instead:
typedef std::basic_string<TCHAR> tstring;
tstring drive_name;
drive_name += TCHAR( 65 + i );
tstring drive_path = drive_name + _T( '\\' );
You can access a C-string from a std::basic_string by invoking its c_str() member. Since this is a C-string represented as a pointer, you would have to change the signature of get_current_disk_drive to void get_current_disk_drive(const TCHAR* dirname), or pass a const tstring&.
It's also a good idea to stop using Code::Blocks. Defaulting to MBCS character encoding in 2015 is a crime.

UTF8ToUTF16 failing

I have the following code, which is just three sets of functions for converting UTF-8 to UTF-16 and vice versa, using three different techniques.
However, all of them fail:
std::ostream& operator << (std::ostream& os, const std::string &data)
{
    SetConsoleOutputCP(CP_UTF8);
    DWORD slen = data.size();
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), &slen, nullptr);
    return os;
}

std::wostream& operator << (std::wostream& os, const std::wstring &data)
{
    DWORD slen = data.size();
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), slen, &slen, nullptr);
    return os;
}

std::wstring AUTF8ToUTF16(const std::string &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}

std::string AUTF16ToUTF8(const std::wstring &data)
{
    return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}

std::wstring BUTF8ToUTF16(const std::string& utf8)
{
    std::wstring utf16;
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (len > 1)
    {
        utf16.resize(len - 1);
        wchar_t* ptr = &utf16[0];
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, ptr, len);
    }
    return utf16;
}

std::string BUTF16ToUTF8(const std::wstring& utf16)
{
    std::string utf8;
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, NULL, 0, 0, 0);
    if (len > 1)
    {
        utf8.resize(len - 1);
        char* ptr = &utf8[0];
        WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, ptr, len, 0, 0);
    }
    return utf8;
}

std::string CUTF16ToUTF8(const std::wstring &data)
{
    std::string result;
    result.resize(std::wcstombs(nullptr, &data[0], data.size()));
    std::wcstombs(&result[0], &data[0], data.size());
    return result;
}

std::wstring CUTF8ToUTF16(const std::string &data)
{
    std::wstring result;
    result.resize(std::mbstowcs(nullptr, &data[0], data.size()));
    std::mbstowcs(&result[0], &data[0], data.size());
    return result;
}

int main()
{
    std::string str = "консоли";
    MessageBoxA(nullptr, str.c_str(), str.c_str(), 0);   // Works fine!
    std::wstring wstr = AUTF8ToUTF16(str);               // Crash!
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); // Fail - crash + displays nothing..
    wstr = BUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); // Fail - random chars..
    wstr = CUTF8ToUTF16(str);
    MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); // Fail - question marks..
    std::cin.get();
}
The only thing that works above is the MessageBoxA. I don't understand why, because I'm told that Windows converts everything to UTF-16 anyway, so why can't I convert it myself? Why does none of my conversions work?
The root problem why all of your approaches fail is that they require the std::string to be UTF-8 encoded but std::string str = "консоли" is not UTF-8 encoded unless you save the .cpp file as UTF-8 and configure your compiler's default codepage to UTF-8. In most C++11 compilers, you can use the u8 prefix to force the string to use UTF-8:
std::string str = u8"консоли";
However, VS 2013 does not support that feature yet. From the "Support For C++11 Features" table, for Unicode string literals: VS 2010: No; VS 2012: No; VS 2013: No.
Windows itself does not support UTF-8 in most API functions that take a char* as input (an exception is MultiByteToWideChar() when using CP_UTF8). When you call an A function, it calls the corresponding W function internally, converting any char* data to/from UTF-16 using Windows' default codepage (CP_ACP). So you get random results when you use non CP_ACP data with functions that are expecting it. As such, MessageBoxA() will work correctly only if your .cpp file and compiler are using the same codepage as CP_ACP so the unprefixed char* data matches what MessageBoxA() is expecting.
I don't know why AUTF8ToUTF16() is crashing, probably a bug in your compiler's STL implementation when processing bad data.
BUTF8ToUTF16() is not handling the case called out in the documentation: "If the input byte/char sequences are invalid, returns U+FFFD for UTF encodings." Also, your implementation is not optimal: pass length() instead of -1 for the input size to avoid dealing with null-terminator issues.
CUTF8ToUTF16() is not doing any error handling or validations. However converting non-valid input to question marks or U+FFFD is very common in most libraries.

How do you convert a 'System::String ^' to 'TCHAR'?

I asked a question here involving C++ and C# communicating. The problem got solved, but led to a new problem.
This returns a String (C#):
return Marshal.PtrToStringAnsi(decryptsn(InpData));
This expects a TCHAR* (C++):
lpAlpha2[0] = Company::Pins::Bank::Decryption::Decrypt::Decryption("123456");
I've googled how to solve this problem, but I am not sure why the String has a caret (^) on it. Would it be best to change the return type from String to something else that C++ would accept, or would I need to do a conversion before assigning the value?
String has a ^ because that's the marker for a managed reference. Basically, it's used the same way as * in unmanaged land, except it can only point to an object type, not to other pointer types, or to void.
TCHAR is #defined (or perhaps typedefed, I can't remember) to either char or wchar_t, based on the _UNICODE preprocessor definition. Therefore, I would use that and write the code twice.
Either inline:
TCHAR* str;
String^ managedString;
#ifdef _UNICODE
str = (TCHAR*) Marshal::StringToHGlobalUni(managedString).ToPointer();
#else
str = (TCHAR*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
#endif
// use str.
Marshal::FreeHGlobal(IntPtr(str));
or as a pair of conversion methods, both of which assume that the output buffer has already been allocated and is large enough. Method overloading should make it pick the correct one, based on what TCHAR is defined as.
void ConvertManagedString(String^ managedString, char* outString)
{
    char* str = (char*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
    strcpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}

void ConvertManagedString(String^ managedString, wchar_t* outString)
{
    wchar_t* str = (wchar_t*) Marshal::StringToHGlobalUni(managedString).ToPointer();
    wcscpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}
The syntax String^ is C++/CLI talk for "(garbage collected) reference to a System.String".
You have a couple of options for the conversion of a String into a C string, which is another way to express the TCHAR*. My preferred way in C++ would be to store the converted string into a C++ string type, either std::wstring or std::string, depending on you building the project as a Unicode or MBCS project.
In either case you can use something like this:
std::wstring tmp = msclr::interop::marshal_as<std::wstring>( /* Your .NET String */ );
or
std::string tmp = msclr::interop::marshal_as<std::string>(...);
Once you've converted the string into the correct wide or narrow string format, you can then access its C string representation using the c_str() function, like so:
callCFunction(tmp.c_str());
This assumes that callCFunction expects you to pass it a C-style char* or wchar_t* (which TCHAR* will "degrade" to, depending on your compilation settings).
That is a really rambling way to ask the question, but if you mean how to convert a String ^ to a char *, then you use the same marshaller you used before, only backwards:
char* unmanagedstring = (char *) Marshal::StringToHGlobalAnsi(managedstring).ToPointer();
Edit: don't forget to release the memory allocated when you're done using Marshal::FreeHGlobal.

Is there a format specifier that always means char string with _tprintf?

When you build an app on Windows using TCHAR support, %s in _tprintf() means char * string for Ansi builds and wchar_t * for Unicode builds while %S means the reverse.
But are there any format specifiers that always mean char * string, no matter whether it's an Ansi or Unicode build? Since even on Windows UTF-16 is not really used for files or networking, it turns out to still be fairly common that you'll want to deal with byte-based strings regardless of the native character type you compile your app as.
The h modifier forces both %s and %S to char*, and the l modifier forces both to wchar_t*, ie: %hs, %hS, %ls, and %lS.
This might also solve your problem:
_TCHAR *message;
_tprintf(_T("\n>>>>>> %d") TEXT(" message is:%s\n"),4,message);
You can easily write something like this:
#ifdef _UNICODE
#define PF_ASCIISTR L"%S"
#define PF_UNICODESTR L"%s"
#else
#define PF_ASCIISTR "%s"
#define PF_UNICODESTR "%S"
#endif
and then you use the PF_ASCIISTR or the PF_UNICODESTR macros in your format string, exploiting the C automatic string literals concatenation:
_tprintf(_T("There are %d ") PF_ASCIISTR _T(" over the table"), 10, "pens");
I found that _vsntprintf_s uses '%s' for type TCHAR and works for both GCC and MSVC.
So you could wrap it like:
int myprintf(const TCHAR* lpszFormat, va_list argptr) {
    int len = _vsctprintf(lpszFormat, argptr); // -1 on error
    if (len <= 0) { return len; }
    auto* pT = new TCHAR[2 + size_t(len)];
    // Note: the buffer size and count arguments are in TCHARs, not bytes
    _vsntprintf_s(pT, 2 + size_t(len), 1 + size_t(len), lpszFormat, argptr);
    int rv = _tprintf(_T("%s"), pT);
    delete[] pT;
    return rv;
}

int myprintf(const TCHAR* lpszFormat, ...) {
    va_list argptr;
    va_start(argptr, lpszFormat);
    int rv = myprintf(lpszFormat, argptr);
    va_end(argptr);
    return rv;
}
int main(int, char**) { return myprintf(_T("%s"), _T("Test")); }
