UTF8ToUTF16 failing - c++11

I have the following code, which is three pairs of functions for converting UTF-8 to UTF-16 and vice versa, each pair using a different technique.
However, all of them fail:
std::ostream& operator << (std::ostream& os, const std::string &data)
{
SetConsoleOutputCP(CP_UTF8);
DWORD slen = data.size();
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), &slen, nullptr);
return os;
}
std::wostream& operator <<(std::wostream& os, const std::wstring &data)
{
DWORD slen = data.size();
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), slen, &slen, nullptr);
return os;
}
std::wstring AUTF8ToUTF16(const std::string &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}
std::string AUTF16ToUTF8(const std::wstring &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}
std::wstring BUTF8ToUTF16(const std::string& utf8)
{
std::wstring utf16;
int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
if (len > 1)
{
utf16.resize(len - 1);
wchar_t* ptr = &utf16[0];
MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, ptr, len);
}
return utf16;
}
std::string BUTF16ToUTF8(const std::wstring& utf16)
{
std::string utf8;
int len = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, NULL, 0, 0, 0);
if (len > 1)
{
utf8.resize(len - 1);
char* ptr = &utf8[0];
WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), -1, ptr, len, 0, 0);
}
return utf8;
}
std::string CUTF16ToUTF8(const std::wstring &data)
{
std::string result;
result.resize(std::wcstombs(nullptr, &data[0], data.size()));
std::wcstombs(&result[0], &data[0], data.size());
return result;
}
std::wstring CUTF8ToUTF16(const std::string &data)
{
std::wstring result;
result.resize(std::mbstowcs(nullptr, &data[0], data.size()));
std::mbstowcs(&result[0], &data[0], data.size());
return result;
}
int main()
{
std::string str = "консоли";
MessageBoxA(nullptr, str.c_str(), str.c_str(), 0); //Works Fine!
std::wstring wstr = AUTF8ToUTF16(str); //Crash!
MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Crash + Display nothing..
wstr = BUTF8ToUTF16(str);
MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Random chars..
wstr = CUTF8ToUTF16(str);
MessageBoxW(nullptr, wstr.c_str(), wstr.c_str(), 0); //Fail - Question marks..
std::cin.get();
}
The only thing that works above is the MessageBoxA. I don't understand why, because I'm told that Windows converts everything to UTF-16 internally anyway, so why can't I convert it myself?
Why do none of my conversions work? Is there a reason my code does not work?

The root problem with all of your approaches is that they require the std::string to be UTF-8 encoded, but std::string str = "консоли" is not UTF-8 encoded unless you save the .cpp file as UTF-8 and configure your compiler's default codepage to UTF-8. In most C++11 compilers, you can use the u8 prefix to force the string to use UTF-8:
std::string str = u8"консоли";
However, VS 2013 does not support that feature yet. From Support For C++11 Features (Unicode string literals): VS 2010: No; VS 2012: No; VS 2013: No.
Windows itself does not support UTF-8 in most API functions that take a char* as input (an exception is MultiByteToWideChar() when using CP_UTF8). When you call an A function, it calls the corresponding W function internally, converting any char* data to/from UTF-16 using Windows' default codepage (CP_ACP). So you get random results when you pass non-CP_ACP data to functions that are expecting CP_ACP data. As such, MessageBoxA() will work correctly only if your .cpp file and compiler are using the same codepage as CP_ACP, so the unprefixed char* data matches what MessageBoxA() is expecting.
I don't know why AUTF8ToUTF16() is crashing; probably a bug in your compiler's STL implementation when processing bad data.
BUTF8ToUTF16() does not handle the case described in the documentation: "If the input byte/char sequences are invalid, returns U+FFFD for UTF encodings." Also, your implementation is not optimal. Use length() instead of -1 on inputs to avoid dealing with null-terminator issues.
CUTF8ToUTF16() is not doing any error handling or validation. However, converting invalid input to question marks or U+FFFD is very common in most libraries.

Related

Wide-character string converted from multibyte string using MultiByteToWideChar print and write to file error

I can print the wide-character string correctly, and writing it to a text file also works, using the following code:
const wchar_t wcs[] = L"Hello大家好";
_setmode(_fileno(stdout), _O_WTEXT);
wprintf(L"%ls", wcs);
FILE *pf;
const wchar_t _FileName[] = L"wcs.txt";
pf = _wfopen(_FileName, L"a, ccs=UTF-8");
fwprintf(pf, L"%ls", wcs);
fflush(pf);
fclose(pf);
But I need to convert a multibyte string to a wide-character string using MultiByteToWideChar, and then print it or write it to a text file. When I do, the displayed result and the text file contents are all mojibake. My code is below.
const char cs[] = "Hello大家好";
wchar_t lpWideCharStr[64];
MultiByteToWideChar(CP_ACP, 0, cs, 0, lpWideCharStr, 64);
wprintf(L"%ls", lpWideCharStr);
FILE *pf;
pf = _wfopen(_FileName, L"a, ccs=UTF-8");
fwprintf(pf, L"%ls", lpWideCharStr);
fflush(pf);
fclose(pf);
Can anybody help? Thanks in advance.
This works for me.
#include <Windows.h>
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
bool ConvertToWideFromUTF8orACP(char* strData, wchar_t* w_strData, int w_strDataLen)
{
int nSize = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, strData, -1, w_strData, w_strDataLen);
if (nSize == 0)
{
nSize = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, strData, -1, w_strData, w_strDataLen);
}
return nSize > 1; // true only if at least one character (besides the terminator) was converted
}
int main()
{
char cs[] = "Hello大家好";
printf("%s\n", cs);
wchar_t* lpWideCharStr = new wchar_t[sizeof(cs)];
memset(lpWideCharStr, 0, sizeof(cs) * sizeof(wchar_t)); // zero the whole buffer, in bytes
int nSize = ConvertToWideFromUTF8orACP(cs, lpWideCharStr,sizeof(cs));
_setmode(_fileno(stdout), _O_U16TEXT);
if (nSize > 0)
{
wprintf(L"%s", lpWideCharStr);
}
const wchar_t _FileName[] = L"wcs.txt";
FILE* pf;
pf = _wfopen(_FileName, L"a, ccs=UTF-8");
fwprintf(pf, L"%ls", lpWideCharStr);
delete[] lpWideCharStr;
fflush(pf);
fclose(pf);
}
Thanks @Remy Lebeau for the help :)

Read binary data from QProcess in Windows

I have an .exe file (say some.exe) that writes binary data to the standard output. I have no sources for this program. I need to run some.exe from my C++/Qt application and read the standard output of the process I created. When I try to do this with QProcess::readAll, something replaces the byte \n (0x0a) with \r\n (0x0d 0x0a).
Here is a code:
QProcess some;
some.start( "some.exe", QStringList() << "-I" << "inp.txt" );
// some.setTextModeEnabled( false ); // no effect at all
some.waitForFinished();
QByteArray output = some.readAll();
I tried in cmd.exe to redirect output to file like this:
some.exe -I inp.txt > out.bin
and viewed out.bin with hexedit; there was 0d 0a in the place where 0a should be.
Edit:
Here is a simple program to emulate some.exe behaviour:
#include <stdio.h>
int main() {
char buf[] = { 0x00, 0x11, 0x0a, 0x33 };
fwrite( buf, sizeof( buf[ 0 ] ), sizeof( buf ), stdout );
}
run:
a.exe > out.bin
//out.bin
00 11 0d 0a 33
Note that I can't modify some.exe; that's why I shouldn't just add _setmode( _fileno( stdout ), _O_BINARY ) to my example either.
The question is: how can I tell QProcess, or Windows, or the console, not to replace LF with CR LF?
OS: Windows 7
Qt: 5.6.2
how can I tell QProcess or Windows or the console not to replace LF with CR LF?
They don't change anything. some.exe is broken. That's all. It outputs the wrong thing. Whoever made it output binary data in text mode has messed up badly.
There's a way to recover, though. You have to implement a decoder that will fix the broken output of some.exe. You know that every 0a has to be preceded by 0d. So you have to parse the output, and if you find a 0a with a 0d before it, remove the 0d and continue. Optionally, you can abort if a 0a is not preceded by 0d; since some.exe is broken in exactly this way, it should never produce such output.
The appendBinFix function takes the corrupted data and appends the fixed version to a buffer.
// https://github.com/KubaO/stackoverflown/tree/master/questions/process-fix-binary-crlf-51519654
#include <QtCore>
#include <algorithm>
bool appendBinFix(QByteArray &buf, const char *src, int size) {
bool okData = true;
if (!size) return okData;
constexpr char CR = '\x0d';
constexpr char LF = '\x0a';
bool hasCR = buf.endsWith(CR);
buf.resize(buf.size() + size);
char *dst = buf.end() - size;
const char *lastSrc = src;
for (const char *const end = src + size; src != end; src++) {
char const c = *src;
if (c == LF) {
if (hasCR) {
std::copy(lastSrc, src, dst);
dst += (src - lastSrc);
dst[-1] = LF;
lastSrc = src + 1;
} else
okData = false;
}
hasCR = (c == CR);
}
dst = std::copy(lastSrc, src, dst);
buf.resize(dst - buf.constData());
return okData;
}
bool appendBinFix(QByteArray &buf, const QByteArray &src) {
return appendBinFix(buf, src.data(), src.size());
}
The following test harness ensures that it does the right thing, including emulating the output of some.exe (itself):
#include <QtTest>
#include <cstdio>
#ifdef Q_OS_WIN
#include <fcntl.h>
#include <io.h>
#endif
const auto dataFixed = QByteArrayLiteral("\x00\x11\x0d\x0a\x33");
const auto data = QByteArrayLiteral("\x00\x11\x0d\x0d\x0a\x33");
int writeOutput() {
#ifdef Q_OS_WIN
_setmode(_fileno(stdout), _O_BINARY);
#endif
auto size = fwrite(data.data(), 1, data.size(), stdout);
qDebug() << size << data.size();
return (size == data.size()) ? 0 : 1;
}
class AppendTest : public QObject {
Q_OBJECT
struct Result {
QByteArray d;
bool ok;
bool operator==(const Result &o) const { return ok == o.ok && d == o.d; }
};
static Result getFixed(const QByteArray &src, int split) {
Result f;
f.ok = appendBinFix(f.d, src.data(), split);
f.ok = appendBinFix(f.d, src.data() + split, src.size() - split) && f.ok;
return f;
}
Q_SLOT void worksWithLFCR() {
const auto lf_cr = QByteArrayLiteral("\x00\x11\x0a\x0d\x33");
for (int i = 0; i < lf_cr.size(); ++i)
QCOMPARE(getFixed(lf_cr, i), (Result{lf_cr, false}));
}
Q_SLOT void worksWithCRLF() {
const auto cr_lf = QByteArrayLiteral("\x00\x11\x0d\x0a\x33");
const auto cr_lf_fixed = QByteArrayLiteral("\x00\x11\x0a\x33");
for (int i = 0; i < cr_lf.size(); ++i)
QCOMPARE(getFixed(cr_lf, i), (Result{cr_lf_fixed, true}));
}
Q_SLOT void worksWithCRCRLF() {
for (int i = 0; i < data.size(); ++i) QCOMPARE(getFixed(data, i).d, dataFixed);
}
Q_SLOT void worksWithQProcess() {
QProcess proc;
proc.start(QCoreApplication::applicationFilePath(), {"output"},
QIODevice::ReadOnly);
proc.waitForFinished(5000);
QCOMPARE(proc.exitCode(), 0);
QCOMPARE(proc.exitStatus(), QProcess::NormalExit);
QByteArray out = proc.readAllStandardOutput();
QByteArray fixed;
appendBinFix(fixed, out);
QCOMPARE(out, data);
QCOMPARE(fixed, dataFixed);
}
};
int main(int argc, char *argv[]) {
QCoreApplication app(argc, argv);
if (app.arguments().size() > 1) return writeOutput();
AppendTest test;
QTEST_SET_MAIN_SOURCE_PATH
return QTest::qExec(&test, argc, argv);
}
#include "main.moc"
Unfortunately it has nothing to do with QProcess, Windows, or the console. It's all about the CRT. Functions like printf or fwrite take the _O_TEXT flag into account and add an extra 0x0D before each 0x0A (true only on Windows). So the only solutions are to modify the stdout fields of your some.exe with WriteProcessMemory, call _setmode inside the address space of some.exe with a DLL-injection technique, or patch the binary. But it's a tricky job.

How to widen a standard string while maintaining the characters?

I have a function to take a std::string and change it into a wchar_t*. My current widen function looks like this
wchar_t* widen(const std::string& str){
wchar_t * dest = new wchar_t[str.size()+1];
char * temp = new char[str.size()];
for(int i=0;i<str.size();i++)
dest[i] = str[i];
dest[str.size()] = '\0';
return dest;
}
This works just fine for standard characters; however (and I cannot believe this hasn't been an issue before now), when I have characters like á, é, í, ó, ú, ñ, or ü, it breaks and the results are vastly different.
Ex: my str comes in as "Database Function: áFákéFúnctíóñü"
But dest ends up as: "Database Function: £F£k←Fnct■￳￱"
How can I change from a std::string to a wchar_t* while maintaining international characters?
Short answer: You can't.
Longer answer: std::string contains char elements which typically contain ASCII in the first 127 values, while everything else ("international characters") is in the values above (or the negative ones, if char is signed). In order to determine the according representation in a wchar_t string, you first need to know the encoding in the source string (could be ISO-8859-15 or even UTF-8) and the one in the target string (often UTF-16, UCS2 or UTF-32) and then transcode accordingly.
It depends on whether the source is using the old ANSI codepage or UTF-8. For the ANSI codepage, you have to know the locale and use mbstowcs. For UTF-8 you can convert to UTF-16 using codecvt_utf8_utf16. However, codecvt_utf8_utf16 is deprecated and has no replacement as of yet. On Windows you can use WinAPI functions to do the conversions more reliably.
#include <iostream>
#include <string>
#include <codecvt>
std::wstring widen(const std::string& src)
{
int len = src.size();
std::wstring dst(len + 1, 0);
mbstowcs(&dst[0], src.c_str(), len);
return dst;
}
int main()
{
//ANSI code page?
std::string src = "áFákéFúnctíóñü";
setlocale(LC_ALL, "en"); //English assumed
std::wstring dst = widen(src);
std::wcout << dst << "\n";
//UTF8?
src = u8"áFákéFúnctíóñü";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
dst = convert.from_bytes(src);
std::wcout << dst << "\n";
return 0;
}
For a Windows solution, here are some utility functions I use, based on the wisdom of http://utf8everywhere.org/
/// Convert a Windows UTF-16 string to a UTF-8 string
///
/// \param wstr[in] the UTF-16 string
/// \return std::string UTF-8 string
inline std::string Narrow(std::wstring_view wstr) {
if (wstr.empty()) return {};
int len = ::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), nullptr, 0,
nullptr, nullptr);
std::string out(len, 0);
::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), &out[0], len,
nullptr, nullptr);
return out;
}
/// Convert a UTF-8 string to a Windows UTF-16 string
///
/// \param str[in] the UTF-8 string
/// \return std::wstring UTF-16 string
inline std::wstring Widen(std::string_view str) {
if (str.empty()) return {};
int len = ::MultiByteToWideChar(CP_UTF8, 0, &str[0], str.size(), NULL, 0);
std::wstring out(len, 0);
::MultiByteToWideChar(CP_UTF8, 0, &str[0], str.size(), &out[0], len);
return out;
}
Usually used inline in windows API calls like:
std::string message = "Hello world!";
::MessageBoxW(NULL, Widen(message).c_str(), L"Title", MB_OK);
A cross-platform and possibly faster solution could be found by exploring Boost.Nowide's conversion functions: https://github.com/boostorg/nowide/blob/develop/include/boost/nowide/utf/convert.hpp

Concatenating TCHAR variables

I work with WinAPI and I have a function get_disk_drives() that retrieves the available disk drives, and a helper function get_current_disk_drive() that retrieves the full path and file name of the specified file.
void get_current_disk_drive(TCHAR dirname[]) {
TCHAR *fileExt = NULL;
TCHAR szDir[MAX_PATH];
GetFullPathName(dirname, MAX_PATH, szDir, &fileExt);
_tprintf(_T("Full path: %s \nFilename: %s\n"), szDir, fileExt);
}
void get_disk_drives() {
DWORD drives_bitmask = GetLogicalDrives();
for (int i = 0; i < 26; i++) {
if ((drives_bitmask >> i) & 1) {
TCHAR drive_name = (char)(65 + i);
TCHAR drive_path[] = drive_name + "\\";
get_current_disk_drive(drive_path);
}
}
}
int _tmain(int argc, _TCHAR* argv[]) {
get_disk_drives();
return 0;
}
Here I can't get the concatenation to work:
TCHAR drive_name = (char)(65 + i);
TCHAR drive_path[] = drive_name + "\\";
get_current_disk_drive(drive_path);
Why? Where is my mistake?
operator+ cannot be used to concatenate C-strings, string literals, or characters. The effect (for the expressions that are legal at all) is pointer arithmetic, not concatenation. To concatenate, you have to either explicitly call one of the strcat functions, or use std::basic_string instead:
typedef std::basic_string<TCHAR> tstring;
tstring drive_name;
drive_name += TCHAR( 65 + i );
tstring drive_path = drive_name + _T( '\\' );
You can access a C-string from a std::basic_string by invoking its c_str() member. Since this is a C-string represented as a pointer, you would have to change the signature of get_current_disk_drive to void get_current_disk_drive(const TCHAR* dirname), or pass a const tstring&.
It's also a good idea to stop using Code::Blocks. Defaulting to MBCS character encoding in 2015 is a crime.

Problem with writing out large temp files

I have a set of large image files that I'm using as temporary swap files on Windows in Visual Studio 2010. I'm writing and reading the files out as necessary.
Problem is, even though each of the files are the same size, I'm getting different file sizes.
So, I can do:
template <typename T>
std::string PlaceFileOnDisk(T* inImage, const int& inSize)
{
TCHAR lpTempPathBuffer[MAX_PATH];
TCHAR szTempFileName[MAX_PATH];
DWORD dwRetVal = GetTempPath(MAX_PATH, lpTempPathBuffer);
UINT uRetVal = GetTempFileName(lpTempPathBuffer, TEXT("IMAGE"), 0, szTempFileName);
FILE* fp;
fopen_s(&fp, szTempFileName, "w+");
fwrite(inImage, sizeof(T), inSize, fp);
fclose(fp);
std::string theRealTempFileName(szTempFileName);
return theRealTempFileName;
}
but that results in files between 53 and 65 MB in size (the image is 4713 * 5908 * sizeof(unsigned short) bytes, which should be about 53 MB every time).
I figured that maybe that 'fwrite' might not be stable for large files, so I broke things up into:
template <typename T>
std::string PlaceFileOnDisk(T* inImage, const int& inYSize, const int& inXSize)
{
TCHAR lpTempPathBuffer[MAX_PATH];
TCHAR szTempFileName[MAX_PATH];
DWORD dwRetVal = GetTempPath(MAX_PATH, lpTempPathBuffer);
UINT uRetVal = GetTempFileName(lpTempPathBuffer, TEXT("IMAGE"), 0, szTempFileName);
int y;
FILE* fp;
for (y = 0; y < inYSize; y++){
fopen_s(&fp, szTempFileName, "a");
fwrite(&(inImage[y*inXSize]), sizeof(T), inXSize, fp);
fclose(fp);
}
std::string theRealTempFileName(szTempFileName);
return theRealTempFileName;
}
Same thing: the files that are saved to disk are variable sized, not the expected size.
What's going on? Why are they not the same?
The read in function:
template <typename T>
T* RecoverFileFromDisk(const std::string& inFileName, const int& inSize){
T* theBuffer = NULL;
FILE* fp;
try {
theBuffer = new T[inSize];
fopen_s(&fp, inFileName.c_str(), "r");
fread(theBuffer, sizeof(T), inSize, fp);
fclose(fp);
}
catch(...){
if (theBuffer != NULL){
delete [] theBuffer;
theBuffer = NULL;
}
}
return theBuffer;
}
This function may be suffering from similar problems, but I'm not getting that far, because I can't get past the writing function.
I did try to use the read/write information on this page:
http://msdn.microsoft.com/en-us/library/aa363875%28v=vs.85%29.aspx
But the suggestions there just didn't work at all, so I went with the file functions with which I'm more familiar. That's where I got the temp file naming conventions, though.
Are you able to open the image after it's written? It sounds like you're having trouble writing it, too?
You're just asking about why the file sizes are different for the same size picture? What about how the sizes of the initial files compare to each other? It may have something to do with the how the initial image files are compressed.
I'm not sure what you're doing with the files, but have you considered a more basic "copy" function?
