WideCharToMultiByte when is lpUsedDefaultChar true? - windows

I am trying to understand WideCharToMultiByte and I was wondering when lpUsedDefaultChar would be set to be TRUE.
Here is a sample: What should be lpszW inorder for the flag to be set to be true?
lpszW = L”__WHAT SHOULD_BE_HERE__”;
int c = ??;
BOOL fUsedDefaultChar = false;
DWORD dwSize = WideCharToMultiByte(CP_ACP, 0, lpszW, c, myOutStr ,myOutLen, NULL, &fUsedDefaultChar);
http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx
Any books/tutorials for understanding Unicode/UTF stuff would be great.
Thanks!

Anything that is not present in the current codepage will map to ? (by default) and UsedDefaultChar will be != FALSE.
Windows-1252 is probably the most common codepage and most of those characters map to the same value in unicode.
Take Ω (ohm) for example, it is probably not present in whatever your current codepage is and therefore will not map to a valid narrow character:
BOOL fUsedDefaultChar=FALSE;
DWORD dwSize;
char myOutStr[MAX_PATH];
WCHAR lpszW[10]=L"Hello";
*lpszW=0x2126; //ohm sign, you could also use the \u2126 syntax if your compiler supports it.
dwSize = WideCharToMultiByte(CP_ACP, 0, lpszW, -1, myOutStr ,MAX_PATH, NULL, &fUsedDefaultChar);
printf("%d %s\n",fUsedDefaultChar,myOutStr); //This prints "1 ?ello" on my system

The MSDN documentation is very clear about when lpUsedDefaultChar is set to TRUE:
lpDefaultChar [in] Optional. Pointer
to the character to use if a character
cannot be represented in the specified
code page. The application sets this
parameter to NULL if the function is
to use a system default value. To
obtain the system default character,
the application can call the GetCPInfo
or GetCPInfoEx function.
lpUsedDefaultChar [out] Optional.
Pointer to a flag that indicates if
the function has used a default
character in the conversion. The flag
is set to TRUE if one or more
characters in the source string cannot
be represented in the specified code
page. Otherwise, the flag is set to
FALSE. This parameter can be set to
NULL.
That does not leave much room for misunderstanding, in my opinion.

Related

Is MAX_PATH enough to hold the path from GetSystemDirectory()?

From my understanding, that path will be a single-letter (the driver), followed by "\WINDOWS\SYSTEM32" so that MAX_PATH is more than enough to hold that path filled by GetSystemDirectory(). So it's safe to do:
TCHAR dir[MAX_PATH] = {0};
if(GetSystemDirectory(dir, sizeof(dir) / sizeof(*dir)) == 0) {
// check for GetLastError()
}
Or am I missing something?
The documentation provided for the recommended alternative to GetSystemDirectory (which is ShGetFolderPath) says the following about its pszPath parameter:
A pointer to a null-terminated string of length MAX_PATH which will
receive the path. If an error occurs or S_FALSE is returned, this
string will be empty. The returned path does not include a trailing
backslash. For example, "C:\Users" is returned rather than
"C:\Users\".
So, yes, MAX_PATH will be a big enough buffer size.

How to detect snprintf failure?

I am using snprintf to format string using user-defined format (also given as string). The code looks like this:
void DataPoint::valueReceived( QVariant value ) {
// Get the formating QVariant, which is only considered valid if it's string
QVariant format = this->property("format");
if( format.isValid() && format.type()==QMetaType::QString && !format.isNull() ) {
// Convert QString to std string
const std::string formatStr = format.toString().toStdString();
LOGMTRTTIINFO(pointName<<"="<<value.toString().toUtf8().constData()<<"=>"<<formatStr<<"["<<formatStr.length()<<'\n');
// The attempt to catch exceptions caused by invalid formating string
try {
if( value.type() == QMetaType::QString ) {
// Treat value as string (values are allways ASCII)
const std::string array = value.toString().toStdString();
const char* data = (char*)array.c_str();
// Assume no more than 10 characters are added during formating.
char* result = (char*)calloc(array.length()+10, sizeof(char));
snprintf(result, array.length()+10, formatStr.c_str(), data);
value = result;
}
// If not string, then it's a number.
else {
double data = value.toDouble();
char* result = (char*)calloc(30, sizeof(char));
// Even 15 characters is already longer than largest number you can make any sense of
snprintf(result, 30, formatStr.c_str(), data);
LOGMTRTTIINFO(pointName<<"="<<data<<"=>"<<formatStr<<"["<<formatStr.length()<<"]=>"<<result<<'\n');
value = result;
}
} catch(...) {
LOGMTRTTIERR("Format error in "<<pointName<<'\n');
}
}
ui->value->setText(value.toString());
}
As you can see I assumed there will be some exception. But there's not, invalid formatting string results in gibberish. This is what I get if I try to format double using %s:
So is there a way to detect that invalid formatting option was selected, such as formatting number as string or vice-versa? And what if totally invalid formatting string is given?
You ask if it's possible to detect format/argument mismatch at run-time, right? Then the short and only answer is no.
To expand on that "no" it's because Variable-argument functions (functions using the ellipsis ...) have no kind of type-safety. The compiler will convert some types of arguments to others (e.g. char or short will be converted to int, float will be converted to double), and if you use a literal string for the format some compilers will be able to parse the string and check the arguments you pass.
However since you pass a variable string, that can change at run-time, the compiler have no possibility for any kind of compile-time checking, and the function must trust that the format string passed is using the correct formatting for the arguments passed. If it's not then you have undefined behavior.
It should be noted that snprintf might not actually fail when being passed mismatching format specifier and argument value.
For example if using the %d format to print an int value, but then passing a double value, the snprintf would happily extract sizeof(int) bytes from the double value, and interpret it as an int value. The value printed will be quite unexpected, but there won't be a "failure" as such. Only undefined behavior (as mentioned above).
Thus it's not really possible to detect such errors or problems at all. At least not through the code. This is something that needs proper testing and code-review to catch.
What happens when snprintf fails? When snprintf fails, POSIX requires that errno is set:
If an output error was encountered, these functions shall return a negative value and set errno to indicate the error.
Also you can find some relevant information regarding how to handle snprintf failures Here.

Powerbuilder: ImportFile of UTF-8 (Converting UTF-8 to ANSI)

My Powerbuilder version is 6.5, cannot use a higher version as this is what I am supporting.
My problem is, when I am doing dw_1.ImportFile(file) the first row and first column has a funny string like this:

Which I dont understand until I tried opening the file and saving it to a new text file and trying to import that new file.which worked flawlessly without the funny string.
My conclusion is that this is happening because the file is UTF-8 (as shown in NOTEPAD++) and the new file is Ansi. The file I am trying to import is automatically given by a 3rd party and my users dont want the extra job of doing this.
How do I force convert this files to ANSI in powerbuilder. If there is none, I might have to do a command prompt conversion, any ideas?
The weird  characters are the (optional) utf-8 BOM that tells editors that the file is utf-8 encoded (as it can be difficult to know it unless we encounter an escaped character above code 127). You cannot just rid it off because if your file contains any character above 127 (accents or any special char), you will still have garbage in your displayed data (for example: é -> é, € -> €, ...) where special characters will become from 2 to 4 garbage chars.
I recently needed to convert some utf-8 encoded string to "ansi" windows 1252 encoding. With version of PB10+, a reencoding between utf-8 and ansi is as simple as
b = blob(s, encodingutf8!)
s2 = string(b, encodingansi!)
But string() and blob() do not support encoding specification before the release 10 of PB.
What you can do is to read the file yourself, skip the BOM, ask Windows to convert the string encoding via MultiByteToWideChar() + WideCharToMultiByte() and load the converted string in the DW with ImportString().
Proof of concept to get the file contents (with this reading method, the file cannot be bigger than 2GB):
string ls_path, ls_file, ls_chunk, ls_ansi
ls_path = sle_path.text
int li_file
if not fileexists(ls_path) then return
li_file = FileOpen(ls_path, streammode!)
if li_file > 0 then
FileSeek(li_file, 3, FromBeginning!) //skip the utf-8 BOM
//read the file by blocks, FileRead is limited to 32kB
do while FileRead(li_file, ls_chunk) > 0
ls_file += ls_chunk //concatenate in loop works but is not so performant
loop
FileClose(li_file)
ls_ansi = utf8_to_ansi(ls_file)
dw_tab.importstring( text!, ls_ansi)
end if
utf8_to_ansi() is a globlal function, it was written for PB9, but it should work the same with PB6.5:
global type utf8_to_ansi from function_object
end type
type prototypes
function ulong MultiByteToWideChar(ulong CodePage, ulong dwflags, ref string lpmultibytestr, ulong cchmultibyte, ref blob lpwidecharstr, ulong cchwidechar) library "kernel32.dll"
function ulong WideCharToMultiByte(ulong CodePage, ulong dwFlags, ref blob lpWideCharStr, ulong cchWideChar, ref string lpMultiByteStr, ulong cbMultiByte, ref string lpUsedDefaultChar, ref boolean lpUsedDefaultChar) library "kernel32.dll"
end prototypes
forward prototypes
global function string utf8_to_ansi (string as_utf8)
end prototypes
global function string utf8_to_ansi (string as_utf8);
//convert utf-8 -> ansi
//use a wide-char native string as pivot
constant ulong CP_ACP = 0
constant ulong CP_UTF8 = 65001
string ls_wide, ls_ansi, ls_null
blob lbl_wide
ulong ul_len
boolean lb_flag
setnull(ls_null)
lb_flag = false
//get utf-8 string length converted as wide-char
setnull(lbl_wide)
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, 0)
//allocate buffer to let windows write into
ls_wide = space(ul_len * 2)
lbl_wide = blob(ls_wide)
//convert utf-8 -> wide char
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, ul_len)
//get the final ansi string length
setnull(ls_ansi)
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, 0, ls_null, lb_flag)
//allocate buffer to let windows write into
ls_ansi = space(ul_len)
//convert wide-char -> ansi
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, ul_len, ls_null, lb_flag)
return ls_ansi
end function

How can I calculate the complete buffer size for GetModuleFileName?

The GetModuleFileName() takes a buffer and size of buffer as input; however its return value can only tell us how many characters is has copied, and if the size is not enough (ERROR_INSUFFICIENT_BUFFER).
How do I determine the real required buffer size to hold entire file name for GetModuleFileName()?
Most people use MAX_PATH but I remember the path can exceed that (260 by default definition)...
(The trick of using zero as size of buffer does not work for this API - I've already tried before)
The usual recipe is to call it setting the size to zero and it is guaranteed to fail and provide the size needed to allocate sufficient buffer. Allocate a buffer (don't forget room for nul-termination) and call it a second time.
In a lot of cases MAX_PATH is sufficient because many of the file systems restrict the total length of a path name. However, it is possible to construct legal and useful file names that exceed MAX_PATH, so it is probably good advice to query for the required buffer.
Don't forget to eventually return the buffer from the allocator that provided it.
Edit: Francis points out in a comment that the usual recipe doesn't work for GetModuleFileName(). Unfortunately, Francis is absolutely right on that point, and my only excuse is that I didn't go look it up to verify before providing a "usual" solution.
I don't know what the author of that API was thinking, except that it is possible that when it was introduced, MAX_PATH really was the largest possible path, making the correct recipe easy. Simply do all file name manipulation in a buffer of length no less than MAX_PATH characters.
Oh, yeah, don't forget that path names since 1995 or so allow Unicode characters. Because Unicode takes more room, any path name can be preceeded by \\?\ to explicitly request that the MAX_PATH restriction on its byte length be dropped for that name. This complicates the question.
MSDN has this to say about path length in the article titled File Names, Paths, and Namespaces:
Maximum Path Length
In the Windows API (with some
exceptions discussed in the following
paragraphs), the maximum length for a
path is MAX_PATH, which is defined as
260 characters. A local path is
structured in the following order:
drive letter, colon, backslash,
components separated by backslashes,
and a terminating null character. For
example, the maximum path on drive D
is "D:\<some 256 character path
string><NUL>" where "<NUL>" represents
the invisible terminating null
character for the current system
codepage. (The characters < > are used
here for visual clarity and cannot be
part of a valid path string.)
Note File I/O functions in the
Windows API convert "/" to "\" as part
of converting the name to an NT-style
name, except when using the "\\?\"
prefix as detailed in the following
sections.
The Windows API has many functions
that also have Unicode versions to
permit an extended-length path for a
maximum total path length of 32,767
characters. This type of path is
composed of components separated by
backslashes, each up to the value
returned in the
lpMaximumComponentLength parameter of
the GetVolumeInformation function. To
specify an extended-length path, use
the "\\?\" prefix. For example,
"\\?\D:\<very long path>". (The
characters < > are used here for
visual clarity and cannot be part of a
valid path string.)
Note The maximum path of 32,767
characters is approximate, because the
"\\?\" prefix may be expanded to a
longer string by the system at run
time, and this expansion applies to
the total length.
The "\\?\" prefix can also be used
with paths constructed according to
the universal naming convention (UNC).
To specify such a path using UNC, use
the "\\?\UNC\" prefix. For example,
"\\?\UNC\server\share", where "server"
is the name of the machine and "share"
is the name of the shared folder.
These prefixes are not used as part of
the path itself. They indicate that
the path should be passed to the
system with minimal modification,
which means that you cannot use
forward slashes to represent path
separators, or a period to represent
the current directory. Also, you
cannot use the "\\?\" prefix with a
relative path, therefore relative
paths are limited to MAX_PATH
characters as previously stated for
paths not using the "\\?\" prefix.
When using an API to create a
directory, the specified path cannot
be so long that you cannot append an
8.3 file name (that is, the directory name cannot exceed MAX_PATH minus 12).
The shell and the file system have
different requirements. It is possible
to create a path with the Windows API
that the shell user interface might
not be able to handle.
So an easy answer would be to allocate a buffer of size MAX_PATH, retrieve the name and check for errors. If it fit, you are done. Otherwise, if it begins with "\\?\", get a buffer of size 64KB or so (the phrase "maximum path of 32,767 characters is approximate" above is a tad troubling here so I'm leaving some details for further study) and try again.
Overflowing MAX_PATH but not beginning with "\\?\" appears to be a "can't happen" case. Again, what to do then is a detail you'll have to deal with.
There may also be some confusion over what the path length limit is for a network name which begins "\\Server\Share\", not to mention names from the kernel object name space which begin with "\\.\". The above article does not say, and I'm not certain about whether this API could return such a path.
Implement some reasonable strategy for growing the buffer like start with MAX_PATH, then make each successive size 1,5 times (or 2 times for less iterations) bigger then the previous one. Iterate until the function succeeds.
Using
extern char* _pgmptr
might work.
From the documentation of GetModuleFileName:
The global variable _pgmptr is automatically initialized to the full path of the executable file, and can be used to retrieve the full path name of an executable file.
But if I read about _pgmptr:
When a program is not run from the command line, _pgmptr might be initialized to the program name (the file's base name without the file name extension) or to a file name, relative path, or full path.
Anyone who knows how _pgmptr is initialized? If SO had support for follow-up questions I would posted this question as a follow up.
While the API is proof of bad design, the solution is actually very simple. Simple, yet sad it has to be this way, for it's somewhat of a performance hog as it might require multiple memory allocations. Here is some keypoints to the solution:
You can't really rely on the return value between different Windows-versions as it can have different semantics on different Windows-versions (XP for example).
If the supplied buffer is too small to hold the string, the return value is the amount of characters including the 0-terminator.
If the supplied buffer is large enough to hold the string, the return value is the amount of characters excluding the 0-terminator.
This means that if the returned value exactly equals the buffer size, you still don't know whether it succeeded or not. There might be more data. Or not. In the end you can only be certain of success if the buffer length is actually greater than required. Sadly...
So, the solution is to start off with a small buffer. We then call GetModuleFileName passing the exact buffer length (in TCHARs) and comparing the return result with it. If the return result is less than our buffer length, it succeeded. If the return result is greater than or equal to our buffer length, we have to try again with a larger buffer. Rinse and repeat until done. When done we make a string copy (strdup/wcsdup/tcsdup) of the buffer, clean up, and return the string copy. This string will have the right allocation size rather than the likely overhead from our temporary buffer. Note that the caller is responsible for freeing the returned string (strdup/wcsdup/tcsdup mallocs memory).
See below for an implementation and usage code example. I have been using this code for over a decade now, including in enterprise document management software where there can be a lot of quite long paths. The code can ofcourse be optimized in various ways, for example by first loading the returned string into a local buffer (TCHAR buf[256]). If that buffer is too small you can then start the dynamic allocation loop. Other optimizations are possible but that's beyond the scope here.
Implementation and usage example:
/* Ensure Win32 API Unicode setting is in sync with CRT Unicode setting */
#if defined(_UNICODE) && !defined(UNICODE)
# define UNICODE
#elif defined(UNICODE) && !defined(_UNICODE)
# define _UNICODE
#endif
#include <stdio.h> /* not needed for our function, just for printf */
#include <tchar.h>
#include <windows.h>
LPCTSTR GetMainModulePath(void)
{
TCHAR* buf = NULL;
DWORD bufLen = 256;
DWORD retLen;
while (32768 >= bufLen)
{
if (!(buf = (TCHAR*)malloc(sizeof(TCHAR) * (size_t)bufLen))
{
/* Insufficient memory */
return NULL;
}
if (!(retLen = GetModuleFileName(NULL, buf, bufLen)))
{
/* GetModuleFileName failed */
free(buf);
return NULL;
}
else if (bufLen > retLen)
{
/* Success */
LPCTSTR result = _tcsdup(buf); /* Caller should free returned pointer */
free(buf);
return result;
}
free(buf);
bufLen <<= 1;
}
/* Path too long */
return NULL;
}
int main(int argc, char* argv[])
{
LPCTSTR path;
if (!(path = GetMainModulePath()))
{
/* Insufficient memory or path too long */
return 0;
}
_tprintf("%s\n", path);
free(path); /* GetMainModulePath malloced memory using _tcsdup */
return 0;
}
Having said all that, I like to point out you need to be very aware of various other caveats with GetModuleFileName(Ex). There are varying issues between 32/64-bit/WOW64. Also the output is not necessarily a full, long path, but could very well be a short-filename or be subject to path aliasing. I expect when you use such a function that the goal is to provide the caller with a useable, reliable full, long path, therefor I suggest to indeed ensure to return a useable, reliable, full, long absolute path, in such a way that it is portable between various Windows-versions and architectures (again 32/64-bit/WOW64). How to do that efficiently is beyond the scope here.
While this is one of the worst Win32 APIs in existance, I wish you alot of coding joy nonetheless.
My example is a concrete implementation of the "if at first you don't succeed, double the length of the buffer" approach. It retrieves the path of the executable that is running, using a string (actually a wstring, since I want to be able to handle Unicode) as the buffer. To determine when it has successfully retrieved the full path, it checks the value returned from GetModuleFileNameW against the value returned by wstring::length(), then uses that value to resize the final string in order to strip the extra null characters. If it fails, it returns an empty string.
inline std::wstring getPathToExecutableW()
{
static const size_t INITIAL_BUFFER_SIZE = MAX_PATH;
static const size_t MAX_ITERATIONS = 7;
std::wstring ret;
DWORD bufferSize = INITIAL_BUFFER_SIZE;
for (size_t iterations = 0; iterations < MAX_ITERATIONS; ++iterations)
{
ret.resize(bufferSize);
DWORD charsReturned = GetModuleFileNameW(NULL, &ret[0], bufferSize);
if (charsReturned < ret.length())
{
ret.resize(charsReturned);
return ret;
}
else
{
bufferSize *= 2;
}
}
return L"";
}
Here is a another solution with std::wstring:
DWORD getCurrentProcessBinaryFile(std::wstring& outPath)
{
// #see https://msdn.microsoft.com/en-us/magazine/mt238407.aspx
DWORD dwError = 0;
DWORD dwResult = 0;
DWORD dwSize = MAX_PATH;
SetLastError(0);
while (dwSize <= 32768) {
outPath.resize(dwSize);
dwResult = GetModuleFileName(0, &outPath[0], dwSize);
dwError = GetLastError();
/* if function has failed there is nothing we can do */
if (0 == dwResult) {
return dwError;
}
/* check if buffer was too small and string was truncated */
if (ERROR_INSUFFICIENT_BUFFER == dwError) {
dwSize *= 2;
dwError = 0;
continue;
}
/* finally we received the result string */
outPath.resize(dwResult);
return 0;
}
return ERROR_BUFFER_OVERFLOW;
}
Windows cannot handle properly paths longer than 260 characters, so just use MAX_PATH.
You cannot run a program having path longer than MAX_PATH.
My approach to this is to use argv, assuming you only want to get the filename of the running program. When you try to get the filename from a different module, the only secure way to do this without any other tricks is described already, an implementation can be found here.
// assume argv is there and a char** array
int nAllocCharCount = 1024;
int nBufSize = argv[0][0] ? strlen((char *) argv[0]) : nAllocCharCount;
TCHAR * pszCompleteFilePath = new TCHAR[nBufSize+1];
nBufSize = GetModuleFileName(NULL, (TCHAR*)pszCompleteFilePath, nBufSize);
if (!argv[0][0])
{
// resize memory until enough is available
while (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
{
delete[] pszCompleteFilePath;
nBufSize += nAllocCharCount;
pszCompleteFilePath = new TCHAR[nBufSize+1];
nBufSize = GetModuleFileName(NULL, (TCHAR*)pszCompleteFilePath, nBufSize);
}
TCHAR * pTmp = pszCompleteFilePath;
pszCompleteFilePath = new TCHAR[nBufSize+1];
memcpy_s((void*)pszCompleteFilePath, nBufSize*sizeof(TCHAR), pTmp, nBufSize*sizeof(TCHAR));
delete[] pTmp;
pTmp = NULL;
}
pszCompleteFilePath[nBufSize] = '\0';
// do work here
// variable 'pszCompleteFilePath' contains always the complete path now
// cleanup
delete[] pszCompleteFilePath;
pszCompleteFilePath = NULL;
I had no case where argv didn't contain the file path (Win32 and Win32-console application), yet. But just in case there is a fallback to a solution that has been described above. Seems a bit ugly to me, but still gets the job done.

Converting LPCWSTR with WideCharToMultiByte. Need help

i have a function like this:
BOOL WINAPI MyFunction(HDC hdc, LPCWSTR text, UINT cbCount){
char AnsiBuffer[255];
int written = WideCharToMultiByte(CP_ACP, 0, text, cbCount, AnsiBuffer , 0, NULL, NULL);
if(written > -1) AnsiBuffer[written] = '\0';
if(written>0){
ofstream myfile;
myfile.open ("C:\\example.txt", ios::app);
myfile.write(AnsiBuffer, sizeof(AnsiBuffer));
myfile.write("\n", 1);
myfile.close();
}
....
When i display the input LPCWSTR text with MessageBoxW(), the text shows up fine. When i try to convert it to multibyte, the return value looks normal (ex: 22, 45, etc), but the result is strings of gibberish (ex ÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌ). Suggestions?
I see two problems;
1) You are passing '0' to WideCharToMultiByte for the size of the multibyte buffer. If you read the documents this results in the function returning the NUMBER of bytes needed but performing no actual conversion. (This is to allow you to subsequently allocate a buffer of the correct size and recall the function).
2) in file.write sizeof(AnsiBuffer) will result in 255 bytes being written regardless of what is in the buffer. sizeof is a compile-time calculation that returns the size of a variable. You should replace this with the 'written' variable that represents the length of the string.
You need to pass the length of the buffer to the API, instead of passing 0. When you pass 0, the function returns the required length of the buffer, but doesn't write to it. You're seeing the results of the uninitialized array.
Here's the right call, with the 255 in the right place:
int written = WideCharToMultiByte(CP_ACP, 0, text, cbCount, AnsiBuffer , 255, NULL, NULL);

Resources