Unicode characters in window caption - winapi

We're having trouble setting window captions using cyrillic or japanese characters. We either see question marks or random garbage, but not the text we want. We've tried using different encodings, SetWindowText(), SetWindowTextW(), SetWindowTextA(), and so on. We can't even get it to work by passing a string literal to SetWindowText().
Our Windows install does appear to have everything it needs - Internet Explorer and Firefox do show cyrillic and japanese captions correctly, for example. So I'm pretty sure we're not finding the correct encoding/method combination. Any suggestions?

The problem you've got (at a guess) is that the top-level frame window of your application is an ANSI window. Under the hood, when you create a window (with CreateWindow() or CreateWindowEx()) a window class must be supplied. This window class determines the window's properties, including whether or not it, by default, accepts ANSI messages or Unicode messages. In turn, this is set by whether you (or your framework) register the window class by calling RegisterClassExA() or RegisterClassExW().
What's almost certainly happing is that your top-level window's class is being registered with RegisterClassExA(). This means that the default window procedure for the window will translate all Unicode strings in messages to ANSI before processing them, hence the question marks and odd characters everywhere.
The easiest solution to all this is to just make your application Unicode throughout (usually done by defining _UNICODE). The other way is to figure out where RegisterClassEx() is being called, and make it RegisterClassExW(). This may cause ANSI/Unicode issues with other messages, but it should (in theory, at least) work. Of course, either way will break Windows 9X, if that's an issue.
If all this sounds horrendously complicated, you're not wrong ...

SetWindowText()? Did you compile your application as Unicode? If not, SetWindowText() is equivalent to SetWindowTextA(), which in turn is limited to your current system locale (aka "Language for non-Unicode applications").
Also, how did you CREATE your window? Using an explicitely Unicode-aware API such as CreateWindowExW()? If not, make sure that your program is compiled as Unicode.
If your program is not compiled as Unicode, you might want to either modify your "Language for non-Unicode applications" in CP/Regional Options. Reboot required. Or more easily: Use MS AppLocale to simulate a cyrillic system locale

You have to compile your application with _UNICODE defined. Otherwise all windows will still be MBCS and not utf-16 and therefore can't show cyrillic or japanese characters if the codepage doesn't match.

Related

Has anyone heard of this strange bug with the standard Windows message box?

Years ago, I was messing around with Visual Basic and I discovered a bug with the MsgBox function. I tried searching for it, but nobody had ever said anything about it. It's not just with Visual Basic though; it's with anything that uses the standard Windows MessageBox API call.
The bug is triggered when the title text has more than one character, and the first character is a lowercase 'y' with an umlaut ('ÿ'). What's so special about this character? It almost definitely not the character itself, but rather its ASCII value that's special. 'ÿ' is character 255 (0xFF), meaning it's the highest value that can be stored in an unsigned byte, and all its bits are set to 1.
What does this bug do? Well, there are two different possibilities, which depend on the number of characters in the title text. If there are an even number of characters (unless it's 2) in the title text, no message box appears, and you just hear the alert sound. If there are two characters in the title text, or any odd number other than 1 (in which case the bug wouldn't be triggered)...then this happens:
And that's not all--the message will also be truncated to one line. It seems like the kind of bug that would occur in at least one semi-high-profile incident, considering how often this API call is used. Are there any reports of this on the Internet, or anything showing what could cause it? Maybe it's a Unicode-related glitch, like that "bush hid the facts" glitch in Notepad?
I made a program in case you want to play around with this; download it here.
Alternatively, copy the following into Notepad, save it with a .vbs extension, and double-click it to display the dialog box seen above:
MsgBox "Windows 3.1 font, anyone?", 0, "ÿ ODD NUMBER!"
Or for a different font:
MsgBox "I CAN HAS CHEEZBURGER?", 0, "ÿ HImpact"
EDIT: It seems that if the first four characters are ÿ's, it doesn't ever display the message, even if there's an odd number of characters.
This is a bug with dialog templates generally. It is not a message box bug as such.
For example, in Visual Studio create the default win32 application. In the .rc file, change the caption in the template for the about box from
CAPTION "About sampleapp"
to
CAPTION "ÿT"
and the bug will manifest itself when you display the about box.
In the DLGTEMPLATEEX documentation note that the menu and class name have type sz_Or_Ord which means either a null-terminated string or 0xFFFF followed by a single word resource identifier.
Windows incorrectly applies a similar scheme to the dialog title: if the first character is 0xFF then it treats the title as being two WORDs long, but only when it is trying to locate the font information. When it is displaying the title it correctly treats the title as a string.
In other words, Windows is looking for the font information inside the title string. In most case this won't specify a valid font, so Windows defaults to the system font.
To prove this, I constructed a dialog template in memory (based on this). Once this was working I deleted the code that writes the font information to the template and used the dialog title "ÿa\xd\x200\x21SimSun". This displays the dialog in italic SimSun because windows is reading the font information from the title string.
This bug is likely a hangover from 16-bit Windows, where (I guess) 0xFF was used as the resource ID marker.
A strange bug. I suspect the symptoms are the result of the way the MessageBox() actually displays the dialog.
Internally, MessageBox() builds a dialog template dynamically. If you look at the description of a DLGTEMPLATE structure you'll find the following nugget of information:
In a standard template for a dialog box, the DLGTEMPLATE structure is
always immediately followed by three variable-length arrays that
specify the menu, class, and title for the dialog box. When the
DS_SETFONT style is specified, these arrays are also followed by a
16-bit value specifying point size and another variable-length array
specifying a typeface name.
So, the in-memory layout of a dialog template has the font specification immediately following the dialog box title.
Visual Basic does not use Unicode and so the function you're calling is actually MessageBoxA(). This is simply a thunk that converts the passed-in strings from multibyte to Unicode and then calls MessageBoxW().
I believe what's happening is that, for some reason, the conversion of that string from multibyte to Unicode is either going wrong, or returning a spurious length value. This has the knock-on effect, when the dialog template is built in memory, of corrupting the memory immediately following the title string - which, as we know, is the font specification.

Translating shortcut keys

When extracting strings from a desktop Windows application for translation, should I translate shortcut keys as well?
In other words, should Ctrl-C copy to the clipboard even for Chinese software?
Yes, CTRL-C is universal. You can safely assume that typical CTRL-something shortcuts behave the same in all (modern) Windows versions, regardless of the language.
However, there might be several ways to present them depending on the language. For example, French would translate the name of the key (the combination remains the same).
But you are asking about Chinese (presumably Simplified Chinese), which will simply display it as CTRL-C. After all, the keyboard layout is the same (with the same symbols), all they do differently is they use so-called Input Method Editors. And although there are several different IMEs, I haven't yet seen the one that would override CTRL-C...

What is the Difference between "Csharp editor" and "Csharp editor with encoding"?

In the Open with menu of a .cs file there's Csharp editor and Csharp editor with encoding. I opened a solution with both and didn't see a difference.
What's the difference between them?
Unless your .cs file includes characters outside of the normal ASCII range, you won't see a difference in the actual contents of the file. The difference is whether or not the editor tries to detect the character encoding you saved your file with when you open it again, or asks you specifically.
By default, when you save a new .cs file, VS uses the current ANSI code page to encode the characters. (You can switch this to use UTF-8 by default with the appropriate options.) However, you can instead chose to "Save with Encoding...", which will prompt you for the specific character encoding you want to save it.
Internally, your code is being handled as UTF-16, since that's what Windows deals with as it's native string format. On-disk, however, UTF-16 would most likely blow up your source files to double their size, since most of the C# code you write probably fits into a single byte. So, when writing to disk, VS writes out your data in a particular code page that defines how to convert the UTF-16 characters into some other, possibly 8-bit character set.
When you reload a file in VS, it attempts to figure out what encoding that file was in, and if it can't, it will fall back on the current ANSI code page. (You can force it to fall back to UTF-8 via some options, but it won't ever fall back to a different encoding.)
When you reload a file "With Encoding", you get the same prompt as when you saved the file, asking you which encoding was used. This way, if Studio gets it wrong, you can fix it.
Unless you do a lot of internationalized programming, where you have foreign-language strings embedded in your .cs file from a language other than the default, you probably don't need to use the explicit "with encoding" save or loads. But, they are there if you need them.
If you open with encoding you can save with whatever character encoding is appropriate for your culture or region.

What does the charset parameter of CreateFont exactly set?

The code page on my Windows is set to ANSI (Latin1, Windows-1252).
I create a font with CreateFont and pass RUSSIAN_CHARSET in fdwCharSet
This is what I experience:
Windows controls (like a Static for example) using this font ignore the font's character set: the string passed to SetWindowTextA is displayed with Latin characters
After selecting this font on the DC, GDI text functions (Ext)TextOutA and DrawTextA use the character set of the font. Strings passed to them are displayed with cyrillic letters.
Why? When is the charset parameter of the font taken into account and when is it ignored? Can I force windows controls to use the font's character set?
You will have to convert the text to Unicode and call SetWindowTextW() instead of SetWindowTextA().
Make sure the window's class is registered with RegisterClassW() and not RegisterClassA(). This is what really determines if a window is Unicode. You can use IsWindowUnicode() to verify that the window is really Unicode.
Make sure you pass unhandled messages to DefWindowProcW() and not DefWindowProcA().
Or, if the window is a dialog, just make sure it is created with CreateDialogW() or DialogBoxParamW().
>Can I force windows controls to use the font's character set?
AFAIK no, you can't.
SetWindowTextA just converts the argument to Unicode, then calls SetWindowTextW: both windows kernel, shell, and GDI are unicode.
To convert the argument to Unicode, SetWindowTextA uses the setting from Window's regional options ("Language for non-Unicode programs").
Here is what is happening:
You call SetWindowText with something like "\xC4\xEE\xE1\xF0\xEE\xE5 xF3\xF2\xF0\xEE".
You're compiled as an ANSI application rather than Unicode, so that maps to a call to SetWindowTextA.
SetWindowTextA sees that the window was created in ANSI mode, so it sets the string directly. (If it had been a Unicode window, then it would converts the ANSI input string to Unicode and passes it onto SetWindowTextW.)
The ANSI window converts the ANSI string to Unicode so that it can display it. But it doesn't know the string is in a different code page than the default one for the system. It's this conversion that's changing everything back to Latin characters. It assumes that the input string is in the process's default code page (Windows 1252 in your case). So now you have a bunch of accented Latin characters instead of a string of Cyrillic ones.
The control tries to display this Unicode string using something like DrawTextW or TextOutW.
The lower level part of the system says, "Oh crap, this string is a bunch of Latin characters, but the user has chosen a Cyrillic font." To solve the problem, it uses font linking (or font fallback, I get those terms confused) to effectively select a font compatible with 1252.
You see Latin gobbledygook instead of proper Russian.
I tried coming up with a minimal way to do what you need, but I failed. My first idea was to do the conversion yourself and call SetWindowTextW directly:
void SetWindowTextRussian(HWND hwnd, char *pszCyrillic) {
const int cchCyrillic = ::lstrlen(pszCyrillic);
const int cchUnicode = 4 * cchCyrillic; // worst case
WCHAR *pszUnicode = new WCHAR[cchUnicode];
// See: http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx
const UINT CP_CYRILLIC = 1251;
if (::MultiByteToWideChar(CP_CYRILLIC, 0, pszCyrillic, -1,
pszUnicode, cchUnicode) > 0) {
::SetWindowTextW(hwnd, pszUnicode);
}
delete [] pszUnicode;
}
But this doesn't work. I suspect that since the window was created as an ANSI window, the Unicode string is converted back to ANSI (assuming the wrong code page again), and then you get question marks instead of Latin nonsense.
I think you're going to have to convert to a Unicode app, or only run with the default code page set to 1251.
Update: If you control the creation of the window (e.g., you call CreateWindow directly rather than having a dialog box instantiate the controls), then you can probably make the above work by calling CreateWindowW directly and creating a Unicode window for the controls that matter.
Consider hooking gdi32full.dll GetCodePage to select the code page that you need. CP_UTF8 for example. It has single pointer param, returns single DWORD (the code page) and stdcall calling convention.

How do shell text editors work?

I'm fairly new at programming, but I've wondered how shell text editors such as vim, emacs, nano, etc are able to control the command-line window. I'm primarily a Windows programmer, so maybe it's different on *nix. As far as I know, it's only possible to print text to a console, and ask for input. How do text editors create a navigable, editable window in a command line environment?
By using libraries such as the following which, in turn, use escape character sequences
NAME
ncurses - CRT screen handling and optimization package
SYNOPSIS
#include
DESCRIPTION
The ncurses library routines give the user a terminal-independent
method of updating character screens with reasonable optimization. This
implementation is ‘‘new curses’’ (ncurses) and is the approved replacement
for 4.4BSD classic curses, which has been discontinued.
[...snip....]
The ncurses package supports: overall screen, window and pad
manipulation; output to windows and pads; reading terminal input; control
over terminal and curses input and output options; environment query
routines; color manipulation; use of soft label keys; terminfo capabilities;
and access to low-level terminal-manipulation routines.
Short answer: there are libraries for it (like curses, slang).
Longer answer: doing things like jumping around with the cursor or changing colors are done by printing special character sequences (called escape-secquences, because they start with the ESC character).
Learning about ncurses might be a good starting point.
There is an old protocol called vt100 based on a "VT100" terminal. It used codes starting with escape to control cursor position, color, clearing the screen, etc.
It's also how you get colored prompts.
Google VT100 or "terminal escape codes"
edit: I Googled it for you: http://www.termsys.demon.co.uk/vtansi.htm
You will also notice this if you type "edit" in a Windows command line console. This "feature" is not unique to unix-like systems, though the concepts for manipulating the windows console in that way are quite different to in unix.
On Unix systems, a console window emulates an ancient serial terminal (usually a VT100). You can print special control characters and escape sequences to move the cursor around, change colors, and do other special effects. There are libraries to help handle the details; ncurses is the most popular.
On Windows, the [Win32 Console API](http://msdn.microsoft.com/en-us/library/ms682073(VS.85%29.aspx) provides similar functionality, but in a rather different manner.
Type "c:\winnt\system32\edit" or "c:\windows\system32\edit" at the command line, and you'll be shown a command line text editor.
People are mostly right about the ESC character being used to control the command screen, but some older programs also write characters directly to the memory space used by the Windows Command Line screen.
In order to control the command line window, you used to have to write your own windowing forms, entry box, menus, etc. You'd also have to wrap all that up in a big loop for handling events.
More Windows command line specific, the app typically calls DOS or BIOS functions that do the same. Sometimes ANSI command code support is available, sometimes it isn't (depending on exact MS OS version and whether or not it's configured to load it).

Resources