Most WinAPI calls have both Unicode and ANSI variants. For example:
function MessageBoxA(hWnd: HWND; lpText, lpCaption: LPCSTR; uType: UINT): Integer; stdcall; external user32;
function MessageBoxW(hWnd: HWND; lpText, lpCaption: LPCWSTR; uType: UINT): Integer; stdcall; external user32;
When should I use the ANSI function rather than calling the Unicode function?
Just as (rare) exceptions to the posted comments/answers...
One may choose to use the ANSI calls in cases where UTF-8 is expected and supported. For example, WriteConsoleA'ing UTF-8 strings in a console set to use a TrueType font and running under chcp 65001.
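A minimal sketch of that scenario, assuming the console already uses a TrueType font (the string content is just an illustration, with U+2713 spelled out as UTF-8 bytes):
#include <windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);   // same effect as running "chcp 65001" first
    const char utf8[] = "\xE2\x9C\x93 UTF-8 written through the \"A\" call\n";
    DWORD written = 0;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), utf8,
                  (DWORD)(sizeof(utf8) - 1), &written, NULL);
    return 0;
}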
Another oddball exception is functions that are primarily implemented as ANSI, where the Unicode "W" variant simply converts to a narrow string in the active code page and calls the "A" counterpart. For such a function, when a narrow string is already available, calling the "A" variant directly saves a redundant double conversion. A case in point is OutputDebugString, which fell into this category until Windows 10 (I just noticed that https://msdn.microsoft.com/en-us/library/windows/desktop/aa363362.aspx mentions that a call to WaitForDebugEventEx - only available since Windows 10 - enables true Unicode output for OutputDebugStringW).
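A tiny sketch of that shortcut (the message text is made up): when the string is already narrow and in the active code page, the "A" entry point avoids the internal wide-to-narrow conversion on pre-Windows-10 systems.
#include <windows.h>

int main() {
    const char *msg = "already narrow, already in the ANSI code page\n";
    OutputDebugStringA(msg);   // no W->A round trip on older Windows
    return 0;
}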
Then there are APIs which, even though dealing with strings, are natively ANSI. For example GetProcAddress only exists in the ANSI variant which takes a LPCSTR argument, since names in the export tables are narrow strings.
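A small sketch of that point (the module and export names are only examples): even in a Unicode build, the export name passed to GetProcAddress is a narrow LPCSTR.
#include <windows.h>

int main() {
    HMODULE k32 = GetModuleHandleW(L"kernel32.dll");    // wide module name
    FARPROC p = GetProcAddress(k32, "GetTickCount64");  // narrow export name (LPCSTR)
    return p != NULL ? 0 : 1;
}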
That said, by and large most string-related APIs are natively Unicode and one is encouraged to use the "W" variants. Not all of the newer APIs even have an "A" variant any longer (e.g. CommandLineToArgvW). From the horse's mouth, https://msdn.microsoft.com/en-us/library/windows/desktop/ff381407.aspx:
Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters.
[...]
When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings.
[...]
Internally, the ANSI version translates the string to Unicode. The Windows headers also define a macro that resolves to the Unicode version when the preprocessor symbol UNICODE is defined or the ANSI version otherwise.
[...]
Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
The simplest rule to follow is this: only use the ANSI variants on systems that do not have the Unicode variant. That is, on Windows 95, 98 and ME, which are the versions of Windows that do not support Unicode.
These days, it is exceptionally unlikely that you will be targeting such versions, and so in all probability you should always just use the Unicode variants.
Does the Windows API RegGetValue require a direct descendant for its lpSubKey parameter?
Will this work?
RegGetValue(HKEY_LOCAL_MACHINE,
            L"Software\\Microsoft\\Windows NT\\CurrentVersion", L"ProductName",
            RRF_RT_REG_SZ, NULL, outdata, &outdata_size);
Edit: I had a leading slash \\ and Windows doesn't like it! Also converted UTF-8 strings to UTF-16 wide strings (Windows-Style).
Does the Windows API RegGetValue require a direct descendant for its lpSubKey parameter?
No, it does not.
You can specify a path just as you've shown. You also don't need the leading path separator (\\).
But the code you've shown may or may not work. Not because it specifies a path to the string, but because you're probably mixing Unicode and ANSI strings. From your user name (unixman), I assume that you're relatively new to Windows programming, so it's worth noting that Windows applications are entirely Unicode and have been for more than a decade. You should therefore always compile your code as Unicode and prefix string literals with L (to indicate a wide, or Unicode, string).
Likewise, make sure that outdata is declared as an array of wchar_t.
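A minimal sketch of the corrected call under those assumptions (buffer size and error handling are mine; link against advapi32):
#include <windows.h>
#include <stdio.h>

int main() {
    wchar_t outdata[256];
    DWORD outdata_size = sizeof(outdata);   // size in bytes, not characters
    LONG rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                           L"Software\\Microsoft\\Windows NT\\CurrentVersion",
                           L"ProductName",
                           RRF_RT_REG_SZ, NULL, outdata, &outdata_size);
    if (rc == ERROR_SUCCESS)
        wprintf(L"%ls\n", outdata);
    return 0;
}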
What is the difference in calling the Win32 API functions that have an A character appended to the end as opposed to the W character?
I know it means ASCII and WIDE CHARACTER or Unicode, but what is the difference in the output or the input?
For example, if I call GetDefaultCommConfigA, will it fill my COMMCONFIG structure with ASCII strings instead of WCHAR strings? (Or vice versa for GetDefaultCommConfigW.)
In other words, how do I know what encoding the string is in, ASCII or Unicode? It must be determined by the version of the function I call, A or W. Correct?
I have found this question, but I don't think it answers my question.
The A functions use ANSI (not ASCII) strings as input and output, and the W functions use Unicode strings instead (UCS-2 on NT4 and earlier, UTF-16 on W2K and later). Refer to MSDN for more details.
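For instance, a hedged sketch of the two spellings (the port name is only an example; if I recall correctly, COMMCONFIG itself contains no string members, so here the suffix mainly decides the type of the name you pass in):
#include <windows.h>

int main() {
    COMMCONFIG cc = { 0 };
    DWORD size = sizeof(cc);
    cc.dwSize = sizeof(cc);
    GetDefaultCommConfigW(L"COM1", &cc, &size);    // wide (UTF-16) port name
    // GetDefaultCommConfigA("COM1", &cc, &size);  // narrow (ANSI) port name
    return 0;
}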
I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode. I thought Unicode could be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?
I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?
Many functions in the Windows API come in two versions: One that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
Each of these function pairs also has a macro without the suffix that resolves to one or the other, depending on whether the UNICODE macro is defined.
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
This, however, was a bad idea. You should always explicitly specify the character type.
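A short sketch of what "explicitly specify the character type" looks like in practice (the message text is arbitrary):
#include <windows.h>

int main() {
    // Call the "W" variant directly with wide literals instead of relying on
    // the TCHAR / MessageBox macros and the UNICODE define.
    MessageBoxW(NULL, L"Hello, wide world", L"Explicit W call", MB_OK);
    return 0;
}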
What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?
MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.
But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
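A sketch of that manual round trip, assuming the input really is UTF-8 (the helper name Utf8ToWide is mine and error handling is omitted):
#include <windows.h>
#include <string>

std::wstring Utf8ToWide(const std::string& s) {
    // First call computes the required length (including the terminator for -1).
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
    w.resize(n > 0 ? n - 1 : 0);   // drop the trailing L'\0'
    return w;
}

int main() {
    std::string utf8Caption = "text assumed to be UTF-8 (ASCII here, but any UTF-8 works)";
    MessageBoxW(NULL, Utf8ToWide(utf8Caption).c_str(),
                L"Converted with MultiByteToWideChar", MB_OK);
    return 0;
}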
Update: As of Windows 10 version 1903 (the May 2019 update), UTF-8 is now supported as an "ANSI" code page. Thus, my original (2010) recommendation to always use the "W" functions no longer applies, unless you need to support old versions of Windows. See UTF-8 Everywhere for text-handling advice.
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode.
MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different version according to these two macros:
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
To explain the difference between these string-length-counting functions, consider the following example. Suppose you have a machine running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile a GBK-encoded source file and run it:
printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", strlen("I爱你M"));
printf("%d\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4, 6, 3.
Here is the hexadecimal representation of I爱你M in GBK:
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows this string is encoded in GBK, so it can interpret the string correctly and gets the right answer of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.
strlen only looks for the terminating 0x00 byte, so it gets 6.
wcslen treats this byte array as if it were encoded in UTF-16LE and counts two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.
As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00, so the result is not guaranteed to be 3 unless the byte that follows happens to be 00.
MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.
The ANSI / ASCII character sets are not multi-byte.
UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).
However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:
UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
It is therefore not correct that Unicode is a 16-bit character encoding. It's rather something like a 21-bit encoding, as it encompasses a character set with code points from U+0000 up to U+10FFFF.
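To make the difference between the encodings concrete, here is a small sketch (the euro sign U+20AC is just a convenient test character; on Windows, wchar_t is a UTF-16 code unit):
#include <cstdio>
#include <cstring>
#include <cwchar>

int main() {
    const char    utf8[]  = "\xE2\x82\xAC";   // U+20AC as UTF-8: three bytes
    const wchar_t utf16[] = L"\x20AC";        // U+20AC as UTF-16: one 16-bit unit
    std::printf("UTF-8 bytes: %u, UTF-16 code units: %u\n",
                (unsigned)std::strlen(utf8),
                (unsigned)std::wcslen(utf16));   // prints 3 and 1
    return 0;
}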
As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.
As to the phrases "Unicode" and "Multi-Byte Character Set", people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. That is, they mean something less general and more particular-to-Windows than one might expect if coming from a non-Microsoft-specific understanding of text internationalization. Those exact phrases show up and tend to get their own separate sections/subsections of Microsoft technical documents, e.g. in Text and Strings in Visual C++.
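As a quick illustration of those generic-text mappings (the names come from tchar.h; whether this compiles as wide or narrow depends on the project's _UNICODE / _MBCS settings):
#include <tchar.h>
#include <stdio.h>

int main() {
    // _TCHAR and _T() expand to wchar_t / L"..." when _UNICODE is defined,
    // and to char / "..." otherwise; _tcsclen maps to strlen, _mbslen or wcslen.
    const _TCHAR *s = _T("hello");
    printf("%u\n", (unsigned)_tcsclen(s));
    return 0;
}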
I had a question about string normalization and it was already answered, but the problem is that I cannot correctly normalize Korean characters that require 3 keystrokes.
With the input "ㅁㅜㄷ"(from keystrokes "ane"), it comes out "무ㄷ" instead of "묻".
With the input "ㅌㅐㅇ"(from keystrokes "xod"), it comes out "태ㅇ" instead of "탱".
This is Mr. Dean's answer, and while it worked on the example I gave at first, it doesn't work with the ones I cited above.
If you are using .NET, the following will work:
var s = "ㅌㅐㅇ";
s = s.Normalize(NormalizationForm.FormKC);
In native Win32, the corresponding call is NormalizeString:
const wchar_t *input = L"ㅌㅐㅇ";
wchar_t output[100];
NormalizeString(NormalizationKC, input, -1, output, 100);
NormalizeString is only available in Windows Vista+. You need the "Microsoft Internationalized Domain Name (IDN) Mitigation APIs" installed if you want to use it on XP (why it's in the IDN download, I don't understand...)
Note that neither of these methods actually requires use of the IME - they work regardless of whether you've got the Korean IME installed or not.
This is the code I'm using in Delphi (on XP):
var
  buf: array [0..20] of char;
  temporary: PWideChar;
const
  NORMALIZATIONKC = 5;
...
temporary := 'ㅌㅐㅇ';
NormalizeString(NORMALIZATIONKC, temporary, -1, buf, 20);
ShowMessage(buf);
Is this a bug? Is there something incorrect in my code?
Does the code run correctly on your computer? In what language? What Windows version are you using?
The jamo you're using (ㅌㅐㅇ) are in the block called Hangul Compatibility Jamo, which is present due to legacy code pages. If you were to take your target character and decompose it (using NFKD), you would get jamo from the block Hangul Jamo (ᄐ ᅢ ᆼ, sans the spaces, which are just there to prevent the browser from normalizing them), and these can be re-composed just fine.
Unicode 5.2 states:
When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters.
(...)
Table 12-11 illustrates how two Hangul compatibility jamo can be separated in display, even after transforming them with NFKD or NFKC.
This suggests that NFKC should combine them correctly by treating them as regular Jamo, but Windows doesn't appear to be doing that. However, using NFKD does appear to convert them to the normal Jamo, and you can then run NFKC on it to get the right character.
Since those characters appear to come from an external program (the IME), I would suggest you either do a manual pass to convert those compatibility Jamo, or start by doing NFKD, then NFKC. Alternatively, you may be able to reconfigure the IME to output "normal" Jamo instead of compatibility Jamo.
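A hedged Win32 sketch of that NFKD-then-NFKC workaround (buffer sizes are arbitrary, error checking is omitted, and you need to link against Normaliz.lib):
#include <windows.h>

int main() {
    const wchar_t *input = L"ㅌㅐㅇ";          // compatibility jamo from the IME
    wchar_t decomposed[100], recomposed[100];
    // NFKD turns the compatibility jamo into conjoining jamo...
    NormalizeString(NormalizationKD, input, -1, decomposed, 100);
    // ...which NFKC can then compose into the precomposed syllable (탱).
    NormalizeString(NormalizationKC, decomposed, -1, recomposed, 100);
    return 0;
}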