I am reading about the character set and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode. I thought Unicode could be encoded with different encodings, such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?
I noticed that there are two compiler flags in Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?
Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
Each of these function pairs also has a suffixless macro whose definition depends on whether the UNICODE macro is defined:
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
This, however, was a bad idea. You should always explicitly specify the character type.
What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?
MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.
But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
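For illustration, here is a minimal sketch of that round trip (the helper name is mine, and error handling is omitted):

#include <windows.h>
#include <string>

// Convert a UTF-8 string to UTF-16 and pass it to a "W" function.
void showUtf8Message(const std::string& utf8)
{
    // First call computes the required buffer size (in wchar_t units).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    MessageBoxW(nullptr, wide.c_str(), L"Demo", MB_OK);
}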
This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
Update: As of Windows 10 version 1903 (the May 2019 update), UTF-8 is supported as an "ANSI" code page. Thus, my original (2010) recommendation to always use the "W" functions no longer applies, unless you need to support old versions of Windows. See UTF-8 Everywhere for text-handling advice.
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode.
MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros is defined:
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
To explain the difference between these string-length-counting functions, consider the following example. Suppose you have a machine running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile and run this GBK-encoded source file:
printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", strlen("I爱你M"));
printf("%d\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4 6 3.
Here is the hexadecimal representation of I爱你M in GBK:
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows this string is encoded in GBK, so it interprets the bytes correctly and gets the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.
strlen counts every byte up to the first 0x00, so it gets 6.
wcslen assumes the array is encoded in UTF-16LE and counts two bytes as one character, so it gets 3 characters: 49 B0, AE C4, E3 4D.
As #xiaokaoy pointed out, the only valid terminator for wcslen is 00 00, so the result is not guaranteed to be 3 if the byte following the string is not 00.
MBCS means Multi-Byte Character Set and describes any character set in which a character can be encoded in more than one byte.
The ANSI / ASCII character sets are not multi-byte.
UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).
However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:
UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
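To make the difference concrete, here is a small sketch (the byte values are the standard encodings of U+7231):

#include <cstdio>
int main()
{
    // U+7231 (爱) takes 3 bytes in UTF-8 but a single 16-bit unit in UTF-16.
    const char     utf8[]  = "\xE7\x88\xB1";
    const char16_t utf16[] = u"\u7231";
    printf("UTF-8: %zu bytes, UTF-16: %zu code units\n",
           sizeof(utf8) - 1, sizeof(utf16) / sizeof(char16_t) - 1);
}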
It is therefore not correct that Unicode is a 16-bit character encoding. It is rather a 21-bit coded character set, as it encompasses code points from U+0000 up to U+10FFFF.
As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.
As to the phrases "Unicode" and "Multi-Byte Character Set", people have already described what their effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things; they mean something less general and more particular to Windows than one might expect coming from a non-Microsoft-specific understanding of text internationalization. Those exact phrases show up and tend to get their own separate sections/subsections of Microsoft technical documents, e.g. in Text and Strings in Visual C++.
Related
I'm trying to find information about the encoding behind L"" strings.
https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=msvc-160
I know the encoding of wchar_t is implementation-defined, so it could correspond to any encoding. But what happens if I use an L"" string? Even the docs leave that information out.
auto s2 = L"hello"; // const wchar_t* <-- it's undefined but why?
auto s3 = u"hello"; // const char16_t*, encoded as UTF-16
auto s4 = U"hello"; // const char32_t*, encoded as UTF-32
wchar_t is a standard type, but its exact implementation is left to individual compilers. Microsoft decided back when Unicode all fit into 16-bit quantities that wchar_t would be 2 bytes in size, and Windows would use UCS-2. Later, when Unicode exceeded 16-bit quantities, Windows was updated to use UTF-16, and since Windows operated on little-endian processors, that made it UTF-16LE. wchar_t remained 2 bytes in size, which can handle UTF-16 values, using surrogate pairs for Unicode codepoints above U+FFFF.
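A small sketch of the consequence, assuming MSVC (where wchar_t is 2 bytes):

#include <cstdio>
int main()
{
    // U+1F600 is above U+FFFF, so UTF-16 stores it as a surrogate pair.
    const wchar_t* s = L"\U0001F600";
    // On Windows this prints D83D DE00, the high and low surrogates.
    printf("%04X %04X\n", (unsigned)s[0], (unsigned)s[1]);
}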
When testing my code, which uses a routine returning the "ASCII value" of a character to decide which characters to show, my program should drop control characters but keep characters that may be entered by the user. It seems that while the routine is called "ascii", it does not just return ASCII values: giving it the character ƒ returns 402.
For example, I have found this web site,
but it doesn't have ƒ 402 that I can see.
I need to know whether there are other codes above 402 that I should test my code with. The character set used internally by the software that "ascii" is written in is UCS-2. The web site I found doesn't mention UCS-2.
There are probably many interpretations of »Control Character« out there, but I'll assume you mean C0 and C1 control characters (includes references to the relevant Unicode Standards).
The commonly used 32-bit integer representation of Unicode characters in general is the codepoint notation: »U+« followed by an at-least-4-digit hex number, which you will find near mentions of characters, e.g. »U+007F (delete)«. The result of your »ASCII value« routine is probably this number without the »U+«: ƒ is U+0192, and hex 192 is 402 in decimal.
UCS-2 is a specific encoding for Unicode characters (which you probably won't need to care about directly) and is equivalent to Unicode codepoints only for characters within the range of the BMP.
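So for the stated goal (drop control characters, keep everything the user may type), a codepoint range check is enough. A minimal sketch (the function name is mine):

#include <cstdint>

// Drop C0 (U+0000-U+001F, U+007F) and C1 (U+0080-U+009F) controls;
// everything else, including ƒ (U+0192, decimal 402), passes through.
bool isControlChar(std::uint32_t cp)
{
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}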
Most WinAPI calls have both an ANSI and a Unicode version.
For example:
function MessageBoxA(hWnd: HWND; lpText, lpCaption: LPCSTR; uType: UINT): Integer; stdcall; external user32;
function MessageBoxW(hWnd: HWND; lpText, lpCaption: LPCWSTR; uType: UINT): Integer; stdcall; external user32;
When should I use the ANSI function rather than calling the Unicode function?
Just as (rare) exceptions to the posted comments/answers...
One may choose to use the ANSI calls in cases where UTF-8 is expected and supported. For example, WriteConsoleA'ing UTF-8 strings in a console set to use a TrueType font and running under chcp 65001.
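A minimal sketch of that case (assuming the console is already using a TrueType font):

#include <windows.h>
int main()
{
    SetConsoleOutputCP(CP_UTF8);  // the programmatic equivalent of chcp 65001
    const char utf8[] = "I\xE7\x88\xB1\xE4\xBD\xA0M\n";  // "I爱你M" in UTF-8
    DWORD written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), utf8,
                  sizeof(utf8) - 1, &written, nullptr);
}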
Another oddball exception is functions that are primarily implemented as ANSI, where the Unicode "W" variant simply converts to a narrow string in the active codepage and calls the "A" counterpart. For such a function, when a narrow string is available, calling the "A" variant directly saves a redundant double conversion. A case in point is OutputDebugString, which fell into this category until Windows 10 (I just noticed https://msdn.microsoft.com/en-us/library/windows/desktop/aa363362.aspx, which mentions that a call to WaitForDebugEventEx - only available since Windows 10 - enables true Unicode output for OutputDebugStringW).
Then there are APIs which, even though they deal with strings, are natively ANSI. For example, GetProcAddress only exists in an ANSI variant taking an LPCSTR argument, since names in export tables are narrow strings.
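For instance (a sketch; the function-pointer typedef is mine):

#include <windows.h>

typedef int (WINAPI* MessageBoxWPtr)(HWND, LPCWSTR, LPCWSTR, UINT);

int main()
{
    HMODULE user32 = LoadLibraryW(L"user32.dll");
    if (!user32) return 1;
    // Note the narrow string literal: there is no GetProcAddressW.
    auto p = (MessageBoxWPtr)GetProcAddress(user32, "MessageBoxW");
    if (p) p(nullptr, L"Hello", L"Demo", MB_OK);
    FreeLibrary(user32);
}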
That said, by and large most string-related APIs are natively Unicode, and one is encouraged to use the "W" variants. Not all of the newer APIs even have an "A" variant any longer (e.g. CommandLineToArgvW). From the horse's mouth, https://msdn.microsoft.com/en-us/library/windows/desktop/ff381407.aspx:
Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters.
[...]
When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings.
[...]
Internally, the ANSI version translates the string to Unicode. The Windows headers also define a macro that resolves to the Unicode version when the preprocessor symbol UNICODE is defined or the ANSI version otherwise.
[...]
Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
[ NOTE ] The post was edited to add the last two paragraphs.
The simplest rule to follow is this: only use the ANSI variants on systems that do not have the Unicode variant; that is, on Windows 95, 98, and ME, which are the versions of Windows that do not support Unicode.
These days, it is exceptionally unlikely that you will be targeting such versions, and so in all probability you should always just use the Unicode variants.
I am using IMultilanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), I am getting an escape sequence (e.g. 0x1B, 0x24, 0x29, 0x43 for codepage 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.
I am building a MIME message, so the encoding is specified in the header itself and the escape prefix is displayed as-is.
Is there a way to convert without the prefix?
Thank you!
I don't really see a problem here. That is a valid byte sequence in ISO 2022:
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7F. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: selects a 94n-character set to be used for G1.
As F is 0x43 (C), this byte sequence tells a decoder to switch to ISO-2022-KR:
Character encodings using ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for Korean.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
In this case, you have to specify iso-2022-kr as the charset in a MIME Content-Type or RFC 2047-encoded header. But an ISO 2022 decoder still has to be able to switch charsets dynamically while decoding, so it is valid for the data to include an initial switch sequence to the Korean charset.
Is there a way to convert without the prefix?
Not with IMultiLanguage2 or WideCharToMultiByte(), no. They have no clue how you are going to use their output, so it makes sense that they include an initial switch sequence to the Korean charset: a decoder without access to charset info from MIME (or another source) would still know which charset to use initially.
When you put the data into a MIME message, you will have to manually strip off the charset switch sequence when you set the MIME charset to iso-2022-kr. If you do not want to strip it manually, you will have to find (or write) a Unicode encoder that does not output that initial switch sequence.
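A minimal sketch of the manual stripping (the helper name is mine, and it assumes the Korean designation ESC $ ) C appears at the very start of the output):

#include <string>

// Remove a leading ISO 2022 "designate KS X 1001 to G1" sequence, if present.
std::string stripKoreanDesignation(const std::string& bytes)
{
    const std::string prefix = "\x1B\x24\x29\x43";  // ESC $ ) C
    if (bytes.compare(0, prefix.size(), prefix) == 0)
        return bytes.substr(prefix.size());
    return bytes;
}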
That was a red herring - it turned out the escape sequence is necessary. The problem was with my code, which was trimming the names and addresses using Delphi's Trim() function; Trim() removes all characters less than or equal to space (0x20), and that includes the escape character (0x1B).
Switching to my own trimming function that removes only spaces fixed the problem.
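For reference, a space-only trim might look like this in C++ (a sketch; the original fix was in Delphi):

#include <string>

// Trim only the space character, leaving ESC (0x1B) and other controls intact.
std::string trimSpaces(const std::string& s)
{
    const auto b = s.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}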
I have a string that I have retrieved from an HTML page using Boost's regex_search(). Unfortunately, the Japanese characters in the page are written as \u escape codes, and these are interpreted by regex_search as ordinary characters in the string.
So, my question is, how does one go about converting these codes to normal Unicode text? (UTF-8 obviously)
This is a fundamental issue with fstream having absolutely no regard for UTF-8. It looks like Boost has its own implementation of fstream, but changing to it had no effect on my program, and I couldn't find any extra settings to configure Boost's fstream to work with UTF-8 (although today is my first day ever working with Boost, so I could have missed it).
As a final note: I'm running this on linux, but I'd certainly appreciate a portable solution over a system-specific one.
Thanks all, I really appreciate the help :D
fstream is a narrow-character only stream (it's a typedef to basic_fstream<char>). std::wfstream would be the type you're looking for, although to be perfectly portable to, for example, Windows, you may have to introduce C++11 dependencies (Windows has no Unicode locales, but supports locale-independent Unicode conversions introduced by C++11. GCC on Linux doesn't support the new Unicode conversions, but has plenty of Unicode locales to choose from) or rely on boost.locale.
Your steps would be:
parse the string to obtain the hexadecimal values of the code points
store them as wide characters.
write them to a std::wofstream (or convert to UTF-8 first, and then write to std::ofstream)
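For the first two steps, a minimal sketch (assuming every escape has the exact form \uXXXX with four hex digits, and that all code points fit in a wchar_t):

#include <string>
#include <cstdlib>

// Turn "\u65e5\u672c\u8a9e"-style escapes into wide characters.
std::wstring decodeEscapes(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); )
    {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u')
        {
            out += static_cast<wchar_t>(
                std::strtoul(in.substr(i + 2, 4).c_str(), nullptr, 16));
            i += 6;
        }
        else
        {
            out += static_cast<wchar_t>(in[i++]);
        }
    }
    return out;
}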
To illustrate the last step:
#include <fstream>
#include <locale>
int main()
{
    std::locale::global(std::locale("en_US.utf8")); // any utf8 works
    std::wofstream f("test.txt");
    f.imbue(std::locale());
    f << wchar_t(0x65e5) << wchar_t(0x672c) << wchar_t(0x8a9e) << '\n';
}
produces a file (on Linux) that contains e6 97 a5 e6 9c ac e8 aa 9e 0a