What is the encoding behind L"" in Windows?

I'm trying to find any information about the encoding behind L"" strings.
https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=msvc-160
I know wchar_t's encoding is implementation-defined, because it can differ between compilers and platforms. But what happens if I use an L"" string? Even the docs leave that information out.
auto s2 = L"hello"; // const wchar_t* <-- the encoding isn't specified, but why?
auto s3 = u"hello"; // const char16_t*, encoded as UTF-16
auto s4 = U"hello"; // const char32_t*, encoded as UTF-32

wchar_t is a standard type, but its exact implementation is left to individual compilers. Microsoft decided, back when all of Unicode fit into 16-bit quantities, that wchar_t would be 2 bytes in size and that Windows would use UCS-2. Later, when Unicode outgrew 16 bits, Windows was updated to use UTF-16, and since Windows runs on little-endian processors, that means UTF-16LE. wchar_t remained 2 bytes in size, which is enough to hold a UTF-16 code unit; code points above U+FFFF are represented with surrogate pairs (two wchar_t units).
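As a concrete illustration, here is a minimal sketch assuming the Microsoft toolchain, where wchar_t is 16 bits and L"" literals are stored as UTF-16LE; a code point above U+FFFF occupies two wchar_t units:
#include <cstdio>

int main() {
    static_assert(sizeof(wchar_t) == 2, "assumes MSVC's 16-bit wchar_t");
    const wchar_t* s = L"A\U0001F600";   // 'A' followed by U+1F600
    // Expected output with MSVC: 0x0041 0xD83D 0xDE00 (surrogate pair for U+1F600)
    for (const wchar_t* p = s; *p; ++p)
        std::printf("0x%04X ", static_cast<unsigned>(*p));
    std::printf("\n");
}
On a platform with a 32-bit wchar_t (e.g. Linux/glibc) the static_assert fails, because there the same literal would be stored as UTF-32 instead.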

Related

IMultiLanguage2::ConvertStringFromUnicode - how to avoid compound prefix?

I am using IMultilanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), I am getting an escape sequence (e.g. 0x1B, 0x24, 0x29, 0x43 for codepage 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.
I am building a MIME message, so the encoding is specified in the header itself and the escape prefix is displayed as-is.
Is there a way to convert without the prefix?
Thank you!
I don't really see a problem here. That is a valid byte sequence in ISO 2022:
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7F. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: selects a 94n-character set to be used for G1.
As F is 0x43 (C), this byte sequence tells a decoder to switch to ISO-2022-KR:
Character encodings using ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for Korean.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
In this case, you have to specify iso-2022-kr as the charset in a MIME Content-Type or RFC2047-encoded header. But an ISO 2022 decoder still has to be able to switch charsets dynamically while decoding, so it is valid for the data to include an initial switch sequence to the Korean charset.
Is there a way to convert without the prefix?
Not with IMultiLanguage2 or WideCharToMultiByte(), no. They have no clue how you are going to use their output, so it makes sense that they include an initial switch sequence to the Korean charset - that way a decoder without access to charset info from MIME (or another source) would still know which charset to use initially.
When you put the data into a MIME message, you will have to manually strip off the charset switch sequence when you set the MIME charset to iso-2022-kr. If you do not want to strip it manually, you will have to find (or write) a Unicode encoder that does not output that initial switch sequence.
That was a red herring - it turned out the escape sequence is necessary. The problem was with my code, which was trimming the names and addresses using the Delphi Trim() function; Trim() removes all characters less than or equal to space (0x20), and that includes the escape character (0x1B).
Switching to my own trimming function that removes only spaces fixed the problem (a sketch of the idea follows below).
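For illustration, a rough C++ equivalent of that fix (the original code is Delphi, and the helper name here is made up): trim only ASCII spaces from both ends so that control bytes such as ESC (0x1B) at the start of an ISO-2022 sequence survive.
#include <string>

// Remove only the space character (0x20) from both ends; unlike a
// "trim everything <= 0x20" helper, this leaves a leading ESC (0x1B) intact.
std::string TrimSpacesOnly(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(' ');
    if (first == std::string::npos)
        return std::string();
    const std::string::size_type last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}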

Does V8 have Unicode support?

I'm using V8 to run JavaScript from native (C++) code. To call a JavaScript function I need to convert all the parameters to V8 data types.
For example, code to convert a char* to a V8 data type:
char* value;
...
v8::String::New(value);
Now, I need to pass Unicode characters (wchar_t) to JavaScript.
First of all, does V8 support Unicode characters? If yes, how do I convert a wchar_t/std::wstring to a V8 data type?
I'm not sure if this was the case at the time this question was asked, but at the moment the V8 API has a number of functions which support UTF-8, UTF-16 and Latin-1 encoded text:
https://github.com/v8/v8/blob/master/include/v8.h
The relevant functions to create new string objects are:
String::NewFromUtf8 (UTF-8 encoded, obviously)
String::NewFromOneByte (Latin-1 encoded)
String::NewFromTwoByte (UTF-16 encoded)
Alternatively, you can avoid copying the string data and construct a V8 string object that refers to existing data (whose lifecycle you control):
String::NewExternalOneByte (Latin-1 encoded)
String::NewExternalTwoByte (UTF-16 encoded)
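As a sketch of how that can be used from C++ on Windows (treat the exact signatures as an assumption, since they have changed between V8 versions, and the helper name is mine), a std::wstring can be handed to String::NewFromTwoByte because wchar_t is 16 bits there and holds UTF-16 code units:
#include <cstdint>
#include <string>
#include <v8.h>

v8::Local<v8::String> FromWString(v8::Isolate* isolate, const std::wstring& s) {
    static_assert(sizeof(wchar_t) == sizeof(uint16_t),
                  "assumes a 16-bit wchar_t (true on Windows)");
    // Reinterpret the UTF-16 code units; V8 copies the data into its own heap.
    return v8::String::NewFromTwoByte(
               isolate,
               reinterpret_cast<const uint16_t*>(s.data()),
               v8::NewStringType::kNormal,
               static_cast<int>(s.size()))
        .ToLocalChecked();
}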
Unicode just maps characters to numbers (code points). What you need is a proper encoding, like UTF-8 or UTF-16.
V8 seems to support UTF-8 (v8::String::WriteUtf8) and a not-further-described 16-bit type (Write). I would give it a try and write some UTF-16 into it.
In Unicode applications, Windows stores UTF-16 in std::wstring. Maybe you could try something like
std::wstring yourString;
v8::String::New(reinterpret_cast<const uint16_t*>(yourString.c_str())); // old API; wchar_t is 16-bit (UTF-16) on Windows
No, it doesn't accept wchar_t directly; the solution above is fine.
The following code did the trick
wchar_t path[1024] = L"gokulestás";
v8::String::New(reinterpret_cast<const uint16_t*>(path), static_cast<int>(wcslen(path))); // relies on wchar_t being 16-bit (UTF-16LE) on Windows

libxml2 questions about xmlChar*

I'm using libxml2. All functions work with xmlChar*. I found that xmlChar is an unsigned char.
So I have some questions about how to work with it.
1) For example, if I'm working with a UTF-16 or UTF-32 file, how does libxml2 process it, and what do the functions return as xmlChar? Will I lose some characters?
2) If I want to do something with this string, should I cast it to char* or wchar_t*, and how?
Will I lose some characters?
xmlChar is for handling UTF-8 encoded text only.
So, to answer your questions:
No, you won't lose any characters when using UTF-16 or UTF-32. Just use iconv or any other library to convert your UTF-16 or UTF-32 data to UTF-8 before passing it to the API.
Do not just "cast" the string. Convert it to whatever encoding you need (a sketch follows below).

Difference between MBCS and UTF-8 on Windows

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode. I thought Unicode can be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?
I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?
Many functions in the Windows API come in two versions: One that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
Each of these function pairs also has a macro without the suffix, which expands to one or the other depending on whether the UNICODE macro is defined.
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
This, however, was a bad idea. You should always explicitly specify the character type.
What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?
MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.
But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
Update: As of Windows 10 build 1903 (May 2019 update), UTF-8 is now supported as an "ANSI" code page. Thus, my original (2010) recommendation to always use "W" functions no longer applies, unless you need to support old versions of Windows. See UTF-8 Everywhere for text-handling advice.
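For reference, a minimal sketch of the convert-and-call-"W" pattern described above, assuming the input is UTF-8 (the helper name is mine):
#include <windows.h>
#include <string>

// Convert UTF-8 to UTF-16 so the "W" APIs can be called directly.
std::wstring Utf8ToWide(const std::string& utf8) {
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring wide(static_cast<size_t>(len), L'\0');  // len counts the terminating NUL
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.pop_back();                                      // drop the extra NUL
    return wide;
}

// Usage: MessageBoxW(nullptr, Utf8ToWide("\xE4\xBD\xA0\xE5\xA5\xBD").c_str(), L"Demo", MB_OK);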
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode.
MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on which of the two macros, _MBCS or _UNICODE, is defined:
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
To explain the difference between these string-length functions, consider the following example.
Suppose you have a machine running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile a GBK-encoded source file and run it.
printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", strlen("I爱你M"));
printf("%d\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4 6 3.
Here is the hexadecimal representation of I爱你M in GBK.
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows this string is encoded in GBK, so it interprets the string correctly and gets the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, 4D as M.
strlen only looks for the terminating 0x00 byte, so it gets 6.
wcslen treats this byte array as UTF-16LE and counts two bytes as one unit, so it gets 3: 49 B0, AE C4, E3 4D.
As #xiaokaoy pointed out, the only valid terminator for wcslen is 00 00, so the result is not guaranteed to be 3 if the following byte is not 00.
MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.
The ANSI / ASCII character sets are not multi-byte.
UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).
However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:
UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
It is therefore not correct that Unicode is a 16-bit character encoding. Unicode is rather a roughly 21-bit code space, encompassing code points U+0000 up to U+10FFFF.
As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.
As to the phrasing "Unicode" and "Multi-Byte Character Set", people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. (That is, they mean something less general and more particular-to-Windows than one might expect if coming from a non-Microsoft-specific understanding of text internationalization.) Those exact phrases show up and tend to get their own separate sections/subsections of microsoft technical documents, e.g. in Text and Strings in Visual C++

What kind of strings does CFStringCreateWithFormat expect as arguments?

The example below should work with Unicode strings, but it doesn't.
CFStringRef aString = CFSTR("one"); // in real life this is a Unicode string
CFStringRef formatString = CFSTR("This is %s example"); // also tried %S but without success
CFStringRef resultString = CFStringCreateWithFormat(NULL, NULL, formatString, aString);
// Here I should have a valid sentence in resultString but the current result is like aString would contain garbage.
Use %@ if you want to include a CFStringRef via CFStringCreateWithFormat.
See the Format Specifiers section of Strings Programming Guide for Core Foundation.
%@ is for Objective-C objects, OR CFTypeRef objects (CFStringRef is compatible with CFTypeRef)
%s is for a null-terminated array of 8-bit unsigned characters (i.e. normal C strings).
%S is for a null-terminated array of 16-bit Unicode characters.
A CFStringRef object is not the same as “a null-terminated array of 16-bit Unicode characters”.
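Applied to the code in the question, the corrected call would look something like this (%@ consumes the CFStringRef itself rather than a C string):
CFStringRef aString = CFSTR("one");
CFStringRef formatString = CFSTR("This is %@ example");
CFStringRef resultString =
    CFStringCreateWithFormat(kCFAllocatorDefault, NULL, formatString, aString);
// resultString now reads "This is one example"; release it with CFRelease when finished.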
As an answer to the comment on the other answer, I would recommend that the poster:
generate a UTF-8 string in a portable way into a char*,
and, at the last minute, convert it to a CFString using CFStringCreateWithCString with kCFStringEncodingUTF8 as the encoding (see the sketch below).
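A minimal sketch of that approach, with a made-up UTF-8 buffer for illustration:
#include <CoreFoundation/CoreFoundation.h>

const char* utf8Text = "caf\xC3\xA9";   // "café" encoded as UTF-8
CFStringRef text = CFStringCreateWithCString(kCFAllocatorDefault,
                                             utf8Text,
                                             kCFStringEncodingUTF8);
// ... use text (e.g. with CFStringCreateWithFormat and %@), then CFRelease(text);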
Please, please do not use %s in CFStringCreateWithFormat. Please do not rely on the "system encoding", which is MacRoman in Western European environments but not in other languages. The concept of a system encoding is inherently brain-dead, especially in East Asian environments (which I come from), where even characters inside the ASCII code range (below 127!) are modified. All hell breaks loose if you rely on the "system encoding". Fortunately, since 10.4, all of the methods which use the "system encoding" are deprecated, except %s...
I'm sorry I write this much for this small topic, but it was a real pity a few years ago when there were many nice apps which didn't work on Japanese/Korean Macs because of just this "system encoding." Please refer to this detailed explanation which I wrote a few years ago, if you're interested.
