Why doesn't D3DCompile support Unicode sources? - windows

If I pass a buffer containing Unicode source (wchar_t*) into D3DCompile, I get an "unrecognized character" error.
Why doesn't it support Unicode source? Don't all Windows APIs and COM support Unicode?

In general, Win32 APIs that accept text as input support both char* and wchar_t*. D3DCompile is an exception and only accepts char*. As long as you avoid Unicode characters outside of comments, you can use the WideCharToMultiByte function to convert the source to a form that D3DCompile accepts.
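The conversion the answer describes can be sketched portably. This is a hypothetical helper, not a Windows API: it handles the ASCII-only case the answer mentions, where a plain narrowing copy produces the same bytes that WideCharToMultiByte would. On Windows you would call WideCharToMultiByte itself.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow a wide HLSL source to the char buffer
// D3DCompile expects. Works only for ASCII content, matching the
// answer's caveat about avoiding Unicode outside comments.
std::string NarrowShaderSource(const std::wstring& src) {
    std::string out;
    out.reserve(src.size());
    for (wchar_t ch : src) {
        if (ch > 0x7F)  // non-ASCII would need a real conversion
            throw std::runtime_error("non-ASCII character in shader source");
        out.push_back(static_cast<char>(ch));
    }
    return out;
}
```

The resulting string's data() and size() can then be passed to D3DCompile as the source buffer.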

Related

Is there a Unicode alternative to OutputDebugString?

OutputDebugString() is natively ANSI, which means it converts the input Unicode string to a local (ANSI) string before calling the ANSI version, OutputDebugStringA().
Is there any alternative to OutputDebugString() which supports Unicode?
OutputDebugStringW does internally call OutputDebugStringA, so Unicode characters that cannot be represented in the system code page will be replaced with ?.
Oddly enough, the OUTPUT_DEBUG_STRING_INFO structure the debugger receives from the operating system to print the message does appear to support letting the debugger know if the string is Unicode, it just doesn't appear to be used by OutputDebugStringW at all.
Unfortunately, I don't know of a mechanism to get the OS to raise an OUTPUT_DEBUG_STRING_EVENT with a Unicode string. It may not be possible with public APIs.
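The lossy step described above can be modeled in a few lines. This is a sketch, not the real implementation: it assumes a plain-ASCII ANSI code page, while the actual OutputDebugStringW conversion goes through WideCharToMultiByte with the system code page.

```cpp
#include <cassert>
#include <string>

// Model of the lossy Unicode-to-ANSI conversion inside OutputDebugStringW:
// any character the (here: plain-ASCII) code page cannot represent is
// replaced with '?', which is why Unicode debug output shows question marks.
std::string ToAnsiLossy(const std::wstring& s) {
    std::string out;
    out.reserve(s.size());
    for (wchar_t ch : s)
        out.push_back(ch <= 0x7F ? static_cast<char>(ch) : '?');
    return out;
}
```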

Making Indy return UTF-8 strings instead of ANSI strings in FPC/Lazarus?

Are there any compiler directives or preprocessor commands that need to be set in a particular way to make Indy return UTF-8 strings rather than truncating them into ANSI strings? The project I'm working on has all kinds of Delphi-mode flags all over it, if that matters.
If I directly set the subject line to a UTF-8 string (like below), it displays correctly on the GUI, so UTF-8 support is set up correctly and I'm using an appropriate font and all of that good stuff. Subject is declared as Utf8String for clarity in this code.
MailItem.Subject := 'îņŢëŕŃïóЙǟŁ ŜũƥĵεϿד'; //Displays correctly
However, if I pull the same subject line from the header, using Indy to decode it, every international character comes back as exactly one question mark each. It looks like it's converting UTF-8 to ANSI, which is not what I want.
MailItem.Subject := IdCoderHeader.DecodeHeader('=?utf-8?B?w67FhsWiw6vFlcWDw6/Ds9CZx5/FgSDFnMWpxqXEtc61z7/Xkw==?='); //Displays '?????????? ???????'
So what things could be going wrong and/or how can I fix it?
I am using the latest version of Indy10 from Indy's website, and Lazarus 1.0 on Windows, so I don't think this is a "my software needs updating" bug, I think it's probably some sort of configuration issue.
No, there are no compiler flags you can set for that.
Indy uses AnsiString in Delphi prior to D2009, and in FreePascal prior to 3.0. Indy uses UnicodeString in Delphi 2009+ and FreePascal 3.0+. There is no option to change that.
However, in non-Unicode versions of Delphi and FreePascal, Indy lets you, in some places, instruct it to interpret AnsiString input values as UTF-8 and return UTF-8-encoded AnsiString output. DecodeHeader() is not one of those places, though.

"Windows uses UTF-16 as its internal encoding", what exactly does this mean?

Excuse me if the question is stupid; it's kind of confusing me. Suppose I have an application (no matter whether C, C++, .NET, or Java) on my Windows XP machine, and this application gets data from a remote machine. The data contains Chinese characters. Now, if the Chinese characters come out as junk, is it correct to say that Windows has nothing to do with the issue, because Windows uses UTF-16 and can handle Chinese characters properly?
On the other hand, suppose Windows used ASCII as its internal encoding: would that mean no application on it could ever display Chinese characters correctly?
Thanks in advance.
The Windows NT kernel uses UNICODE_STRING for many (or is it most?) named objects (e.g. files). The encoding is UTF-16.
Many user-mode callable APIs come in pairs of almost identical functions, where one of the pair accepts Unicode strings and the other ANSI strings. The ANSI versions end up converting names from ANSI to Unicode.
For example, when you call C's fopen() function, which accepts 8-bit non-Unicode file names, it ends up invoking CreateFileA() (ANSI), and that eventually calls NtCreateFile(), which accepts Unicode file names. One of NtCreateFile()'s parameters, the OBJECT_ATTRIBUTES structure, contains a pointer to a UNICODE_STRING structure.
If you, on the other hand, call MSVC++'s _wfopen() function, it will reach NtCreateFile() through CreateFileW() (Unicode) without the conversion.
To store any text in memory and display it on screen, the OS needs to handle that text in some encoding behind the scenes. What encoding that is specifically shouldn't matter to you. It could handle it as HTML encoded ASCII for all you know, as long as the APIs accept certain text and it outputs the right thing.
"Windows uses UTF-16 internally" means Windows happens to store and handle text internally as UTF-16. It also supports Chinese text. These two things aren't necessarily connected. Yes, using UTF-16 internally makes it easier to support Chinese, which is probably why the Windows engineers chose to go with UTF-16.

Convert output of Windows LPWSTR API's to UTF-8?

Pardon my lack of knowledge in this area. Windows natively uses some type of multi-byte encoding (is it UTF-16?). Regardless, I am using a regular expression library that needs the output in UTF-8. What is the Windows API to convert a standard 2-byte LPWSTR to UTF-8?
WideCharToMultiByte, with CP_UTF8 as the first argument.
By the way, since English text includes the pound sign, euro sign, etc., your language is affected just as much as others.
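What WideCharToMultiByte(CP_UTF8, ...) does can be sketched portably: each UTF-16 code unit is re-encoded as one to three UTF-8 bytes. This sketch handles BMP characters only (no surrogate-pair handling), for brevity; for example, the pound sign U+00A3 becomes the two bytes 0xC2 0xA3.

```cpp
#include <cassert>
#include <string>

// Portable sketch of the UTF-16 -> UTF-8 conversion performed by
// WideCharToMultiByte(CP_UTF8, 0, src, -1, ...). BMP-only.
std::string Utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (char16_t c : in) {
        if (c < 0x80) {                                  // 1 byte: ASCII
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {                          // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {                                         // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```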

What encoding are filenames in NTFS stored as?

I'm just getting started on some programming to handle filenames with non-english names on a WinXP system. I've done some recommended reading on unicode and I think I get the basic idea, but some parts are still not very clear to me.
Specifically, what encoding (UTF-8, UTF-16LE/BE) are the file names (not the content, but the actual name of the file) stored in on NTFS? Is it possible to open any file using fopen(), which takes a char*, or do I have no choice but to use _wfopen(), which takes a wchar_t* and presumably a UTF-16 string?
I tried manually feeding in a UTF-8 encoded string to fopen(), eg.
unsigned char filename[] = {0xEA, 0xB0, 0x80, 0x2E, 0x74, 0x78, 0x74, 0x0}; // 가.txt
FILE* f = fopen((char*)filename, "wb+");
but this came out as 'ê°€.txt'.
I was under the impression (which may be wrong) that a UTF8-encoded string would suffice in opening any filename under Windows, because I seem to vaguely remember some Windows application passing around (char*), not (wchar_t*), and having no problems.
Can anyone shed some light on this?
NTFS stores filenames in UTF-16; fopen, however, uses ANSI (not UTF-8).
In order to use a UTF-16-encoded file name, you will need to use the Unicode versions of the file-open calls. Do this by defining UNICODE and _UNICODE in your project, then use the CreateFile call or the _wfopen call.
fopen() in MSVC on Windows does not (by default) take a UTF-8-encoded char*.
Unfortunately, UTF-8 was invented rather recently in the grand scheme of things. Windows APIs are divided into Unicode and ANSI versions: every Windows API that takes or deals with strings is actually available with a W or A suffix, W for "wide" character/Unicode and A for ANSI. Macro magic hides all this from the developer, so you just call CreateFile with either a char* or a wchar_t* depending on your build configuration, without knowing the difference.
The 'ANSI' encoding is not actually one specific encoding: it means that the encoding used for "char" strings depends on the locale setting of the PC.
Now, because C-runtime functions like fopen need to work by default without developer intervention, on Windows they expect to receive their strings in the Windows locale encoding. MSDN indicates that the Microsoft C-runtime API setlocale can change the locale of the current thread, but it specifically says this will fail for any locale that needs more than 2 bytes per character, such as UTF-8.
So, on Windows there is no shortcut. You need to use _wfopen, or the native API CreateFileW (or build your project with the Unicode settings and just call CreateFile), with wchar_t* strings.
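This also explains the 'ê°€.txt' result in the question: fopen hands the three UTF-8 bytes of 가 (0xEA 0xB0 0x80) to the ANSI layer, which reads them as three separate Windows-1252 characters. A tiny sketch, modeling only the byte values needed for this example (a full CP1252 table has more special cases in 0x80-0x9F):

```cpp
#include <cassert>

// Map one Windows-1252 byte to its Unicode code point, for the bytes in
// the question's example: 0xEA -> 'ê', 0xB0 -> '°', 0x80 -> '€', which
// together produce the garbled name "ê°€.txt".
char32_t Cp1252ToCodePoint(unsigned char b) {
    if (b == 0x80) return 0x20AC;  // Windows-1252 maps 0x80 to the euro sign
    return b;                      // 0xA0-0xFF coincide with Latin-1
}
```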
As answered by others, the best way to handle UTF-8-encoded strings is to convert them to UTF-16 and use native Unicode APIs such as _wfopen or CreateFileW.
However, this approach won't help when calling into libraries that use fopen() unconditionally because they do not support Unicode or because they are written in portable C. In that case it is still possible to make use of the legacy "short paths" to convert a UTF-8-encoded string into an ASCII form usable with fopen, but it requires some legwork:
Convert the UTF-8 representation to UTF-16 using MultiByteToWideChar.
Use GetShortPathNameW to obtain a "short path" which is ASCII-only. GetShortPathNameW returns it as a wide string with all-ASCII content, which you can trivially convert to a narrow string by a lossless copy, casting each wchar_t to char.
Pass the short path to fopen() or to the code that will eventually use fopen(). Be aware that error messages printed by that code, if any, will refer to the unsightly "short path" (e.g. KINTO~1 instead of kinto-un-筋斗雲).
While this is not exactly a recommended long-term strategy, as Windows short paths are a legacy feature that can be turned off per-volume, it is likely the only way to pass file names to code that uses fopen() and other file-related API calls (stat, access, ANSI versions of CreateFile and similar).
