Unicode Normalization in Windows

I've been using "Unicode strings" in Windows for about as long as I've known about Unicode (i.e., since shortly after graduating). However, it has always mystified me that the Win32 API mentions "Unicode" very loosely. In particular, the "Unicode" variant that MSDN refers to is UTF-16 (although the "wide char" terminology dates from when it was UCS-2, which is not the same thing), yet the documentation makes almost no mention of Unicode normalization.
MSDN has a few pages about Unicode and Unicode Normalization Forms, and functions to change the normalization form. The page on normalization even says:
Win32 and the .NET Framework support all four normalization forms.
However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.
Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?
Question 2: must the strings passed to Win32API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?

From the MSDN article Using Unicode Normalization to Represent Strings:
Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
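If an application needs to guarantee form C regardless of where its text came from, the Win32 normalization functions mentioned above can check and convert. A minimal C++ sketch (assuming Windows Vista or later and linking against Normaliz.lib; the helper name is mine):

    #include <windows.h>
    #include <winnls.h>
    #include <vector>

    // Hypothetical helper: return the NFC (form C) version of a UTF-16 string.
    std::vector<wchar_t> ToFormC(const wchar_t* src, int len)
    {
        if (IsNormalizedString(NormalizationC, src, len))
            return std::vector<wchar_t>(src, src + len);   // already form C

        // A first call with a null buffer returns an estimated output size.
        int guess = NormalizeString(NormalizationC, src, len, nullptr, 0);
        std::vector<wchar_t> dst(guess > 0 ? guess : 0);
        int written = NormalizeString(NormalizationC, src, len,
                                      dst.data(), static_cast<int>(dst.size()));
        if (written <= 0)
            dst.clear();   // real code should check GetLastError() and retry if needed
        else
            dst.resize(written);
        return dst;
    }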
Update: I've included some specific details relating to Question #2.
Regarding the file system, normalization is not required, per the article Naming Files, Paths, and Namespaces:
There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
Regarding SQL Server, no normalization is required, nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside indexes, but I cannot find specific details on what that is. A SQL Server 2005 article states the same:
One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

what normalization form is used by default for user input
Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.
Keyboard layouts tend towards NFC because in the pre-Unicode days they would usually have been outputting a single-byte character from the local code page for each keypress. However, there are exceptions.
For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex â) and some are typed as a combining diacritical (e.g. grave à). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.
This is neither normal form C (which would be ầ, U+1EA7 LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE) nor normal form D (which would be ầ, U+0061,U+0302,U+0300).
There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.
are the kernel and file system normalization-agnostic?
Yes, the kernel and file system don't know anything about normalization and will quite happily allow you to have files named ầ.txt, ầ.txt and ầ.txt, identical on screen but encoded with different code point sequences, in the same folder.
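To see this for yourself, here is a small sketch (hypothetical file names): it creates two files whose names are canonically equivalent but encoded with different code point sequences, and on NTFS both creations succeed because the file system only compares the raw WCHARs.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // "ầ.txt" precomposed: U+1EA7
        const wchar_t precomposed[] = { 0x1EA7, L'.', L't', L'x', L't', 0 };
        // "ầ.txt" decomposed: U+0061 U+0302 U+0300
        const wchar_t decomposed[]  = { 0x0061, 0x0302, 0x0300, L'.', L't', L'x', L't', 0 };

        HANDLE h1 = CreateFileW(precomposed, GENERIC_WRITE, 0, nullptr,
                                CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
        HANDLE h2 = CreateFileW(decomposed, GENERIC_WRITE, 0, nullptr,
                                CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);

        // Both handles are valid: two distinct files that display identically.
        wprintf(L"precomposed: %ls\n", h1 != INVALID_HANDLE_VALUE ? L"created" : L"failed");
        wprintf(L"decomposed:  %ls\n", h2 != INVALID_HANDLE_VALUE ? L"created" : L"failed");

        if (h1 != INVALID_HANDLE_VALUE) CloseHandle(h1);
        if (h2 != INVALID_HANDLE_VALUE) CloseHandle(h2);
    }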

First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:
But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

Related

ド (U+30C9) vs ド (U+30C8 U+3099)
FYI, the situation is:
A user uploaded a file whose name contains ド (U+30C8 U+3099) to AWS S3 from a web app.
The website then sent a POST request containing the file name, without URL encoding, to an AWS Lambda function for further processing in Python. By the time the name arrived in Lambda it had become ド (U+30C9). Python then failed to access the file stored in S3 because of the difference in Unicode code points.
I think the solution would be to URL-encode the name on the frontend before sending the request and URL-decode it with urllib.parse.unquote, so that both sides end up with the same Unicode string.
My questions are:
Would URL encoding solve the issue? I can't reproduce it, probably because I am on a different OS from the user's.
How exactly did this happen, given that both requests (the upload to S3 and the second request to Lambda) originated on the user's machine?
Thank you.
You are hitting a common case (maybe more common with Latin scripts): canonical equivalence. Unicode requires canonically equivalent sequences to be handled in the same manner.
If you look in UnicodeData.txt you will find:
30C8;KATAKANA LETTER TO;Lo;0;L;;;;;N;;;;;
30C9;KATAKANA LETTER DO;Lo;0;L;30C8 3099;;;;N;;;;;
so 30C9 is canonically equivalent to the sequence 30C8 3099.
Usually it is better to normalize Unicode strings to a common canonical form. Unfortunately there are two of them: NFC and NFD, Normalization Form Canonical Composition and Normalization Form Canonical Decomposition. Apple prefers the latter (and Unicode's original design leaned toward that form); most other vendors prefer the former.
So do not trust web browsers to keep the same form. Also consider that input methods on the user's side may give you different variations (and with keyboards you may also get non-canonical sequences that should be normalized; this can happen when several combining characters are involved).
So, on your backend you should choose a normalization form and transform all input data into that form (or make sure that all search and comparison functions handle equivalent sequences correctly, but that requires normalization on every call, so it may be less efficient).
Python has unicodedata.normalize() in the standard library (see the unicodedata module) to normalize Unicode strings. In other languages you may need the ICU library. In any case, you should normalize your Unicode strings.
Note: this has nothing to do with the encoding; it is built directly into the design of Unicode. The reason is the requirement to be compatible with old encodings, and old encodings had both ways of describing the same characters.
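To make the equivalence concrete in code, here is a minimal C++ sketch using the Win32 NormalizeString function (this thread is otherwise Windows-centric; the equivalent Python step is unicodedata.normalize('NFC', s)). It shows that the two katakana spellings compare unequal as raw code units but match after NFC normalization.

    #include <windows.h>
    #include <winnls.h>    // NormalizeString; link with Normaliz.lib
    #include <cwchar>
    #include <cstdio>

    int main()
    {
        const wchar_t composed[]   = { 0x30C9, 0 };          // ド as one code point
        const wchar_t decomposed[] = { 0x30C8, 0x3099, 0 };  // ト + combining voiced sound mark

        // Raw code-unit comparison sees two different strings.
        wprintf(L"raw equal: %d\n", wcscmp(composed, decomposed) == 0);

        // After normalizing the decomposed spelling to NFC, the strings match.
        wchar_t nfc[8] = {};
        NormalizeString(NormalizationC, decomposed, -1, nfc, 8);
        wprintf(L"NFC equal: %d\n", wcscmp(composed, nfc) == 0);
    }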

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII unit separator, character 31 (0x1F).
This nicely solves the problem of trying to guess which ASCII characters won't be used in the string values, and I don't need to worry about escaping or quoting values, etc.
However, reading about the history of this, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenized using this special character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application; however, I'm interested to know how this applies in other languages such as Java and C#, and with Unicode.
The lower 128 characters of ASCII are set in stone in the Unicode standard, including the control characters 0 through 31. The only reason you don't see these control characters used in strings very often is human-interface limitations: they do not display well (if at all) on screen or in a file, and you can't easily type them on a keyboard either. They are also not allowed in unescaped form in various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
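For concreteness, a minimal C++ sketch of the kind of scheme described in the question (hypothetical field values; it assumes, as the question does, that the values themselves never contain byte 0x1F):

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    const char kUnitSep = '\x1F';   // ASCII 31, the "unit separator"

    // Join field values into one key string.
    std::string Join(const std::vector<std::string>& fields)
    {
        std::string key;
        for (size_t i = 0; i < fields.size(); ++i) {
            if (i) key += kUnitSep;
            key += fields[i];
        }
        return key;
    }

    // Split a key string back into its fields.
    std::vector<std::string> Split(const std::string& key)
    {
        std::vector<std::string> fields;
        std::istringstream in(key);
        std::string field;
        while (std::getline(in, field, kUnitSep))
            fields.push_back(field);
        return fields;
    }

    int main()
    {
        std::string key = Join({ "order", "42", "pending" });
        for (const auto& f : Split(key))
            std::cout << f << '\n';
    }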
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is: why invent another format instead of using XML or JSON, or, if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There is probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats have already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So, consider the ramifications if you think your hash key could end up on a barcode.

"Windows uses UTF-16 as its internal encoding", what exactly does this mean?

Excuse me if the question is stupid; it has kind of confused me. Suppose I have an application (no matter whether it's C, C++, .NET, or Java) on Windows XP, and this application gets data from a remote machine, and the data contains Chinese characters. Now, if the Chinese characters come out as junk, is it correct to say that Windows has nothing to do with the issue, because Windows uses UTF-16 and can handle Chinese characters properly?
On the other hand, if Windows used ASCII as its internal encoding, would that mean that no application running on it could ever display Chinese characters correctly?
Thanks in advance.
The Windows NT kernel uses UNICODE_STRING for many (or is it most?) named objects (e.g. files). The encoding is UTF-16.
Many user-mode callable APIs come in pairs of almost identical functions, where one of the pair accepts Unicode strings and the other ANSI strings. The ANSI versions end up converting the names from ANSI to Unicode.
For example, when you call C's fopen() function, which accepts 8-bit non-Unicode file names, it ends up invoking CreateFileA() (ANSI), and that eventually calls NtCreateFile(), which accepts Unicode file names. One of NtCreateFile()'s parameters, the OBJECT_ATTRIBUTES structure, contains a pointer to a UNICODE_STRING structure.
If you, on the other hand, call MSVC++'s _wfopen() function, it will reach NtCreateFile() through CreateFileW() (Unicode) without the conversion.
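As a small illustration (hypothetical file name), the wide-character route skips the ANSI conversion entirely, so characters outside the current ANSI code page survive intact:

    #include <cstdio>
    #include <cwchar>

    int main()
    {
        // _wfopen() -> CreateFileW() -> NtCreateFile(): the UTF-16 name is passed
        // through unchanged. fopen() with a narrow name would go through
        // CreateFileA() and an ANSI-to-UTF-16 conversion first.
        FILE* f = _wfopen(L"\u4F60\u597D.txt", L"w");   // "你好.txt"
        if (f) {
            fputws(L"hello\n", f);
            fclose(f);
        }
        return 0;
    }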
To store any text in memory and display it on screen, the OS needs to handle that text in some encoding behind the scenes. Which encoding that is specifically shouldn't matter to you. It could handle it as HTML-encoded ASCII for all you know, as long as the APIs accept your text and output the right thing.
"Windows uses UTF-16 internally" means Windows happens to store and handle text internally as UTF-16. It also supports Chinese text. These two things aren't necessarily connected. Yes, using UTF-16 internally makes it easier to support Chinese, which is probably why the Windows engineers chose to go with UTF-16.

ANSI or OEM Codepage when using MME and DirectMusic?

I noticed that when reading MIDI port names from MME, the names are multi-byte strings encoded using the ANSI Codepage, which my app uses by default. When receiving those names from the DirectMusic driver, the names are wide-character strings encoded with the OEM Codepage. See this article by Raymond Chen for a quick refresher on Codepages.
On my German system, this means that when using the current codepage, which turns out to be the ANSI one, I get "Audiogerät" from MME, and "Audiogeröt" from DirectMusic, the latter being wrong. This gets fixed when I treat that last name as OEM-encoded instead.
So how do I know with which codepage to decode those names? Why does the name coming from DirectMusic get encoded differently? Does it come from the USB driver? The COM framework? DirectMusic? How can I know for sure which codepage to use when reading the names of my MIDI ports?
For info:
I use the MultiByteToWideChar() and WideCharToMultiByte() functions to perform the conversions, with CP_ACP and CP_OEMCP as argument for the codepage to use.
I use midiInGetDevCaps() to get MIDI port information from the MME subsystem...
... and convert MIDIINCAPS.szPname using the CP_ACP (ANSI) codepage (sketched after this list).
I use IDirectMusic8::EnumPort() to get port information from DirectMusic...
... and convert DMUS_PORTCAPS.wszDescription using the CP_OEMCP codepage.
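For reference, a minimal sketch of the MME-side conversion described above (it assumes the name really is in the ANSI codepage, which is exactly the open question; link with winmm.lib):

    #include <windows.h>
    #include <mmsystem.h>
    #include <string>

    // Hypothetical helper: fetch an MME MIDI input port name and widen it from CP_ACP.
    std::wstring MmeInputPortName(UINT deviceId)
    {
        MIDIINCAPSA caps = {};
        if (midiInGetDevCapsA(deviceId, &caps, sizeof(caps)) != MMSYSERR_NOERROR)
            return L"";

        int len = MultiByteToWideChar(CP_ACP, 0, caps.szPname, -1, nullptr, 0);
        std::wstring name(len, L'\0');
        MultiByteToWideChar(CP_ACP, 0, caps.szPname, -1, &name[0], len);
        if (!name.empty())
            name.pop_back();   // drop the trailing null the API wrote into the buffer
        return name;
    }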
I don't know for sure why the DirectMusic framework would use one set of codepages and MME another, but the solution on your end is probably to build an abstraction layer and then make a specific implementation for each API. That way, the higher levels of your software don't need to concern themselves with details like this.
That said, the endpoint names definitely come from the OS. USB MIDI devices specify only endpoint types (ie, either input or output, and the number), but the OS is free to interpret them as it sees fit, which is why they are localized.
There is no specific API call (as far as I know) to find out which codepage the framework will deliver its strings in. However, DirectMusic does seem to use wide characters carrying OEM-codepage data as a general convention, though I could not find this clearly stated in any of the MSDN docs. In the MSDN DirectMusic documentation about MIDI port capability structures, the description field is clearly defined as WCHAR, and the Game Audio Programming book also seems to indicate that this is an API-wide convention. While it's dangerous to assume that OEM is the default encoding for these characters, I can't find anything that says otherwise (and googling for "DirectMusic codepage" now lists this page as the top hit).
Edit: Check out this stackoverflow question on determining the current OS codepage. It is possible that the DirectMusic API sets the codepage in this manner.
There isn't really an automatic way to tell what codepage is used for these types of data. See here: How can I detect the encoding/codepage of a text file

Win32 File Name Comparison

Does anyone know what culture settings Win32 uses when dealing with case-insensitive files names?
Is this something that varies based on the user's culture, or are the casing rules that Win32 uses culture invariant?
An approximate answer is at
Comparing Unicode file names the right way.
Basically, the recommendation is to uppercase both strings (using CharUpper, CharUpperBuff, or LCMapString), then compare using a binary comparison (i.e. memcmp or wmemcmp, not CompareString with an invariant locale). The file system doesn't do Unicode normalization, and the case rules are not dependent on locale settings.
There are unfortunate ambiguous cases when dealing with characters whose casing rules have changed across different versions of Unicode, but it's about as good as you can do.
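A minimal C++ sketch of that recipe, assuming CharUpperBuffW as the uppercasing step (on Vista and later, CompareStringOrdinal with bIgnoreCase = TRUE packages the same idea into one call):

    #include <windows.h>   // CharUpperBuffW (user32), CompareStringOrdinal (Vista+)
    #include <string>

    // Locale-independent, case-insensitive comparison of two file names.
    bool FileNamesEqualOrdinalIgnoreCase(std::wstring a, std::wstring b)
    {
        // Uppercase each UTF-16 code unit in place, then compare the raw code units.
        if (!a.empty()) CharUpperBuffW(&a[0], static_cast<DWORD>(a.size()));
        if (!b.empty()) CharUpperBuffW(&b[0], static_cast<DWORD>(b.size()));
        return a == b;   // binary (ordinal) comparison of the uppercased strings
    }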
Comparing file names in native code and Don't compare filenames are a couple of good blog posts on this topic. The first has C/C++ code for OrdinalIgnoreCaseCompareStrings, and the second tells you how that doesn't always work for filenames and what to do to mitigate that.
Then there are the Unicode problems. While these new OrdinalIgnoreCase string comparison algorithms are great for your local NTFS drive, they might not yield the right answer on your FAT drive, or a network share.
So what's the answer? When possible, let the file system tell you. CreateFile can tell you whether a given filename exists; just pick the right creation disposition. If you need to compare two handles, you can often use GetFileInformationByHandle; look at dwVolumeSerialNumber/nFileIndexHigh/nFileIndexLow.
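A minimal sketch of the "let the file system tell you" approach via GetFileInformationByHandle (error handling kept to a minimum):

    #include <windows.h>

    // Do two paths refer to the same underlying file?
    bool SameFile(const wchar_t* path1, const wchar_t* path2)
    {
        // Desired access 0 is enough to query file information.
        const DWORD share = FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE;
        HANDLE h1 = CreateFileW(path1, 0, share, nullptr, OPEN_EXISTING,
                                FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        HANDLE h2 = CreateFileW(path2, 0, share, nullptr, OPEN_EXISTING,
                                FILE_FLAG_BACKUP_SEMANTICS, nullptr);

        bool same = false;
        BY_HANDLE_FILE_INFORMATION i1, i2;
        if (h1 != INVALID_HANDLE_VALUE && h2 != INVALID_HANDLE_VALUE &&
            GetFileInformationByHandle(h1, &i1) && GetFileInformationByHandle(h2, &i2))
        {
            same = i1.dwVolumeSerialNumber == i2.dwVolumeSerialNumber &&
                   i1.nFileIndexHigh == i2.nFileIndexHigh &&
                   i1.nFileIndexLow == i2.nFileIndexLow;
        }
        if (h1 != INVALID_HANDLE_VALUE) CloseHandle(h1);
        if (h2 != INVALID_HANDLE_VALUE) CloseHandle(h2);
        return same;
    }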
If you're using .NET, the official recommendation from Microsoft is to use StringComparison.OrdinalIgnoreCase for comparison and ToUpperInvariant for normalization (to be later compared using Ordinal comparison). This also applies to Registry keys and values, environment variables etc.
See New Recommendations for Using Strings in Microsoft .NET 2.0 for more details.
Note that while it's reliable on NTFS, it can fail with network shares, for example. See SteveSteiner's answer and the links in his post for solutions.
According to the Windows driver samples for FastFAT and CDFS, the file system uses RtlUpcaseUnicodeString to convert a string to uppercase. From a brief look in Ghidra, that in turn uses an internal function named NLS_UPCASE, whose behavior is based on your current system codepage.

Resources