Comparing and sorting Unicode filenames

Comparing and sorting Unicode filenames - windows

Using Delphi 2007 and TMS components for Unicode utils and interface (upgrading to Delphi 2009 for Unicode support is not an option).
I'm storing a list of filenames in a string list (TTntStringList). It's sorted and case insensitive. The default sort routine uses CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, ...) to compare strings (and the same for Find). However, this is a problem because that will equate dummyss.txt with dummyß.txt (for example), but on NTFS it's perfectly legal to have those two files in the same folder, i.e. they are treated as different names.
My understanding is that on Vista and newer, the correct way to compare filenames is to use CompareStringOrdinal. Is this correct?
On pre-Vista systems, what would be the correct way? I believe it should be CompareStringW(LOCALE_INVARIANT, ...) but I'm not entirely sure.
Thanks

Quote from MSDN article Handling Sorting in Your Applications:
CompareStringOrdinal compares two
Unicode strings to test for binary
equality, as opposed to linguistic
equality. Examples of such
non-linguistic strings are NTFS file
names, ...
CompareStringOrdinal requires Windows Vista or later.
Edit: Yes, it seems that in pre-Vista Windows you can use RtlCompareUnicodeString which is used internally by CompareStringOrdinal, too, and is available since Windows NT.

Related

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values that are delimited by the special ASCII unit delimiter 31 (1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However reading about the history of this is it appears to be a relic from the 1960s and I haven't seen many examples where strings are built and tokenised using this special character so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.

The lower 128 char map of ASCII is fully set in stone into the Unicode standard, this including characters 0->31. The only reason you don't see special ASCII chars in use in strings very often is simply because of human interfacing limitations: they do not visualize well (if at all) when displayed to screen or written to file, and you can't easily type them in from a keyboard either. They're also not allowed in un-escaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.

Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.

I work with barcodes in a warehouse setting. We use ASCII code 31 as a field-separator so that a single scan can populate multiple data fields with a single scan. So, consider the ramifications if you think your hash key could end up on a barcode.

"Safe" File naming

Are there any "unsafe" file names that can be encountered in Windows, Mac OS, Linux, etc?
For example:
New Video 2012-External Room
GED Practice Sheet
RgRrE-re-_d Da-
I've heard that even naming files with spaces, underscores, capital letters, and dashes could be potentially problematic, even though Windows doesn't include them in their list of forbidden characters. Is this true? I vaguely recall seeing programs that don't distinguish between uppercase and lowercase characters, and I know that HTML URLs encode unsafe ASCII characters as % (for example, spaces).

Both Unix-like (including Linux and Mac OS) and Windows should have no problem with underscores. Spaces should also generally be fine, but you occasionally find buggy code that can't handle them.
For Windows, it's not that capitals are problematic. It's that Windows filesystems are case-insensitive, so in some cases when interoperating (e.g. with a git repo which is case sensitive) you can end up with problems (e.g. the repo ends up with duplicates with different capitalization).
I'm not sure about -. One reason to avoid it is that - has special meaning for many command-line programs (e.g. rm -r). So you have to use annoying syntax like .\-r. I would also generally avoid more exotic ones like %.

It depends strongly on context of use. Certain non-forbidden characters can cause problems for certain programs, though the vast majority of applications which use standard system APIs should not encounter any issues.
Some programs (especially command-line tools) can be sensitive to the presence of spaces in the filename. Others may use only ASCII internally, and thus be incapable of handling filenames containing characters outside of basic ASCII. (Most modern OSes, by and large, will accept almost any Unicode character in a filename).
Some tools might require certain characters to be escaped (e.g. % in batch scripts), while others may not like having quotes in the filename.
Finally, a note on upper/lowercase: most Windows filesystems are case-preserving but otherwise case-insensitive, so upper/lowercase differences usually don't matter.
But, note that in almost every case, the files can still be used even if some workaround is needed to make them work.

"Windows uses UTF-16 as its internal encoding", what exactly does this mean?

Excuse me if the question is stupid, it's kind of confused me, suppose I have a application(no matter C, C++,.NET or Java) on my Windows XP, and this application will get data from a remote machine, the data contain Chinese characters, now if Chinese characters become junk, is it correct to say that Windows has nothing to do with this issue? because Windows uses UTF-16, and can handle Chinese characters properly.
On the other hand, suppose Windows uses ASCII as its internal encoding, does this mean that any applications on it can never display Chinese characters correctly?
Thanks in advance.

The Windows NT kernel uses UNICODE_STRING for many (or is it most?) named objects (e.g. files). The encoding is UTF-16.
Many of user-mode callable APIs expose pairs of almost identical functions, where one in the pair accepts Unicode strings and, the other, ANSI strings. The ANSI string versions end up converting names from ANSI to Unicode.
For example, when you call C's fopen() function, which accepts 8-bit non-Unicode file names, it ends up invoking CreateFileA() (ANSI), and that eventually calls NtCreateFile(), which accepts Unicode file names. One of NtCreateFile()'s parameters, the OBJECT_ATTRIBUTES structure, contains a pointer to a UNICODE_STRING structure.
If you, on the other hand, call MSVC++'s _wfopen() function, it will reach NtCreateFile() through CreateFileW() (Unicode) without the conversion.

To store any text in memory and display it on screen, the OS needs to handle that text in some encoding behind the scenes. What encoding that is specifically shouldn't matter to you. It could handle it as HTML encoded ASCII for all you know, as long as the APIs accept certain text and it outputs the right thing.
"Windows uses UTF-16 internally" means Windows happens to store and handle text internally as UTF-16. It also supports Chinese text. These two things aren't necessarily connected. Yes, using UTF-16 internally makes it easier to support Chinese, which is probably why the Windows engineers chose to go with UTF-16.

Unicode Normalization in Windows

I've been using "unicode strings" in Windows for as long as... I've learned about Unicode (e.g. after graduating). However, it always mystified me that the Win32API mentions "unicode" very loosely. In particular, "unicode" variant mentioned by MSN is UTF-16 (although the "wide char" terminology comes from the fact that it used to be UCS-2, which is not Unicode). However, it makes almost no mention of Unicode Normalization.
MSN has a few pages about Unicode and Unicode Normalization Forms and functions to change the normalization form. The page on normalization even says:
Win32 and the .NET Framework support all four normalization forms.
However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.
Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?
Question 2: must the strings passed to Win32API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?

From the MSDN article Using Unicode Normalization to Represent Strings.
Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
Update: I've included some specific details relating to Question #2.
In regards to the file system, normalization is not required - based on the article Naming Files, Paths, and Namespaces.
There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
In regards to SQL Server, no normalization is required - nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside of indexes; but I cannot find specific details on what that is. A SQL Server 2005 article states the same.
One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

what normalization form is used by default for user input
Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.
Keyboard layouts tend towards NFC because in the pre-Unicode days they'd've usually been outputting a single byte character in the local code page for each keypress. However there are exceptions.
For example using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (eg circumflex â) and some are typed as a combining diacritical (eg grave à). The graheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.
This isn't in normal form C (which would be ầ U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ U+0061,U+0302,U+0300).
There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.
are the kernel and file system normalization-agnostic?
Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have files with the names ầ.txt, ầ.txt and ầ.txt in the same folder.

First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:
But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

Win32 File Name Comparison

Does anyone know what culture settings Win32 uses when dealing with case-insensitive files names?
Is this something that varies based on the user's culture, or are the casing rules that Win32 uses culture invariant?

An approximate answer is at
Comparing Unicode file names the right way.
Basically, the recommendation is to uppercase both strings (using CharUpper, CharUpperBuff, or LCMapString), then compare using a binary comparison (i.e. memcmp or wmemcmp, not CompareString with an invariant locale). The file system doesn't do Unicode normalization, and the case rules are not dependent on locale settings.
There are unfortunate ambiguous cases when dealing with characters whose casing rules have changed across different versions of Unicode, but it's about as good as you can do.

Comparing file names in native code and Don't compare filenames are a couple of good blog posts on this topic. The first has C/C++ code for OrdinalIgnoreCaseCompareStrings, and the second tells you how that doesn't always work for filenames and what to do to mitigate that.
Then there are the Unicode problems. While these new OrdinalIgnoreCase string comparison algorithms are great for your local NTFS drive, they might not yield the right answer on your FAT drive, or a network share.
So what's the answer? When possible, let the file system tell you. CreateFile can tell you if a given filename exists. Just pick the right creation disposition. If you need to compare to handles, you can often use GetFileInformationByHandle; look at dwVolumeSerialNumber/nFileIndexHigh/nFileIndexLow.

If you're using .NET, the official recommendation from Microsoft is to use StringComparison.OrdinalIgnoreCase for comparison and ToUpperInvariant for normalization (to be later compared using Ordinal comparison). This also applies to Registry keys and values, environment variables etc.
See New Recommendations for Using Strings in Microsoft .NET 2.0 for more details.
Note that while it's reliable on NTFS, it can fail with network shares, for example. See #SteveSteiner's answer and links in his post for solutions.

According to Windows Driver Samples FastFAT and CDFS, it uses RtlUpcaseUnicodeString to convert a string to uppercase. According to a brief look in Ghidra, that uses an internal function named NLS_UPCASE, whose behavior is based on your current system codepage.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio