OS X - how to calculate normalized file name - macos

I need to create a mapping between file names generated on Windows and OS X. I know that OS X "converts all file names to decomposed Unicode"; however, "most volume formats do not follow the exact specification for these normal forms".
So it does not seem to be a simple matter of converting the Windows name to NFD with a standard UTF-8 API and being sure I have the correct OS X name. Is there a way to determine what the actual OS X file name will be without actually creating the file in the file system and then scanning the directory to see what was created?

I think the answer is this, from TechNote 1150, HFS Plus Volume Format:
Note: The Mac OS Text Encoding Converter provides several constants
that let you convert to and from the canonical, decomposed form stored
on HFS Plus volumes. When using CreateTextEncoding to create a text
encoding, you should set the TextEncodingBase to
kTextEncodingUnicodeV2_0, set the TextEncodingVariant to
kUnicodeCanonicalDecompVariant, and set the TextEncodingFormat to
kUnicode16BitFormat. Using these values ensures that the Unicode will
be in the same form as on an HFS Plus volume, even as the Unicode
standard evolves.

You're probably looking for the -[NSString fileSystemRepresentation] method.
Note that there is no general solution for this task. What counts as a valid file name depends on the filesystem of the volume you're saving to. Not every file name that is valid on HFS+ is valid on FAT32, for example.
For the Mac's “standard” filesystem (currently HFS+), fileSystemRepresentation should give you what you need; for other file systems there is no general way. Think of filesystems that don't exist yet but will be introduced in the future, for example :)
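For illustration, here is a minimal sketch in C using Core Foundation, which exposes the same conversion as the Cocoa method above; the file name is just an example:

/* Sketch: obtain the decomposed (file-system) representation of a name.
   Build on OS X with: cc demo.c -framework CoreFoundation */
#include <CoreFoundation/CoreFoundation.h>
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* "é" here is the single precomposed code point U+00E9,
       as a Windows-generated name would typically contain. */
    CFStringRef name = CFStringCreateWithCString(kCFAllocatorDefault,
                                                 "r\xC3\xA9sum\xC3\xA9.txt",
                                                 kCFStringEncodingUTF8);

    char buf[PATH_MAX];
    if (CFStringGetFileSystemRepresentation(name, buf, sizeof(buf))) {
        /* On HFS+ this is the decomposed form: 'e' (U+0065) followed by
           the combining acute accent (U+0301), encoded as UTF-8. */
        for (const unsigned char *p = (const unsigned char *)buf; *p; p++)
            printf("%02X ", *p);
        printf("\n");
    }
    CFRelease(name);
    return 0;
}

On an HFS+ volume, the bytes printed are what the directory entry will actually contain; on other volume formats the conversion may differ, as noted above.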

According to your link, filesystem drivers appear to (mostly) follow one of two behaviours:
* Return all names in NFD, and convert names as appropriate.
* Don't perform any conversions.
In both of these cases, if you create a file on OS X in NFD, reading it back on OS X should give you the name in NFD.
OTOH, if your filename goes from Windows → NFS → Mac and you want to do some sort of sync, you're out of luck. This is not easy to do, since the underlying problem is a little philosophical: should filenames be byte strings or Unicode strings? I believe Unix traditionally does the former, and at least on Linux, UTF-8 NFC names are merely a convention.
(It gets worse: IIRC HFS+ is defined to use Unicode 3.something, so a naïve conversion to NFD might be wrong for characters added or changed since then, unless the API you use can guarantee a specific Unicode version.)

Related

Unicode filenames on FAT-32?

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).
But the official MSDN documentation is very vague about which codepage(s) are used to store filenames (filepaths) on FAT-32.
Here it says that the OEM code page (CP437, I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx
But here it turns out that there can be different OEM codepages, with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx
And we all know that utilities like mount support many more codepages for FAT than just the OEM codepage set.
So what is the actual codepage for FAT-32 filenames? Does it depend on the system codepage at the time the FAT volume was created? Can FAT support a true double-byte character set like UTF-16, or are multi-byte character sets like UTF-8 the limit?
And a more specific question:
What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
You might have to experiment here. This is a great question, and I'm not 100% confident, but:
So what is the actual codepage for FAT-32 filenames? Does it depend on the system codepage at the time the FAT volume was created?
The "OEM codepage", whatever that is for the system.
Can FAT support a true double-byte character set like UTF-16, or are multi-byte character sets like UTF-8 the limit?
No, I don't believe FAT's basic directory entries are capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename in an out-of-band manner, so a file effectively has two filenames. (This is also how you can have filenames longer than 8.3 characters.)
And a more specific question: What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
The Unicode filename, as passed to CreateFileW, is stored directly in that out-of-band (long) filename. It is also re-encoded into the OEM codepage (whatever that happens to be on the system) to produce the short name. If it cannot be converted into the OEM codepage, or exceeds 8.3 characters, Windows will call the file something like FILENA~1.TXT.
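If you want to try this yourself, here is a rough sketch of the experiment, assuming a FAT-32 volume mounted as X: (the drive letter and file name are made up):

/* Sketch: create a file whose name contains a non-ASCII character on a
   FAT-32 volume, then inspect the short (8.3) and long names with `dir /X`. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* U+03A3 (Σ) is in CP437 but not in Windows-1252; the path is an assumption. */
    const wchar_t *path = L"X:\\asdf\u03A3.txt";

    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }
    CloseHandle(h);
    /* The long (UTF-16) name goes into the VFAT long-filename entries;
       the short name is derived by re-encoding into the OEM codepage. */
    return 0;
}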
Some citations for these answers:
First, this page tells us that the OEM code page != the Windows code page:
Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.
On a typical American system, the OEM code page is CP437, but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page, typically Windows-1252 on an American machine, though it depends on the locale.)
If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252; however, it is in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:
ASDFΣ.TXT asdfΣ.txt
However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:
ASDF~1.TXT asdf?.txt
(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
For information about long filenames, see this Wikipedia article.
Also, interestingly, if you name a file "asdf©.txt", you might get:
ASDFC.TXT asdfc.txt
… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for "©", and did likewise when displaying it. If you change the font to something that isn't raster-based, like Consolas, you'll see:
ASDFC.TXT asdf©.txt
And this is why you should use the FooW functions.
The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long-filename support), which is what Windows uses, can store an additional, so-called long filename for each file, in UTF-16.
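If you would rather not squint at dir /X output, here is a small sketch (the path is an assumption) that asks Windows for the 8.3 name VFAT generated for a given long name:

/* Sketch: retrieve the short (8.3) name that corresponds to a long name. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const wchar_t *longName = L"X:\\asdf\u03A3.txt";   /* assumed path */
    wchar_t shortName[MAX_PATH];

    DWORD n = GetShortPathNameW(longName, shortName, MAX_PATH);
    if (n == 0 || n > MAX_PATH) {
        fprintf(stderr, "GetShortPathNameW failed: %lu\n", GetLastError());
        return 1;
    }
    wprintf(L"long : %ls\nshort: %ls\n", longName, shortName);
    return 0;
}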

Meaning of this string \\.\c:

I'm reading this. Here I've found some code lines, for example: wsprintf(szDrive, "\\\\.\\%c:", *lpszSrc); I want to ask: what does this string produce?
I tried to look for information but all that I've found is:
In the ANSI version of this function, the name is limited to MAX_PATH
characters. To extend this limit to 32,767 wide characters, call the
Unicode version of the function and prepend "\\?\" to the path. For
more information, see Naming Files, Paths, and Namespaces.
and this does not answer my question, so I'm asking here. I think it is connected with something Windows-specific or with NTFS, but I'm not sure about that.
The %c is the single character format specifier for wsprintf.
The code is used to generate path names of this form:
\\.\C:
This is the path to a physical volume. You use such a path when performing file operations directly on a volume, bypassing the file system. So you'd use such a path when implementing raw disk copy, for example. The documentation for CreateFile has more detail.
This all ties in with the fact that the code you found this in performs a raw disk copy.
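For example, here is a minimal sketch of opening the raw volume and reading the start of it (the drive letter is an assumption, and opening a raw volume normally requires administrator rights):

/* Sketch: open \\.\C: directly and read from the beginning of the volume.
   Raw volume reads must be a multiple of the sector size, so 4096 bytes
   covers both 512-byte and 4K-sector disks. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE h = CreateFileW(L"\\\\.\\C:",                 /* a volume, not a file */
                           GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }

    BYTE buf[4096];
    DWORD bytesRead = 0;
    if (ReadFile(h, buf, sizeof(buf), &bytesRead, NULL))
        printf("read %lu bytes from the start of the volume\n", bytesRead);

    CloseHandle(h);
    return 0;
}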

Colon (:) appears as forward slash (/) when creating file name

I am using the date and time to label a new file that I'm creating, but when I view the file, the colon appears as a forward slash. I am developing on a Mac, using 10.7+.
Here is the code I'm using:
# time is a Time instance (e.g. time = Time.now); pKey, tKey, etc. are defined elsewhere
File.open("#{time.hour} : 00, #{time.month}-#{time.day}-#{time.year}", "a") do |mFile|
  mFile.syswrite("#{pKey} - #{tKey}: \n")
  mFile.syswrite("Items closed: #{itemsClosed} | Total items: #{totalItems} | Percent closed: % #{pClosed} \n")
  mFile.syswrite("\n")
end
Here is the output (assuming the time is 1pm):
13 / 00, 11-8-2012
Why is this happening and how can I fix it? I want the output to be:
13:00, 11-8-2012
Once upon a time, before Mac OS X, : was the directory separator instead of /. Apparently OS X 10.7 is still trying to fix up programs like that. I don't know how you can fix this, if you really need the : to be there. I'd omit it :-).
EDIT: After a bit more searching, this USENIX paper describes what is going on. The rule they use is apparently this:
Another obvious problem is the different path separators between HFS+ (colon, ':') and UFS
(slash, '/'). This also means that HFS+ file names may contain the slash character and not
colons, while the opposite is true for UFS file names. This was easy to address, though it
involves transforming strings back and forth. The HFS+ implementation in the kernel's VFS
layer converts colon to slash and vice versa when reading from and writing to the on-disk
format. So on disk the separator is a colon, but at the VFS layer (and therefore anything
above it and the kernel, such as libc) it's a slash. However, the traditional Mac OS
toolkits expect colons, so above the BSD layer, the core Carbon toolkit does yet another
translation. The result is that Carbon applications see colons, and everyone else sees
slashes. This can create a user-visible schizophrenia in the rare cases of file names
containing colon characters, which appear to Carbon applications as slash characters, but
to BSD programs and Cocoa applications as colons.
While OS X "is" a unix operating system, it also derives quite a bit of its code, APIs, standards, etc. from Mac OS 9. In unix, file paths have "/" separating the elements and ":" is allowed in the names of individual files and directories. In Mac OS 9, it was the other way around: file paths had ":" between elements and "/" was allowed in individual filenames. When Apple developed OS X, they wound up having to support some APIs that used unix-style file paths and some APIs that used OS 9-style paths, and both had to be able to work on the same filesystem.
What they did is to swap delimiters and allowed characters depending on context. If you write (/run) a program that uses unix APIs to access the filesystem, you'll see files with colons in their names and slashes separating path elements. If you write (/run) a program that uses the old OS 9 APIs (or their derivatives), you'll see files with slashes in their names and colons separating path elements. See Apple's developer Q&A #1392 and notes on specifying paths in AppleScript for a bit more discussion.
(There are some other differences as well. A unix path is absolute if it starts with the delimiter ("/"), and absolute paths start at the top of the root volume. An OS 9 path is absolute if it doesn't start with a delimiter, and absolute OS 9 paths start with a volume name. Thus, the unix path "/tmp/foo:bar" is equivalent to the OS 9 path "Macintosh HD:tmp:foo/bar".)
So, which character is really in the filename, a slash or a colon? Well, a filename is a rather abstract thing, but if you're asking about the bytes that're actually stored on the disk... if it's on an HFS+ (aka Mac OS Extended) volume, it's being stored in a filesystem that was designed to work with the OS 9 (well, technically Mac OS 8.1) APIs, so it allows slashes but forbids colons, so on an HFS+ volume the file will "really" have a slash in the name. OTOH if you store the file on a unixish volume, it'll be stored using the unix convention, and "really" have a colon in the name. But the difference doesn't really matter unless you're reading raw bytes off the disk or writing a filesystem driver...
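Here is a tiny C sketch of the BSD-layer view (the path is made up); the POSIX API happily accepts a colon in a file name, and the Finder then shows the same file as "foo/bar":

/* Sketch: at the POSIX layer, a colon is an ordinary filename character. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Creates /tmp/foo:bar -- Carbon/Cocoa clients such as the Finder
       display this name as "foo/bar". */
    int fd = open("/tmp/foo:bar", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);
    return 0;
}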
Finally, why does the Finder display the controversial filename character as slash rather than colon? I'm pretty sure it's mostly inertia. The Finder isn't even entirely consistent about this, since if you use its Go To Folder option (Command-Shift-G) and type in "/Users/Shared", it treats that as a unix path. If you type in "Macintosh HD:Users:Shared", it has no idea what you're talking about. Furthermore, if you run touch /tmp/foo:bar, then try to get to it with Go To Folder:
Entering "/tmp/foo:bar" works.
Entering "/tmp/fo" then pressing tab autocompletes it to "/tmp/foo/bar/", which works.
Entering "/tmp/foo/bar/" fails, even though it's exactly the same as the autocomplete.
Entering "/tmp/foo" then pressing tab autocompletes to "/tmp/foo/", which cannot be autocompleted any further and doesn't work at all.
Update: as Konrad Rudolph pointed out, the Go To Folder behavior has changed as of El Capitan, and there's no longer any way to use it to get to folders containing the controversial character.
To avoid as many problems as possible when dealing with file names, paths, and various OSes, you really should take advantage of Ruby's built-in File methods, like join, dirname, basename, extname, and split. They avoid system dependencies and give you a programmatic way to generate valid filenames cross-platform.
This problem was a lot worse back when Apple used the old Mac OS. The move to Mac OS X helped, because they dropped ":" as the path separator, but people who built filenames by hand still found their code breaking because it generated the wrong delimiters, whereas code that relied on the libraries handled the change.
Because this particular problem isn't a bug, and it's in Apple's control rather than Ruby's, I'd say it's not a Ruby problem at all; it's a visualization issue. If you want the filename to resemble what the Finder displays, code accordingly.

Why does Apple use .plist files?

Why does Apple use .plist files?
Windows uses .ini files, which may be less flexible, but they also take up less space, for the same reason that JSON takes up less space than XML.
They could even use JSON for their configuration; it's at least as easy to parse, supports the value types they need (dict, etc.), and takes up the least space.
The original property list format found in NeXTSTEP looked a lot like JSON, but with slightly different syntax. When NeXTSTEP became Mac OS X, that format was replaced with the XML version you see today. The new format had a few improvements over the old one which you can read about in that link.
Property lists can hold several data types that JSON (and INI files) cannot: numbers explicitly typed as real (floating point) or integer, dates, and binary data (base64-encoded in the XML format). Also, JSON wasn't documented publicly until well after Mac OS X was released.
Mac OS X 10.2 and newer include a binary plist format that's much more space-efficient than XML, and plist files can be converted losslessly between the two.
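As a rough illustration (a sketch using Core Foundation; the dictionary contents are invented), the same property list can be serialized in either format and the sizes compared:

/* Sketch: serialize one dictionary as an XML plist and as a binary plist. */
#include <CoreFoundation/CoreFoundation.h>
#include <stdio.h>

int main(void) {
    CFStringRef key   = CFSTR("LastOpened");
    CFDateRef   value = CFDateCreate(kCFAllocatorDefault, CFAbsoluteTimeGetCurrent());
    CFDictionaryRef dict = CFDictionaryCreate(kCFAllocatorDefault,
                                              (const void **)&key,
                                              (const void **)&value, 1,
                                              &kCFTypeDictionaryKeyCallBacks,
                                              &kCFTypeDictionaryValueCallBacks);

    CFDataRef xml = CFPropertyListCreateData(kCFAllocatorDefault, dict,
                                             kCFPropertyListXMLFormat_v1_0, 0, NULL);
    CFDataRef bin = CFPropertyListCreateData(kCFAllocatorDefault, dict,
                                             kCFPropertyListBinaryFormat_v1_0, 0, NULL);

    printf("XML plist:    %ld bytes\n", (long)CFDataGetLength(xml));
    printf("binary plist: %ld bytes\n", (long)CFDataGetLength(bin));

    CFRelease(xml); CFRelease(bin); CFRelease(dict); CFRelease(value);
    return 0;
}

Note that a date survives the round trip as a typed value, which plain JSON or an .ini file could not represent directly.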
Because NeXTSTEP used them, and Apple adopted them as well.
From the Property List Wikipedia page:
Under NeXTSTEP, property lists were designed to be human-readable and
edited by hand, serialized to ASCII in a syntax somewhat like a
programming language.
NeXTSTEP used one format to represent a property list, and the
subsequent GNUstep and Mac OS X frameworks introduced differing
formats.
While Mac OS X can also read the NeXTSTEP format, Apple sets it aside
in favor of two new formats of its own.
In Mac OS X 10.0, the NeXTSTEP
format was deprecated, and a new XML format was introduced, with a
public DTD defined by Apple. The XML format supports non-ASCII
characters and storing NSValue objects (which, unlike GNUstep's ASCII
property list format, Apple's ASCII property list format does not
support). Since XML files, however, are not the most space-efficient
means of storage, Mac OS X 10.2 introduced a new format where property
list files are stored as binary files. Starting with Mac OS X 10.4,
this is the default format for preference files.
I believe that was one of the things left over from the NeXTSTEP days... as for why they prefer to use it, it's probably because they can. ;-)

Win32 File Name Comparison

Does anyone know what culture settings Win32 uses when dealing with case-insensitive files names?
Is this something that varies based on the user's culture, or are the casing rules that Win32 uses culture invariant?
An approximate answer is at
Comparing Unicode file names the right way.
Basically, the recommendation is to uppercase both strings (using CharUpper, CharUpperBuff, or LCMapString), then compare using a binary comparison (i.e. memcmp or wmemcmp, not CompareString with an invariant locale). The file system doesn't do Unicode normalization, and the case rules are not dependent on locale settings.
There are unfortunate ambiguous cases when dealing with characters whose casing rules have changed across different versions of Unicode, but it's about as good as you can do.
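A minimal sketch of that recommendation (the helper function name is mine, not from the linked article):

/* Sketch: ordinal, case-insensitive comparison of two file names by
   uppercasing with the OS casing table, then comparing code units directly. */
#include <windows.h>
#include <wchar.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if the two names compare equal. */
static int CompareFileNamesOrdinalIgnoreCase(const wchar_t *a, const wchar_t *b)
{
    wchar_t ua[MAX_PATH], ub[MAX_PATH];

    wcsncpy(ua, a, MAX_PATH - 1); ua[MAX_PATH - 1] = L'\0';
    wcsncpy(ub, b, MAX_PATH - 1); ub[MAX_PATH - 1] = L'\0';

    /* CharUpperW uppercases in place using the system's casing rules,
       independently of the user's locale settings. */
    CharUpperW(ua);
    CharUpperW(ub);

    return wcscmp(ua, ub);   /* plain binary comparison of the uppercased strings */
}

int main(void) {
    printf("%s\n",
           CompareFileNamesOrdinalIgnoreCase(L"README.TXT", L"readme.txt") == 0
               ? "same name" : "different names");
    return 0;
}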
Comparing file names in native code and Don't compare filenames are a couple of good blog posts on this topic. The first has C/C++ code for OrdinalIgnoreCaseCompareStrings, and the second tells you how that doesn't always work for filenames and what to do to mitigate that.
Then there are the Unicode problems. While these new OrdinalIgnoreCase string comparison algorithms are great for your local NTFS drive, they might not yield the right answer on your FAT drive, or a network share.
So what's the answer? When possible, let the file system tell you. CreateFile can tell you whether a given filename exists; just pick the right creation disposition. If you need to compare two handles, you can often use GetFileInformationByHandle; look at dwVolumeSerialNumber/nFileIndexHigh/nFileIndexLow.
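For the "let the file system tell you" approach, here is a rough sketch (the paths are just examples) that decides whether two names refer to the same file by comparing volume serial number and file index:

/* Sketch: ask the file system whether two paths refer to the same file,
   instead of comparing the names textually. */
#include <windows.h>
#include <stdbool.h>
#include <stdio.h>

static bool SameFile(const wchar_t *pathA, const wchar_t *pathB)
{
    bool same = false;
    HANDLE a = CreateFileW(pathA, 0, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    HANDLE b = CreateFileW(pathB, 0, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);

    if (a != INVALID_HANDLE_VALUE && b != INVALID_HANDLE_VALUE) {
        BY_HANDLE_FILE_INFORMATION ia, ib;
        if (GetFileInformationByHandle(a, &ia) && GetFileInformationByHandle(b, &ib))
            same = ia.dwVolumeSerialNumber == ib.dwVolumeSerialNumber &&
                   ia.nFileIndexHigh       == ib.nFileIndexHigh       &&
                   ia.nFileIndexLow        == ib.nFileIndexLow;
    }
    if (a != INVALID_HANDLE_VALUE) CloseHandle(a);
    if (b != INVALID_HANDLE_VALUE) CloseHandle(b);
    return same;
}

int main(void) {
    /* The paths are examples only. */
    wprintf(L"%d\n", SameFile(L"C:\\Windows\\notepad.exe", L"c:\\WINDOWS\\NOTEPAD.EXE"));
    return 0;
}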
If you're using .NET, the official recommendation from Microsoft is to use StringComparison.OrdinalIgnoreCase for comparison and ToUpperInvariant for normalization (to be later compared using Ordinal comparison). This also applies to Registry keys and values, environment variables etc.
See New Recommendations for Using Strings in Microsoft .NET 2.0 for more details.
Note that while it's reliable on NTFS, it can fail with network shares, for example. See Steve Steiner's answer and the links in his post for solutions.
According to the Windows driver samples FastFAT and CDFS, the file system uses RtlUpcaseUnicodeString to convert strings to uppercase. According to a brief look in Ghidra, that in turn uses an internal function named NLS_UPCASE, whose behavior is based on your current system codepage.
