Colon (:) appears as forward slash (/) when creating file name - ruby

I am using date and time to label a new file that I'm creating, but when I view the file, the colon is a forward slash. I am developing on a Mac using 10.7+
Here is the code I'm using:
File.open("#{time.hour} : 00, #{time.month}-#{time.day}-#{time.year}", "a") do |mFile|
  mFile.syswrite("#{pKey} - #{tKey}: \n")
  mFile.syswrite("Items closed: #{itemsClosed} | Total items: #{totalItems} | Percent closed: % #{pClosed} \n")
  mFile.syswrite("\n")
  # No explicit close needed: File.open closes the file when the block ends.
end
Here is the output (assuming the time is 1pm):
13 / 00, 11-8-2012
Why is this happening and how can I fix it? I want the output to be:
13:00, 11-8-2012

Once upon a time, before Mac OS X, : was the directory separator instead of /. Apparently OS X 10.7 is still trying to fix up programs like that. I don't know how you can fix this, if you really need the : to be there. I'd omit it :-).
EDIT: After a bit more searching, I found that this USENIX paper describes what is going on. The rule they use is apparently this:
Another obvious problem is the different path separators between HFS+ (colon, ':') and UFS
(slash, '/'). This also means that HFS+ file names may contain the slash character and not
colons, while the opposite is true for UFS file names. This was easy to address, though it
involves transforming strings back and forth. The HFS+ implementation in the kernel's VFS
layer converts colon to slash and vice versa when reading from and writing to the on-disk
format. So on disk the separator is a colon, but at the VFS layer (and therefore anything
above it and the kernel, such as libc) it's a slash. However, the traditional Mac OS
toolkits expect colons, so above the BSD layer, the core Carbon toolkit does yet another
translation. The result is that Carbon applications see colons, and everyone else sees
slashes. This can create a user-visible schizophrenia in the rare cases of file names
containing colon characters, which appear to Carbon applications as slash characters, but
to BSD programs and Cocoa applications as colons.
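A minimal sketch of the "omit it" suggestion, assuming a period is an acceptable stand-in for the colon. The method name label_for is hypothetical; Time#strftime and its unpadded directives (%-H, %-m, %-d) are standard Ruby.

```ruby
# Build a file label from a Time without using ":", so there is no
# colon for the Finder to show as a slash.
# label_for is a hypothetical helper name, not part of the question's code.
def label_for(time)
  # %-H = 24-hour hour, %-m = month, %-d = day, all without zero padding
  time.strftime("%-H.00, %-m-%-d-%Y")
end

puts label_for(Time.new(2012, 11, 8, 13, 0, 0))
```

For 1pm on November 8, 2012 this gives the label 13.00, 11-8-2012.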

While OS X "is" a unix operating system, it also derives quite a bit of its code, APIs, standards, etc. from Mac OS 9. In unix, file paths have "/" separating the elements and ":" is allowed in the names of individual files and directories. In Mac OS 9, it was the other way around: file paths had ":" between elements and "/" was allowed in individual filenames. When Apple developed OS X, they wound up having to support some APIs that used unix-style file paths and some APIs that used OS 9-style paths, and both had to work on the same filesystem.
What they did is to swap delimiters and allowed characters depending on context. If you write (/run) a program that uses unix APIs to access the filesystem, you'll see files with colons in their names and slashes separating path elements. If you write (/run) a program that uses the old OS 9 APIs (or their derivatives), you'll see files with slashes in their names and colons separating path elements. See Apple's developer Q&A #1392 and notes on specifying paths in AppleScript for a bit more discussion.
(There are some other differences as well. A unix path is absolute if it starts with the delimiter ("/"), and absolute paths start at the top of the root volume. An OS 9 path is absolute if it doesn't start with a delimiter, and absolute OS 9 paths start with a volume name. Thus, the unix path "/tmp/foo:bar" is equivalent to the OS 9 path "Macintosh HD:tmp:foo/bar".)
So, which character is really in the filename, a slash or a colon? Well, a filename is a rather abstract thing, but if you're asking about the bytes that're actually stored on the disk... if it's on an HFS+ (aka Mac OS Extended) volume, it's being stored in a filesystem that was designed to work with the OS 9 (well, technically Mac OS 8.1) APIs, so it allows slashes but forbids colons, so on an HFS+ volume the file will "really" have a slash in the name. OTOH if you store the file on a unixish volume, it'll be stored using the unix convention, and "really" have a colon in the name. But the difference doesn't really matter unless you're reading raw bytes off the disk or writing a filesystem driver...
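As a rough illustration of the point that programs using the unix APIs see the colon, the sketch below creates a file with a colon in its name and reads the directory back through the same API. It assumes a unix-like system (OS X or Linux) and Ruby 2.5 or later for Dir.children; names_seen_by_unix_api is a hypothetical helper name.

```ruby
require "tmpdir"

# Create an empty file through the ordinary (unix-style) file API and
# return the names that the same API reports for its directory.
# Assumes a unix-like OS; the temporary directory is removed afterwards.
def names_seen_by_unix_api(filename)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, filename), "")
    Dir.children(dir)
  end
end

p names_seen_by_unix_api("13:00, 11-8-2012")
```

Ruby goes through the unix layer, so the name it reads back should still contain the colon, whatever the Finder chooses to display.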
Finally, why does the Finder display the controversial filename character as slash rather than colon? I'm pretty sure it's mostly inertia. The Finder isn't even entirely consistent about this, since if you use its Go To Folder option (Command-Shift-G) and type in "/Users/Shared", it treats that as a unix path. If you type in "Macintosh HD:Users:Shared", it has no idea what you're talking about. Furthermore, if you run touch /tmp/foo:bar, then try to get to it with Go To Folder:
Entering "/tmp/foo:bar" works.
Entering "/tmp/fo" then pressing tab autocompletes it to "/tmp/foo/bar/", which works.
Entering "/tmp/foo/bar/" fails, even though it's exactly the same as the autocomplete.
Entering "/tmp/foo" then pressing tab autocompletes to "/tmp/foo/", which cannot be autocompleted any further and doesn't work at all.
Update: as Konrad Rudolph pointed out, the Go To Folder behavior has changed as of El Capitan, and there's no longer any way to use it to get to folders containing the controversial character.

To avoid as many problems as possible when dealing with File names, paths, and various OSes, you really should take advantage of the built-in File methods, like join, dirname, basename, extname, and split. They try to avoid system dependencies and try to give you a programmatic way to generate valid filenames cross-platform.
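A short sketch of the File methods named above, using made-up path parts. The helper name describe_path is hypothetical; the File calls themselves are part of Ruby's core library.

```ruby
# Build a path and take it apart again with File's helpers instead of
# concatenating and splitting strings by hand.
# describe_path is a hypothetical name used only for this illustration.
def describe_path(*parts)
  path = File.join(*parts)
  {
    path: path,                      # parts joined with the separator
    dirname: File.dirname(path),     # everything before the last element
    basename: File.basename(path),   # the last element
    extname: File.extname(path),     # the extension, including the dot
    split: File.split(path)          # [dirname, basename]
  }
end

p describe_path("reports", "2012", "summary.txt")
```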
This problem was a lot worse back when Apple used the old Macintosh operating system. The move to Mac OS X helped, because it dropped : as the separator in unix-style paths. However, people who were building filenames manually found their code breaking because it generated the wrong delimiters, whereas code that took advantage of the libraries handled the change.
Because this particular problem isn't a bug, and is in Apple's control rather than Ruby's, I'd say it's not a Ruby problem at all. It's a visualization issue, and if you want the filename to resemble what the Finder displays, code accordingly.

Related

OS X - how to calculate normalized file name

I need to create a mapping between file names generated on Windows and OS X. I know that OS X "converts all file names to decomposed Unicode" however, "most volume formats do not follow the exact specification for these normal forms"
So, it does not seem a simple matter of converting the Windows name to NFD using a standard UTF8 API and being sure I have the correct OS X name. Is there a way to determine what the actual OS X file name will be without actually creating the file in the file system and then scanning the directory to see what was actually created?
I think the answer is this from TechNote 1150 HFS Plus Volume Format:
Note: The Mac OS Text Encoding Converter provides several constants
that let you convert to and from the canonical, decomposed form stored
on HFS Plus volumes. When using CreateTextEncoding to create a text
encoding, you should set the TextEncodingBase to
kTextEncodingUnicodeV2_0, set the TextEncodingVariant to
kUnicodeCanonicalDecompVariant, and set the TextEncodingFormat to
kUnicode16BitFormat. Using these values ensures that the Unicode will
be in the same form as on an HFS Plus volume, even as the Unicode
standard evolves.
You're probably looking for the -[NSString fileSystemRepresentation] method.
Note that there is no general solution for this task. What is a valid file name depends on filesystem of the volume you're saving on. Not every file name valid for HFS+ is valid for FAT32, for example.
For Mac's “standard” filesystem (currently HFS+), fileSystemRepresentation should give what you need; for other file systems, there is no general way. Think about ones that don't exist but will be introduced in the future, for example :)
According to your link, filesystem drivers appear to (mostly) follow one of two behaviours:
* Return all names in NFD, and convert names as appropriate.
* Don't perform any conversions.
In both these cases, if you create a file on OSX in NFD, reading it back on OSX should give you the name in NFD.
OTOH, if your filename goes from Windows → NFS → Mac and you want to do some sort of sync, you're out of luck. This is not an easy thing to do, since the underlying problem is a little philosophical: Should filenames be byte strings or Unicode strings? I believe Unix traditionally does the former, and at least in Linux, UTF-8 NFC names are merely a convention.
(It gets worse, since IIRC HFS+ is defined to use Unicode 3.something, so a naïve conversion to NFD might be wrong for characters added/changed since then unless the API you use can guarantee a specific Unicode version.)
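For reference, this is roughly what a "naïve conversion to NFD" looks like in Ruby (String#unicode_normalize, Ruby 2.2 or later). It uses whatever Unicode version the Ruby build ships with, so per the caveats above it is not guaranteed to match what an HFS+ volume stores. The helper name nfd_name is hypothetical.

```ruby
# Convert a name to standard Unicode NFD (canonical decomposition).
# This is the standard normalization form only; it may differ from the
# variant a particular volume format actually uses.
def nfd_name(name)
  name.unicode_normalize(:nfd)
end

composed = "caf\u00E9.txt"        # "é" stored as the single code point U+00E9
decomposed = nfd_name(composed)   # "e" followed by U+0301 (combining acute)

p composed.length
p decomposed.length
p composed == decomposed
```

The two strings look identical when printed but differ in length and compare as unequal, which is why a mapping between Windows and OS X names cannot rely on plain string comparison.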

"Safe" File naming

Are there any "unsafe" file names that can be encountered in Windows, Mac OS, Linux, etc?
For example:
New Video 2012-External Room
GED Practice Sheet
RgRrE-re-_d Da-
I've heard that even naming files with spaces, underscores, capital letters, and dashes could be potentially problematic, even though Windows doesn't include them in its list of forbidden characters. Is this true? I vaguely recall seeing programs that don't distinguish between uppercase and lowercase characters, and I know that URLs percent-encode unsafe ASCII characters (for example, a space becomes %20).
Both Unix-like (including Linux and Mac OS) and Windows should have no problem with underscores. Spaces should also generally be fine, but you occasionally find buggy code that can't handle them.
For Windows, it's not that capitals are problematic. It's that Windows filesystems are case-insensitive, so in some cases when interoperating (e.g. with a git repo which is case sensitive) you can end up with problems (e.g. the repo ends up with duplicates with different capitalization).
I'm not sure about -. One reason to avoid it is that - has special meaning for many command-line programs (e.g. rm -r). So you have to use annoying syntax like .\-r. I would also generally avoid more exotic ones like %.
It depends strongly on context of use. Certain non-forbidden characters can cause problems for certain programs, though the vast majority of applications which use standard system APIs should not encounter any issues.
Some programs (especially command-line tools) can be sensitive to the presence of spaces in the filename. Others may use only ASCII internally, and thus be incapable of handling filenames containing characters outside of basic ASCII. (Most modern OSes, by and large, will accept almost any Unicode character in a filename).
Some tools might require certain characters to be escaped (e.g. % in batch scripts), while others may not like having quotes in the filename.
Finally, a note on upper/lowercase: most Windows filesystems are case-preserving but otherwise case-insensitive, so upper/lowercase differences usually don't matter.
But, note that in almost every case, the files can still be used even if some workaround is needed to make them work.
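If a deliberately conservative name is wanted, one possible sketch is below. The whitelist (ASCII letters, digits, dot, underscore, dash) and the handling of a leading dash are assumptions made for illustration, not a rule from any of the operating systems discussed; safe_filename is a hypothetical helper.

```ruby
# Reduce a name to a conservative character set.
# Assumption: anything outside A-Z, a-z, 0-9, ".", "_" and "-" is
# replaced with "_", and a leading "-" is prefixed with "_" so that
# command-line tools do not mistake the name for an option.
def safe_filename(name)
  cleaned = name.gsub(/[^A-Za-z0-9._-]/, "_")
  cleaned = "_#{cleaned}" if cleaned.start_with?("-")
  cleaned
end

puts safe_filename("New Video 2012-External Room")
puts safe_filename("-r")
```

Note that this does nothing about case-insensitive filesystems: two names that differ only in capitalization still collide there.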

Delimiter for meta data in Windows file name

I'm working on maintenance of an application that transfers a file to another system and uses a structured filename to include meta data including a language code. The current app uses a two character language code and a dash/hyphen for a delimiter.
Ex. Canada-EN-ProdName-ProdCode.txt
I'm converting it to use IETF language code and so the dash delimiter won't do and need a replacement. I'm trying to determine a delimiter to avoid future errors and am considering the tilde ~.
Ex. Canada~en-GB~ProdName~ProdCode.txt
This will be used only on Windows Server 2003+ systems. I certainly didn't come up with this system of parsing a filename to get meta data. Unfortunately, I can't include this in the file itself, and the destination system is expecting the language code to be in IETF format with the dash.
Any thoughts on potential issues with using the tilde in the filename, or perhaps a better character to use? I'm just looking for a second opinion in case I'm overlooking a possible failure. I believe Windows will use the tilde when shortening a long filename to 8.3 format, but I don't see that as an issue here, as the OSs can handle long filenames.
The tilde is probably fine, but what's wrong with the good old underscore _ ? It has no special meaning on either windows or unix, and makes names that are relatively easy to read. If there are no other special considerations, I would avoid the tilde solely out of paranoia, since windows does use it as a special character sometimes, as you mentioned.
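A small sketch of parsing such a structured name, with the delimiter left as a parameter so that either "_" or "~" can be tried. The method name parse_metadata and the field names are hypothetical, and the example assumes that none of the fields contains the delimiter itself.

```ruby
# Split a structured file name into its metadata fields.
# parse_metadata and the field names are made up for this illustration;
# the sketch assumes exactly four fields, none containing the delimiter.
def parse_metadata(filename, delimiter)
  # File.basename(name, ".*") drops the extension before splitting
  region, language, product, code = File.basename(filename, ".*").split(delimiter)
  { region: region, language: language, product: product, code: code }
end

p parse_metadata("Canada_en-GB_ProdName_ProdCode.txt", "_")
p parse_metadata("Canada~en-GB~ProdName~ProdCode.txt", "~")
```

Because the IETF language tag uses a dash internally, any delimiter other than the dash keeps en-GB in one piece.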
For anyone reading this question, I would strongly recommend anything but the tilde in the file name, or at least careful testing for speed problems with any .NET path work where a tilde exists.
I used this as a file name delimiter some time ago. I couldn't understand why simply getting a list of files from the folders was taking so long. It was a number of years later (having written a lot of speed-up code that had marginal advantage) that I discovered there is a problem (with DirectoryInfo(path).Name in .NET at least) where the simple existence of the tilde was forcing the underlying code to jump through a lot of hoops.
The slow down was substantial (it was over a network so I had thought it was bandwidth/Network issues for a fair while)
I understand this is a legacy overhang for when alternative short versions of filenames could be used for Windows files.
I am now stuck with the tilde in these file names but, given that the problem lay in some of the .NET path functions (I don't actually know if it still does), I could work around it by spotting a tilde and creating my own answers when it existed rather than passing it through.
If in any doubt just run speed tests with and without the tilde in filenames for say just 500-1,000 files.

Creating files in a *nix environment with an email address as its name

PLEASE don't tell me why you think it's a bad idea. Just tell me if it's a workable idea.
I want to create files in a folder with names like the following:
asdf#qwerty.com.eml
abc+def#asdf.net.eml
abc_def#sasdf.at.eml
Is there some fundamental incompatibility in the characters allowed in email addresses and those allowed by a unix system?
I will be having a bash script reading the file names, subtracting the ".eml" ending, converting it into the "correct" usable format and sending an email to the address.
A simple test showed that it saved the above as files called
asdf\#qwerty.com.eml
abc+def\#asdf.net.eml
abc_def\#sasdf.at.eml
The only characters not allowed in a *nix filename are \0 and /, neither of which is allowed in an email address anyways. How your shell may handle symbols is another matter.
There are no characters disallowed in UNIX file names except / (directory separator) and ASCII 0 (string terminator), so there is no problem at a fundamental level.
Handling those file names in shell scripts is a different matter; it requires at least quoting every variable reference as "$FILENAME", and forgetting even one quotation will create a very rare, insidious bug. (Also, many other utilities will fail on strange characters such as | or newline unless you consistently use the -0 option.)
So yes, technically your bad idea is workable :-)
Short answer:
przemek#linux-634b:~/tmp/email> touch john.smith#example.com
przemek#linux-634b:~/tmp/email> ls
john.smith#example.com
Works perfectly;)
Long answer:
It depends on filesystem you're using. See Wikipedia entry which lists allowed characters in file names. Most UNIX file systems support all characters that can be included in e-mail addresses. Non-UNIX filesystems, such as FAT, however, may cause problems.
Note that your problems may come from improper escaping. Check how you are creating your files.
What was your "simple test"?
Typing abc and hitting tab?
The bash autocompletion will add a \ before every special character.
But this does not mean it is stored with a \ in its name.
Use ls to see the true name.
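The difference between the stored name and its shell-escaped form can also be checked from Ruby. The sketch below assumes a unix-like system and Ruby 2.5 or later (for Dir.children); stored_and_escaped is a hypothetical helper, while Shellwords.escape is from Ruby's standard library.

```ruby
require "shellwords"
require "tmpdir"
require "fileutils"

# Create empty files with the given names in a temporary directory and
# return two lists: the names as stored on disk, and the same names
# escaped the way a Bourne-style shell would need them typed.
def stored_and_escaped(names)
  Dir.mktmpdir do |dir|
    names.each { |name| FileUtils.touch(File.join(dir, name)) }
    on_disk = Dir.children(dir).sort
    [on_disk, on_disk.map { |name| Shellwords.escape(name) }]
  end
end

stored, escaped = stored_and_escaped(["asdf#qwerty.com.eml", "abc_def#sasdf.at.eml"])
puts stored
puts escaped
```

The stored names contain no backslash; it only appears in the escaped form, which is what tab completion shows.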
There is no problem with such file names on systems which treat file names as blobs and allow all byte sequences, i.e. Linux. But they are not portable to systems which treat file names as Unicode strings and disallow certain characters (Windows) or transform file names (Mac OS X, canonical decomposition).

Are there any invalid linux filenames?

If I wanted to create a string which is guaranteed not to represent a filename, I could put one of the following characters in it on Windows:
\ / : * ? | < >
e.g.
this-is-a-filename.png
?this-is-not.png
Is there any way to identify a string as 'not possibly a file' on Linux?
There are almost no restrictions - apart from '/' and '\0', you're allowed to use anything. However, some people think it's not a good idea to allow this much flexibility.
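The rule above, applied to a single file name (one path component, not a whole path), can be sketched as follows. The method name possible_linux_filename? is hypothetical, and the check deliberately ignores length limits and the reserved names "." and "..".

```ruby
# True if the string could be a single Linux file name: it must be
# non-empty and contain neither "/" nor the NUL byte.
# Length limits and the special names "." and ".." are not checked here.
def possible_linux_filename?(name)
  !name.empty? && !name.include?("/") && !name.include?("\0")
end

p possible_linux_filename?("this-is-a-filename.png")
p possible_linux_filename?("?this-is-not.png")
p possible_linux_filename?("a/b")
```

The second call returns true: the string that cannot be a file name on Windows is a perfectly possible one on Linux.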
An empty string is the only truly invalid path name on Linux, which may work for you if you need only one invalid name. You could also use a string like "///foo", which would not be a canonical path name, although it could refer to a file ("/foo"). Another possibility would be something like "/dev/null/foo", since /dev/null has a POSIX-defined non-directory meaning. If you only need strings that could not refer to a regular file you could use "/" or ".", since those are always directories.
Technically it's not invalid, but files with a dash (-) at the beginning of their name will cause you a lot of trouble, because such names conflict with command-line arguments.
I personally find that a lot of the time the problem is not Linux but the applications one is using on Linux.
Take for example Amarok. Recently I noticed that certain artists I had copied from my Windows machine were not appearing in the library. I checked and confirmed that the files were there, and then I noticed that certain characters in the folder names (named for the artist) were represented with a weird-looking square rather than an actual character.
In a shell terminal the filenames look even stranger: /Music/Albums/Einst$'\374'rzende\ Neubauten is an example of how strange.
While these files were definitely there, Amarok could not see them for some reason. I was able to use some shell trickery to rename them to sane versions which I could then re-name with ASCII-only characters using Musicbrainz Picard. Unfortunately, Picard was also unable to open the files until I renamed them, hence the need for a shell script.
Overall this is a tricky area, and it seems to get very thorny if you are trying to synchronise a music collection between Windows and Linux wherein certain folder or file names contain funky characters.
The safest thing to do is stick to ASCII-only filenames.

Resources