"Safe" File naming - windows

Are there any "unsafe" file names that can be encountered in Windows, Mac OS, Linux, etc?
For example:
New Video 2012-External Room
GED Practice Sheet
RgRrE-re-_d Da-
I've heard that even naming files with spaces, underscores, capital letters, and dashes could be potentially problematic, even though Windows doesn't include them in its list of forbidden characters. Is this true? I vaguely recall seeing programs that don't distinguish between uppercase and lowercase characters, and I know that URLs percent-encode unsafe characters (for example, a space becomes %20).

Both Unix-like (including Linux and Mac OS) and Windows should have no problem with underscores. Spaces should also generally be fine, but you occasionally find buggy code that can't handle them.
For Windows, it's not that capitals are problematic. It's that Windows filesystems are case-insensitive, so in some cases when interoperating (e.g. with a git repo which is case sensitive) you can end up with problems (e.g. the repo ends up with duplicates with different capitalization).
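To make the duplicate-capitalization problem concrete, here is a minimal Ruby sketch that finds names which would collide on a case-insensitive filesystem. The helper name case_collisions is made up for illustration, and comparing with downcase is only an approximation of the case folding Windows actually applies.

```ruby
# Group names that differ only by letter case. On a case-insensitive
# filesystem every such group would collapse into a single file.
# (downcase is an approximation of the filesystem's own case folding.)
def case_collisions(names)
  names.group_by(&:downcase)
       .values
       .select { |group| group.uniq.size > 1 }
end

p case_collisions(["README.md", "Readme.md", "main.c", "Main.C", "notes.txt"])
# prints [["README.md", "Readme.md"], ["main.c", "Main.C"]]
```

Running a check like this over a repository's file list before checking it out on Windows is one way to spot trouble early.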
I'm not sure about -. One reason to avoid it is that a leading - has special meaning for many command-line programs (e.g. rm -r), so to refer to a file named -r you have to use annoying syntax like ./-r (or .\-r on Windows). I would also generally avoid more exotic characters like %.
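As a rough illustration of that workaround, the sketch below prefixes a name with "./" when it starts with a dash, so a command-line tool sees a path rather than an option. The helper name option_safe is hypothetical. Many Unix tools also accept -- to mark the end of options, which is another way to get the same effect.

```ruby
# Prefix relative names that begin with "-" with "./" so that
# command-line tools treat them as paths, not as options.
def option_safe(path)
  path.start_with?("-") ? File.join(".", path) : path
end

puts option_safe("-r")          # prints "./-r"
puts option_safe("notes.txt")   # prints "notes.txt"
```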

It depends strongly on context of use. Certain non-forbidden characters can cause problems for certain programs, though the vast majority of applications which use standard system APIs should not encounter any issues.
Some programs (especially command-line tools) can be sensitive to the presence of spaces in the filename. Others may use only ASCII internally, and thus be incapable of handling filenames containing characters outside of basic ASCII. (Most modern OSes, by and large, will accept almost any Unicode character in a filename).
Some tools might require certain characters to be escaped (e.g. % in batch scripts), while others may not like having quotes in the filename.
Finally, a note on upper/lowercase: most Windows filesystems are case-preserving but otherwise case-insensitive, so upper/lowercase differences usually don't matter.
But, note that in almost every case, the files can still be used even if some workaround is needed to make them work.
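For the Windows side of the question, a conservative check can be sketched in Ruby. The forbidden characters and reserved device names below follow Microsoft's documented naming rules. The constant and method names are invented for this example, and the check says nothing about what individual programs tolerate.

```ruby
# Characters Windows rejects in a file name, plus ASCII control characters.
WINDOWS_FORBIDDEN = /[<>:"\/\\|?*\x00-\x1F]/
# Device names Windows reserves, with or without an extension.
WINDOWS_RESERVED = /\A(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?\z/i

def windows_safe_name?(name)
  return false if name.empty?
  return false if name =~ WINDOWS_FORBIDDEN
  return false if name =~ WINDOWS_RESERVED
  return false if name.end_with?(" ", ".")   # Windows drops a trailing space or dot
  true
end

puts windows_safe_name?("New Video 2012-External Room")   # prints "true"
puts windows_safe_name?("RgRrE-re-_d Da-")                # prints "true"
puts windows_safe_name?("13:00, 11-8-2012")               # prints "false"
puts windows_safe_name?("nul.txt")                        # prints "false"
```

All three names from the question pass, which matches the answers above: spaces, capitals, underscores, and dashes are legal, and any trouble comes from individual tools rather than from the filesystem.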

Related

Colon (:) appears as forward slash (/) when creating file name

I am using date and time to label a new file that I'm creating, but when I view the file, the colon is a forward slash. I am developing on a Mac using 10.7+
Here is the code I'm using:
File.open("#{time.hour} : 00, #{time.month}-#{time.day}-#{time.year}", "a") do |mFile|
  mFile.syswrite("#{pKey} - #{tKey}: \n")
  mFile.syswrite("Items closed: #{itemsClosed} | Total items: #{totalItems} | Percent closed: % #{pClosed} \n")
  mFile.syswrite("\n")
  mFile.close
end
Here is the output (assuming the time is 1pm):
13 / 00, 11-8-2012
Why is this happening and how can I fix it? I want the output to be:
13:00, 11-8-2012
Once upon a time, before Mac OS X, : was the directory separator instead of /. Apparently OS X 10.7 is still trying to fix up programs like that. I don't know how you can fix this, if you really need the : to be there. I'd omit it :-).
EDIT: After a bit more searching this USENIX paper describes what is going on. The rule they use apparently is this:
Another obvious problem is the different path separators between HFS+ (colon, ':') and UFS
(slash, '/'). This also means that HFS+ file names may contain the slash character and not
colons, while the opposite is true for UFS file names. This was easy to address, though it
involves transforming strings back and forth. The HFS+ implementation in the kernel's VFS
layer converts colon to slash and vice versa when reading from and writing to the on-disk
format. So on disk the separator is a colon, but at the VFS layer (and therefore anything
above it and the kernel, such as libc) it's a slash. However, the traditional Mac OS
toolkits expect colons, so above the BSD layer, the core Carbon toolkit does yet another
translation. The result is that Carbon applications see colons, and everyone else sees
slashes. This can create a user-visible schizophrenia in the rare cases of file names
containing colon characters, which appear to Carbon applications as slash characters, but
to BSD programs and Cocoa applications as colons.
While OS X "is" a unix operating system, it also derives quite a bit of its code, APIs, standards, etc from Mac OS 9. In unix, file paths have "/" separating the elements and ":" is allowed in the names of individual files and directories. In Mac OS 9, it was the other way around: file paths had ":" between elements and "/" was allowed in individual filenames. When Apple developed OS X, they wound up having to support some APIs that used unix-style file paths, and some APIs that used OS 9-style paths, and they had to both be able to work on the same filesystem.
What they did is to swap delimiters and allowed characters depending on context. If you write (/run) a program that uses unix APIs to access the filesystem, you'll see files with colons in their names and slashes separating path elements. If you write (/run) a program that uses the old OS 9 APIs (or their derivatives), you'll see files with slashes in their names and colons separating path elements. See Apple's developer Q&A #1392 and notes on specifying paths in AppleScript for a bit more discussion.
(There are some other differences as well. A unix path is absolute if it starts with the delimiter ("/"), and absolute paths start at the top of the root volume. An OS 9 path is absolute if it doesn't start with a delimiter, and absolute OS 9 paths start with a volume name. Thus, the unix path "/tmp/foo:bar" is equivalent to the OS 9 path "Macintosh HD:tmp:foo/bar".)
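The delimiter swap in that example can be sketched in a few lines of Ruby. This is only an illustration of the mapping described above, not Apple's implementation. The method name unix_to_os9 is made up, the default volume name is simply the one from the example, and edge cases such as paths on other volumes are ignored.

```ruby
# Convert an absolute unix path on the root volume to the equivalent
# OS 9-style path. `volume` is the root volume's name; the default is
# only the name used in the example above.
def unix_to_os9(path, volume = "Macintosh HD")
  raise ArgumentError, "expected an absolute unix path" unless path.start_with?("/")
  parts = path.split("/").reject(&:empty?)
  # Inside each component, a unix colon corresponds to an OS 9 slash.
  ([volume] + parts.map { |part| part.tr(":", "/") }).join(":")
end

puts unix_to_os9("/tmp/foo:bar")   # prints "Macintosh HD:tmp:foo/bar"
```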
So, which character is really in the filename, a slash or a colon? Well, a filename is a rather abstract thing, but if you're asking about the bytes that're actually stored on the disk... if it's on an HFS+ (aka Mac OS Extended) volume, it's being stored in a filesystem that was designed to work with the OS 9 (well, technically Mac OS 8.1) APIs, so it allows slashes but forbids colons, so on an HFS+ volume the file will "really" have a slash in the name. OTOH if you store the file on a unixish volume, it'll be stored using the unix convention, and "really" have a colon in the name. But the difference doesn't really matter unless you're reading raw bytes off the disk or writing a filesystem driver...
Finally, why does the Finder display the controversial filename character as slash rather than colon? I'm pretty sure it's mostly inertia. The Finder isn't even entirely consistent about this, since if you use its Go To Folder option (Command-Shift-G) and type in "/Users/Shared", it treats that as a unix path. If you type in "Macintosh HD:Users:Shared", it has no idea what you're talking about. Furthermore, if you run touch /tmp/foo:bar, then try to get to it with Go To Folder:
Entering "/tmp/foo:bar" works.
Entering "/tmp/fo" then pressing tab autocompletes it to "/tmp/foo/bar/", which works.
Entering "/tmp/foo/bar/" fails, even though it's exactly the same as the autocomplete.
Entering "/tmp/foo" then pressing tab autocompletes to "/tmp/foo/", which cannot be autocompleted any further and doesn't work at all.
Update: as Konrad Rudolph pointed out, the Go To Folder behavior has changed as of El Capitan, and there's no longer any way to use it to get to folders containing the controversial character.
To avoid as many problems as possible when dealing with File names, paths, and various OSes, you really should take advantage of the built-in File methods, like join, dirname, basename, extname, and split. They try to avoid system dependencies and try to give you a programmatic way to generate valid filenames cross-platform.
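A short sketch of those File methods in use follows. The path components are arbitrary examples.

```ruby
# Build a path from components instead of concatenating strings by hand.
path = File.join("reports", "2012", "summary.tar.gz")

puts path                        # prints "reports/2012/summary.tar.gz"
puts File.dirname(path)          # prints "reports/2012"
puts File.basename(path)         # prints "summary.tar.gz"
puts File.extname(path)          # prints ".gz"
puts File.basename(path, ".*")   # prints "summary.tar"
p File.split(path)               # prints ["reports/2012", "summary.tar.gz"]
```

Because the separator never appears in the calling code, the same script keeps working if the path conventions differ on another platform.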
This problem was a lot worse back when Apple used the old Macintosh operating system. The move to Mac OS X helped, because they dropped : as a separator. However, people who were building filenames manually found their code breaking because it generated the wrong delimiters, whereas code that took advantage of the libraries handled the problem.
Because this particular problem isn't a bug, and it's in Apple's control rather than Ruby's, I'd say it's not a Ruby problem at all. It's a visualization issue, and if you want the filename to resemble what the Finder displays, code accordingly.
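One way to follow the earlier suggestion of omitting the colon is to build the name with Time#strftime and a separator that neither the Finder nor the unix layer treats specially. This is a sketch, not the asker's code: the method name log_name and the choice of "-" between hours and minutes are assumptions.

```ruby
# Build the log file name from a Time without using ":" or "/".
# "%-m" and "%-d" print the month and day without zero padding.
def log_name(time)
  time.strftime("%H-%M, %-m-%-d-%Y")
end

puts log_name(Time.new(2012, 11, 8, 13, 0, 0))   # prints "13-00, 11-8-2012"
```

The result looks the same in the Finder and in a terminal, which sidesteps the colon-versus-slash display question entirely.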

How many valid utf8 characters are there?

I know that this is a little vague, so for context, think of it as "a character you could tweet," or something like that. My question is how many valid unicode characters are there that a browser or a service that supports utf8 could resolve, in such a way that a utf8 browser could copy and paste it around without any issues.
I guess what I don't want is the full character space, because I know a lot of it is reserved for command characters or reserved characters that wouldn't be shown (unless I'm super wrong!).
UTF-8 isn't the important factor, since all of the standard Unicode encodings (UTF-8, UTF-16, UTF-32) encode the same character space, just in different ways.
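To illustrate that point, this small Ruby sketch encodes one character three ways. The bytes differ, but the code point read back from each encoding is the same. The euro sign is an arbitrary choice.

```ruby
euro = "\u20AC"   # EURO SIGN, a single code point

%w[UTF-8 UTF-16BE UTF-32BE].each do |name|
  encoded = euro.encode(name)
  hex = encoded.bytes.map { |b| format("%02X", b) }.join(" ")
  puts "#{name}: #{hex} (code point U+#{format('%04X', encoded.codepoints.first)})"
end
# prints:
# UTF-8: E2 82 AC (code point U+20AC)
# UTF-16BE: 20 AC (code point U+20AC)
# UTF-32BE: 00 00 20 AC (code point U+20AC)
```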
From your explanation I see you don't just want the 1,112,064 valid Unicode code points?
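For reference, the 1,112,064 figure is the size of the whole code space minus the surrogate range, which is reserved for UTF-16 and can never be a character. The arithmetic, with constant names chosen for this example:

```ruby
CODE_SPACE = 0x10FFFF + 1          # U+0000 through U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800 through U+DFFF

puts CODE_SPACE                # prints 1114112
puts SURROGATES                # prints 2048
puts CODE_SPACE - SURROGATES   # prints 1112064
```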
Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, but a handful of those are what you're calling "control characters". Which ones do or don't fall into that category depends on how you're counting. Copying and pasting may result in some characters being treated as identical to one another, or ignored altogether, depending on the OS and the programs doing the copying and pasting.
However because Unicode is forward compatible, some systems will correctly preserve characters which haven't yet been assigned. After all, just because you're running Windows XP and you copy and paste a document with characters that weren't standardised until 2009 doesn't mean you expect them to vanish. There could be a million or so extra possible characters by this way of thinking, although their visual appearance may be indistinguishable in some places.

Right single apostrophe vs. apostrophe?

Right single quotation mark (U+2019)
vs.
Apostrophe (U+0027)
What is the difference between these two characters?
I ran into this issue where I use CAtlString to load a string from a resource file, and on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations. The U+2019 character appears in strings in my resource file that I copied from Word, and U+0027 appears in strings that I hand coded. Why does LoadString (sometimes) choke on this?
What is the difference between these two characters?
Arguable!
Going by the names, one would imagine that the curly ‹’› is only for use as a quotation mark, and that the straight ‹'› is only for use as a real apostrophe, an indicator of omitted letters.
However traditional typesetting practice in English is always to use a curly ‹’› to render an apostrophe. Personally—and I may be alone here—I don't like this. It can make for more ambiguous reading:
“He said, ‘It’s fish ’n’ chips’...”
with the apostrophes being straight it's (marginally) clearer where the quotation ends:
“He said, ‘It's fish 'n' chips’...”
and the apostrophe being ‘straight’ makes more sense to me because its purpose of indicating omitted letters has no inherent directionality, whereas quotation marks are clearly asymmetrical in purpose.
In traditional ASCII, of course, there are no smart quotes, so the apostrophe is always used for both...
on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations.
Here you are meeting the horror of the ‘ANSI’ code page. This is a default character encoding that differs between Windows install locales. So reading a resource on a machine in the Western region can give a different result from reading the same resource on a Japanese Windows.
It is highly unfortunate that Windows has varying default code pages instead of using a single global encoding like UTF-8, but it's too late to fix now. If you compile your whole application as a Unicode app (so you'll be using LoadStringW rather than LoadStringA) then you can cope with non-ASCII characters like the smart quotes much better.
If you can't move to a Unicode application you're a bit stuck. You won't be able to handle non-ASCII characters like the smart quotes globally, so stick with ASCII characters like the straight apostrophe ‹'› alone.
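If sticking to ASCII is the route taken, the text pasted from Word can be cleaned before it goes into the resource file. The following Ruby sketch maps the four common smart quotes to their straight equivalents. The method name straighten_quotes is made up, and other non-ASCII characters are left alone, so the final ascii_only? check is what actually confirms the string is safe.

```ruby
# Map the four common "smart" quotes to their ASCII counterparts.
def straighten_quotes(text)
  text.tr("\u2018\u2019\u201C\u201D", "''\"\"")
end

fixed = straighten_quotes("It\u2019s fish \u2018n\u2019 chips")
puts fixed               # prints: It's fish 'n' chips
puts fixed.ascii_only?   # prints: true
```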
The U+2019 character appears in strings in my resource file that I copied from Word
Yes, Word has an annoying AutoCorrect feature that replaces all apostrophes you type with smart quotes. This is especially undesirable when you are dealing with code, where ‹’› will break the program; but it's also wrong even for plain old English, as it's not possible to correctly guess the desired direction of the quote. (It'll get one of the apostrophes in “fish 'n' chips” the wrong way round, for example.)
I suggest turning off the automatic-replace-with-smart-quotes feature. If you want the smart quotes, it's better to type them deliberately. Unfortunately they are inconvenient to type on most keyboard layouts, often requiring obscure Alt+numpad sequences. Personally I use this one to drop them onto Alt+[] keys.
Historically, single-quote and double-quote come in pairs, left (open) and right (close).
For many years the character sets of computers were limited, having a single form of each.
Now, with the advent of Unicode, the full forms are available, but support for them is still limited. Programming languages still use the simple forms, and the full forms can still cause problems.

Delimiter for meta data in Windows file name

I'm working on maintenance of an application that transfers a file to another system and uses a structured filename to include meta data including a language code. The current app uses a two character language code and a dash/hyphen for a delimiter.
Ex. Canada-EN-ProdName-ProdCode.txt
I'm converting it to use IETF language codes, so the dash delimiter won't do and I need a replacement. I'm trying to pick a delimiter that avoids future errors and am considering the tilde ~.
Ex. Canada~en-GB~ProdName~ProdCode.txt
This will be used only on Windows Server 2003+ systems. I certainly didn't come up with this system of parsing a filename to get meta data. Unfortunately, I can't include this in the file itself and the destination system is expecting the language code to be in IETF format with the dash.
Any thoughts on potential issues with using the tilde in the filename, or perhaps a better character to use? I'm just looking for a second opinion in case I'm overlooking a possible failure. I believe Windows will use the tilde when shortening a long filename to 8.3 format, but I don't see that as an issue here as the OSs can handle long filenames.
The tilde is probably fine, but what's wrong with the good old underscore _ ? It has no special meaning on either windows or unix, and makes names that are relatively easy to read. If there are no other special considerations, I would avoid the tilde solely out of paranoia, since windows does use it as a special character sometimes, as you mentioned.
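With an underscore as the delimiter, parsing the structured name stays simple. Here is a minimal Ruby sketch, assuming four fields in the order shown in the question and assuming none of the fields contains an underscore itself. The method name parse_name and the hash keys are invented for illustration.

```ruby
# Split "Country_lang_ProdName_ProdCode.txt" into its four fields.
# Assumes the fields themselves never contain "_".
def parse_name(filename)
  country, language, product, code = File.basename(filename, ".*").split("_", 4)
  { country: country, language: language, product: product, code: code }
end

fields = parse_name("Canada_en-GB_ProdName_ProdCode.txt")
puts fields[:country]    # prints "Canada"
puts fields[:language]   # prints "en-GB"
```

The dash inside the IETF language code survives intact because it is no longer the field delimiter.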
For anyone reading this question, I would strongly recommend anything but the tilde in the file name, or at least careful testing for speed problems with any .NET path work where a tilde is present.
I used this as a file name delimiter some time ago. I couldn't understand why simply getting a list of files from the folders was taking so long. It was a number of years later (having written a lot of speed-up code that had marginal advantage) that I discovered there is a problem (with DirectoryInfo(path).Name in .NET at least) where the simple existence of the tilde was forcing the underlying code to jump through a lot of hoops.
The slowdown was substantial (it was over a network, so I had thought it was a bandwidth/network issue for a fair while).
I understand this is a legacy overhang for when alternative short versions of filenames could be used for Windows files.
I am now stuck with the tilde in these file names but, given that the problem lay in some of the .NET path functions (I don't actually know if it still does), I could work around it by spotting a tilde and creating my own answers when it existed rather than passing it through.
If in any doubt just run speed tests with and without the tilde in filenames for say just 500-1,000 files.

Creating files in a *nix environment with an email address as its name

PLEASE don't tell me why you think it's a bad idea. Just tell me if it's a workable idea.
I want to create files in a folder with names like the following:
asdf#qwerty.com.eml
abc+def#asdf.net.eml
abc_def#sasdf.at.eml
Is there some fundamental incompatibility in the characters allowed in email addresses and those allowed by a unix system?
I will be having a bash script reading the file names, subtracting the ".eml" ending, converting it into the "correct" usable format and sending an email to the address.
A simple test showed that it saved the above as files called
asdf\#qwerty.com.eml
abc+def\#asdf.net.eml
abc_def\#sasdf.at.eml
The only characters not allowed in a *nix filename are \0 and /, neither of which is allowed in an email address anyways. How your shell may handle symbols is another matter.
There are no characters disallowed in UNIX file names except / (directory separator) and ASCII 0 (string terminator), so there is no problem at a fundamental level.
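That rule is small enough to state as code. The Ruby sketch below checks a single name (not a path) against it. The method name valid_unix_name? is hypothetical, and the extra rejection of the empty string, "." and ".." is an addition for practicality, since those cannot be used as ordinary file names either.

```ruby
# A unix file name may contain any byte except "/" and NUL.
# The empty string, "." and ".." are also rejected here because they
# cannot name an ordinary file.
def valid_unix_name?(name)
  return false if name.empty? || name == "." || name == ".."
  !name.include?("/") && !name.include?("\0")
end

puts valid_unix_name?("abc+def#asdf.net.eml")   # prints "true"
puts valid_unix_name?("a/b")                    # prints "false"
```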
Handling those file names in shell scripts is a different matter; it requires at least quoting every variable reference as "$FILENAME", and forgetting even one quotation mark will create a very rare, insidious bug. (Also, many other utilities will fail on strange characters such as | or newline unless you consistently use NUL-delimited options such as -0, e.g. find -print0 piped to xargs -0.)
So yes, technically your bad idea is workable :-)
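The conversion step the question describes can be sketched as follows. The example is in Ruby rather than bash, which is a substitution on my part: it avoids the shell quoting pitfalls mentioned above but is not the asker's script. It assumes, as the file names in the question suggest, that "#" stands in for "@" and appears only once. The method name address_from_filename is made up.

```ruby
# Turn "abc+def#asdf.net.eml" into "abc+def@asdf.net".
# Assumes "#" stands in for "@" and occurs only in that position.
def address_from_filename(filename)
  File.basename(filename, ".eml").sub("#", "@")
end

%w[asdf#qwerty.com.eml abc+def#asdf.net.eml abc_def#sasdf.at.eml].each do |name|
  puts address_from_filename(name)
end
# prints:
# asdf@qwerty.com
# abc+def@asdf.net
# abc_def@sasdf.at
```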
Short answer:
przemek#linux-634b:~/tmp/email> touch john.smith#example.com
przemek#linux-634b:~/tmp/email> ls
john.smith#example.com
Works perfectly;)
Long answer:
It depends on filesystem you're using. See Wikipedia entry which lists allowed characters in file names. Most UNIX file systems support all characters that can be included in e-mail addresses. Non-UNIX filesystems, such as FAT, however, may cause problems.
Note that your problems may come from improper escaping. Check how you are creating your files.
What was your "simple test"?
Typing abc and hitting tab?
The bash autocompletion will add a \ before every special character.
But this does not mean it is stored with a \ in its name.
Use ls to see the true name.
There is no problem with such file names on systems which treat file names as blobs and allow all byte sequences, e.g. Linux. But they are not portable to systems which treat file names as Unicode strings and disallow certain characters (Windows) or transform file names (Mac OS X, canonical decomposition).
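The effect of canonical decomposition can be seen with Ruby's String#unicode_normalize. This sketch uses standard NFD, which is close to but not identical to what the Mac OS X filesystem applies, and the sample string is arbitrary. It only matters for names containing non-ASCII characters.

```ruby
composed   = "caf\u00E9"                       # "é" stored as one code point
decomposed = composed.unicode_normalize(:nfd)  # "e" followed by U+0301

puts composed == decomposed                          # prints "false"
puts composed.codepoints.length                      # prints 4
puts decomposed.codepoints.length                    # prints 5
puts composed == decomposed.unicode_normalize(:nfc)  # prints "true"
```

A script that compares a name it wrote against the name it reads back from the directory can therefore see a mismatch unless it normalizes both sides first.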

Resources