All the PHP files in my workspace are encoded in Unicode (UTF-8, no BOM). I often duplicate an existing source file to use as a base for a new script. Invariably (with Path Finder or the original Finder), OS X will convert the encoding of the duplicate file to Western (Mac OS Roman).
Is there any way to make OS X behave and not convert the text encoding when duplicating a text file? Or make it use a specific text encoding (other than Western!) by default for all files with .php extension?
I highly doubt that the "conversion" you're seeing is an actual difference in the two copies of the file. OS X doesn't store any sort of encoding value for a file's content, so it's much more likely that it's whatever tool you're using to view/edit your files that's interpreting the content differently, whether it be Xcode, TextWrangler, or whatever other editor you're using. Doing a straight copy of the file, whether it be from the Finder, Path Finder, or the command line, won't ever change the actual data in the file in the process of copying.
UTF-8 and Mac OS Roman are actually the same if you don't have any characters above 0x7f, so it's very possible that a valid UTF-8 file with no BOM could be interpreted as a Mac OS Roman file. In any case, I would look into whatever program it is that's showing you that encoding to find the solution, not the file copying procedure of OS X itself.
Please note that there is indeed an extended attribute in OS X that stores the file's encoding for text files produced through -[NSString writeToFile:...] or similar facilities, which may not be getting copied along with the file.
Are you using the Finder to copy your files, or some other tool (an IDE)?
It's possible your original file has some extended Finder metadata in its Resource Fork that's being lost in the copying, if you don't use Finder to copy it.
-W
Related
So... I was testing jGrasp and when i openned my testing file I saw something like this:
¿Khà ?
instead of this:
¿Khà?
but when i compile it, first i got the weird characters (the encoding was wrong). So i changed the encoding on the WorkSpace>Charset (the default, I/O and cygwin) to UTF-8 and got the correct output (like in the second image)... but it still looks the same on jGrasp.
If I change it on jGrasp so it looks "good", on other text editors will look diferent (and also in the compiler).
EDIT
I have found a few other encodings that work, but they aren't UTF-8, and also i don 't want to be changing every moment the encoding.
I'm not clear on exactly what the problem is, but if you need to open and/or edit a single file with a specific encoding different from the default, use "File" > "Open" and specify the charset on the dialog. The charset choice will be remembered.
I am using date and time to label a new file that I'm creating, but when I view the file, the colon is a forward slash. I am developing on a Mac using 10.7+
Here is the code I'm using:
File.open("#{time.hour} : 00, #{time.month}-#{time.day}-#{time.year}", "a") do |mFile|
mFile.syswrite("#{pKey} - #{tKey}: \n")
mFile.syswrite("Items closed: #{itemsClosed} | Total items: #{totalItems} | Percent closed: % #{pClosed} \n")
mFile.syswrite("\n")
mFile.close
end
Here is the output (assuming the time is 1pm):
13 / 00, 11-8-2012
Why is this happening and how can I fix it? I want the output to be:
13:00, 11-8-2012
Once upon a time, before Mac OS X, : was the directory separator instead of /. Apparently OS X 10.7 is still trying to fix up programs like that. I don't know how you can fix this, if you really need the : to be there. I'd omit it :-).
EDIT: After a bit more searching this USENIX paper describes what is going on. The rule they use apparently is this:
Another obvious problem is the different path separators between HFS+ (colon, ':') and UFS
(slash, '/'). This also means that HFS+ file names may contain the slash character and not
colons, while the opposite is true for UFS file names. This was easy to address, though it
involves transforming strings back and forth. The HFS+ implementation in the kernel's VFS
layer converts colon to slash and vice versa when reading from and writing to the on-disk
format. So on disk the separator is a colon, but at the VFS layer (and therefore anything
above it and the kernel, such as libc) it's a slash. However, the traditional Mac OS
toolkits expect colons, so above the BSD layer, the core Carbon toolkit does yet another
translation. The result is that Carbon applications see colons, and everyone else sees
slashes. This can create a user-visible schizophrenia in the rare cases of file names
containing colon characters, which appear to Carbon applications as slash characters, but
to BSD programs and Cocoa applications as colons.
While OS X "is" a unix operating system, it also derives quite a bit its code, APIs, standards, etc from Mac OS 9. In unix, file paths have "/" separating the elements and ":" is allowed in the names of individual files and directories. In Mac OS 9, it was the other way around: file paths had ":" between elements and "/" was allowed in individual filenames. When Apple developed OS X, they wound up having to support some APIs that used unix-style file paths, and some APIs that used OS 9-style paths, and they had to both be able to work on the same filesystem.
What they did is to swap delimiters and allowed characters depending on context. If you write (/run) a program that uses unix APIs to access the filesystem, you'll see files with colons in their names and slashes separating path elements. If you write (/run) a program that uses the old OS 9 APIs (or their derivatives), you'll see files with slashes in their names and colons separating path elements. See Apple's developer Q&A #1392 and notes on specifying paths in AppleScript for a bit more discussion.
(There are some other differences as well. A unix path is absolute if it starts with the delimiter ("/"), and absolute paths start at the top of the root volume. An OS 9 path is absolute if it doesn't start with a delimiter, and absolute OS 9 paths start with a volume name. Thus, the unix path "/tmp/foo:bar" is equivalent to the OS 9 path "Macintosh HD:tmp:foo/bar".)
So, which character is really in the filename, a slash or a colon? Well, a filename is a rather abstract thing, but if you're asking about the bytes that're actually stored on the disk... if it's on an HFS+ (aka Mac OS Extended) volume, it's being stored in a filesystem that was designed to work with the OS 9 (well, technically Mac OS 8.1) APIs, so it allows slashes but forbids colons, so on an HFS+ volume the file will "really" have a slash in the name. OTOH if you store the file on a unixish volume, it'll be stored using the unix convention, and "really" have a colon in the name. But the difference doesn't really matter unless you're reading raw bytes off the disk or writing a filesystem driver...
Finally, why does the Finder display the controversial filename character as slash rather than colon? I'm pretty sure it's mostly inertia. The Finder isn't even entirely consistent about this, since if you use its Go To Folder option (Command-Shift-G) and type in "/Users/Shared", it treats that as a unix path. If you type in "Macintosh HD:Users:Shared", it has no idea what you're talking about. Furthermore, if you run touch /tmp/foo:bar, then try to get to it with Go To Folder:
Entering "/tmp/foo:bar" works.
Entering "/tmp/fo" then pressing tab autocompletes it to "/tmp/foo/bar/", which works.
Entering "/tmp/foo/bar/" fails, even though it's exactly the same as the autocomplete.
Entering "/tmp/foo" then pressing tab autocompletes to "/tmp/foo/", which cannot be autocompleted any further and doesn't work at all.
Update: as Konrad Rudolph pointed out, the Go To Folder behavior has changed as of El Capitan, and I there's no longer any way to use it to get to folders containing the controversial character.
To avoid as many problems as possible when dealing with File names, paths, and various OSes, you really should take advantage of the built-in File methods, like join, dirname, basename, extname, and split. They try to avoid system dependencies and try to give you a programmatic way to generate valid filenames cross-platform.
This problem was a lot worse back when Apple used the old Macintosh operating system. The move to Mac OS helped, because they dropped using : as a separator, however those people who were manually building filenames found code breaking because it generated the wrong delimiters, whereas taking advantage of the libraries handled the problem.
Because this particular problem isn't a bug, nor is it in Ruby's control but Apple's, I'd say it's not a Ruby problem at all, it's a visualization issue, and if you want the filename to resemble what the Finder displays code accordingly.
I need to create a mapping between file names generated on Windows and OS X. I know that OS X "converts all file names to decomposed Unicode" however, "most volume formats do not follow the exact specification for these normal forms"
So, it does not seem a simple matter of converting the Windows name to NFD using a standard UTF8 API and being sure I have the correct OS X name. Is there a way to determine what the actual OS X file name will be without actually creating the file in the file system and then scanning the directory to see what was actually created?
I think the answer is this from TechNote 1150 HFS Plus Volume Format:
Note: The Mac OS Text Encoding Converter provides several constants
that let you convert to and from the canonical, decomposed form stored
on HFS Plus volumes. When using CreateTextEncoding to create a text
encoding, you should set the TextEncodingBase to
kTextEncodingUnicodeV2_0, set the TextEncodingVariant to
kUnicodeCanonicalDecompVariant, and set the TextEncodingFormat to
kUnicode16BitFormat. Using these values ensures that the Unicode will
be in the same form as on an HFS Plus volume, even as the Unicode
standard evolves.
You're probably looking for -[NSString fileSystemRepresentation] method.
Note that there is no general solution for this task. What is a valid file name depends on filesystem of the volume you're saving on. Not every file name valid for HFS+ is valid for FAT32, for example.
For Mac's “standard” filesystem (currently HFS+), fileSystemRepresentation should give what you need; for other file systems, there is no general way. Think about ones that don't exist but will be introduced in the future, for example :)
According to your link, filesystem drivers appear to (mostly) follow one of two behaviours:
* Return all names in NFD, and convert names as appropriate.
* Don't perform any conversions.
In both these cases, if you create a file on OSX in NFD, reading it back on OSX should give you the name in NFD.
OTOH, if your filename goes from Windows → NFS → Mac and you want to do some sort of sync, you're out of luck. This is not an easy thing to do, since the underlying problem is a little philosophical: Should filenames be byte strings or Unicode strings? I believe Unix traditionally does the former, and at least in Linux, UTF-8 NFC names are merely a convention.
(It gets worse, since IIRC HFS+ is defined to use Unicode 3.something, so a naïve conversion to NFD might be wrong for characters added/changed since then unless the API you use can guarantee a specific Unicode version.)
We're creating an app that is going to generate some text files on *nix systems with hashed filenames to avoid too-long filenames.
However, it would be nice to tag the files with some metadata that gives a better clue as to what their content is.
Hence my question. Does anyone have any experience with creating files with custom metadata in Ruby?
I've done some searching and there seem to be some (very old) gems that read metadata:
https://github.com/kig/metadata
http://oai.rubyforge.org/
I also found: system file, read write o create custom metadata or attributes extended which seems to suggest that what I need may be at the system level, but dropping down there feels dirty and scary.
Anyone know of libraries that could achieve this? How would one create custom metadata for files generated by Ruby?
A very old but interesting question with no answers!
In order for a file to contain metadata, it has to have a format that has some way (implicitly or explicitly) to describe where and how the metadata is stored.
This can be done by the format, such as having a header that says where the "main" data is stored and where the "metadata" is stored, or perhaps implicitly, such as having a length to the "main" data, and storing metadata as anything beyond the "main" data.
This can also be done by the OS/filesystem by storing information along with the files, such as permission info, modtime, user, and more comprehensive file information like "icon" as you would find with iOS/Windows.
(Note that I am using "quotes" around "main" and "metadata" because the reality is that it's all data, and needs to be stored in some way that tools can retrieve it)
A true text file does not contain any headers or any such file format, and is essentially just a continuous block of characters (disregarding how the OS may store it). This also means that it can be generally opened by any text editor, which will merely read and display all the characters it finds.
So the answer in some sense is that you can't, at least not on a true text file that is truly portable to multiple OS.
A few thoughts on how to get around this:
Use binary at the end of the text file with hope/requirements that their text editor will ignore non-ascii.
Store it in the OS metadata for the file and make it OS specific (such as storing it in the "comments" section that an OS may have for a file.
Store it in a separate file that goes "along with" the file (i.e., file.txt and file.meta) and hope that they keep the files together.
Store it in a separate file and zip the text and the meta file together and have your tool be zip aware.
Come up with a new file format that is not just text but has a text section (though then it can no longer be edited with a text editor).
Store the metadata at the end of the text file in a text format with perhaps comments or some indicator to leave the metadata alone. This is similar to the technique that the vi/vim text editor uses to embed vim commands into a file, it just puts them as comments at the beginning or end of the file.
I'm not sure there are many other ways to accomplish what you want, but perhaps one of those will work.
I have been reading about the issue with trying to figure out the actual encoding of a file and all its complications.
But I just need to know what the encoding of a file was set to when it was saved. Does windows store this information somewhere similar to file type , date modified etc., ?
That's not available. The Windows file system (NTFS) doesn't store any metadata for a file beyond the trivial stuff like name, extension, last written date, etcetera. Nothing that's specific for the file type.
All you have available is the BOM, bytes at beginning of the file that indicate the UTF encoding and byte order. It only exists for files encoded in UTF and, unfortunately, is optional. The real troublemakers however are text files that were encoded with a particular 8-bit non-Unicode code page. Usually created by a legacy application. Nothing you can do for that but hope that the file wasn't created too far away from your machine so that the default system code page is a match.
No operating system stores the information about the encoding to a file. the encoding is a property of text file only. Since some text files do not have .txt extension and some .txt file is not really a text file, associating the encoding to a file does not make much sense.
Some UTF-8 files store the byte order mark (BOM) at the beginning of the file which can be used to check whether it is a UTF-8 file or not. However, BOM is not always present and a UTF-8 file does not need to have BOM. So the only way to determine the encoding of the text file is to open it up with different encoding method until you can read the file.