Ruby: Create files with metadata

We're creating an app that is going to generate some text files on *nix systems with hashed filenames to avoid too-long filenames.
However, it would be nice to tag the files with some metadata that gives a better clue as to what their content is.
Hence my question. Does anyone have any experience with creating files with custom metadata in Ruby?
I've done some searching and there seem to be some (very old) gems that read metadata:
https://github.com/kig/metadata
http://oai.rubyforge.org/
I also found: system file, read write o create custom metadata or attributes extended which seems to suggest that what I need may be at the system level, but dropping down there feels dirty and scary.
Anyone know of libraries that could achieve this? How would one create custom metadata for files generated by Ruby?

A very old but interesting question with no answers!
In order for a file to contain metadata, it has to have a format that has some way (implicitly or explicitly) to describe where and how the metadata is stored.
This can be done by the format, such as having a header that says where the "main" data is stored and where the "metadata" is stored, or perhaps implicitly, such as having a length to the "main" data, and storing metadata as anything beyond the "main" data.
This can also be done by the OS/filesystem by storing information along with the files, such as permission info, modtime, user, and more comprehensive file information like "icon" as you would find with iOS/Windows.
(Note that I am using "quotes" around "main" and "metadata" because the reality is that it's all data, and needs to be stored in some way that tools can retrieve it)
A true text file does not contain any headers or any such file format, and is essentially just a continuous block of characters (disregarding how the OS may store it). This also means that it can be generally opened by any text editor, which will merely read and display all the characters it finds.
So the answer in some sense is that you can't, at least not on a true text file that is truly portable to multiple OS.
A few thoughts on how to get around this:
Use binary at the end of the text file with hope/requirements that their text editor will ignore non-ASCII.
Store it in the OS metadata for the file and make it OS specific (such as storing it in the "comments" section that an OS may have for a file).
Store it in a separate file that goes "along with" the file (i.e., file.txt and file.meta) and hope that they keep the files together.
Store it in a separate file and zip the text and the meta file together and have your tool be zip aware.
Come up with a new file format that is not just text but has a text section (though then it can no longer be edited with a text editor).
Store the metadata at the end of the text file in a text format with perhaps comments or some indicator to leave the metadata alone. This is similar to the technique that the vi/vim text editor uses to embed vim commands into a file, it just puts them as comments at the beginning or end of the file.
I'm not sure there are many other ways to accomplish what you want, but perhaps one of those will work.
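As an illustration of the separate-file option above, here is a minimal Ruby sketch. It assumes a SHA-256 hash truncated to 16 hex characters for the filename and a JSON sidecar named `<hash>.meta.json`; the helper names `write_with_sidecar` and `read_sidecar` and the naming scheme are hypothetical, not something from the question.

```ruby
require "digest/sha2"
require "json"

# Writes `content` under a hashed filename and stores descriptive metadata
# in a sidecar file next to it. Returns the path of the data file.
def write_with_sidecar(dir, descriptive_name, content, extra = {})
  # Short, fixed-length name derived from the long descriptive one.
  hashed = Digest::SHA256.hexdigest(descriptive_name)[0, 16]
  data_path = File.join(dir, "#{hashed}.txt")
  meta_path = File.join(dir, "#{hashed}.meta.json")

  File.write(data_path, content)
  metadata = { "original_name" => descriptive_name }.merge(extra)
  File.write(meta_path, JSON.pretty_generate(metadata))
  data_path
end

# Reads the sidecar metadata that belongs to a data file.
def read_sidecar(data_path)
  meta_path = data_path.sub(/\.txt\z/, ".meta.json")
  JSON.parse(File.read(meta_path))
end
```

The text file stays a plain text file that any editor can open; the trade-off, as noted above, is that the two files have to be kept together.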

Related

Rules for file extensions?

Are there any rules for file extensions? For example, I wrote some code which reads and writes a byte pattern that is only understood by that specific program. I'm assuming my antivirus program won't be too happy if I give it the name "pleasetrustme.exe"... Is it generally allowed to use those extensions? And what about the lesser known ones, like ".arw"?
You can use any file extension you want (or none at all). Using standard extensions that reflect the actual type of the file just makes things more convenient. On Windows, file extensions control stuff like how the files are displayed in Windows Explorer and what happens when you double click on it.
I wrote some code which reads and writes a byte pattern that is only
understood by that specific program.
A file extension is only an indication of what type of data will be inside, never a guarantee that certain data formatted in a specific way will be inside the file.
For your own specific data structure it is of course always best to choose an extension that is not already in use for other file formats (or use a general extension like .dat or .bin maybe). This also has the advantage of being able to use an own icon without it being overwritten by other software using the same extension - or the other way around.
But maybe even more important when creating a custom (binary?) file format, is to provide a magic number as the first bytes of that file, maybe followed by a file header structure containing a version number etc. That way your own software can first check the header data to make sure it's the right type and version (for example: anyone could rename any file type to your extension, so your program needs to have a way to do some checks inside the file before reading the remaining data).
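A minimal sketch of the magic-number idea in Ruby, for illustration only. The 4-byte signature `MYFM`, the 16-bit version field, and the helper names are all assumptions made up for this example, not an existing format.

```ruby
MAGIC = "MYFM".b        # hypothetical 4-byte signature
FORMAT_VERSION = 1      # hypothetical format version

# Writes the signature, a 16-bit big-endian version, then the payload.
def write_custom_file(path, payload)
  File.open(path, "wb") do |f|
    f.write(MAGIC)
    f.write([FORMAT_VERSION].pack("n"))
    f.write(payload)
  end
end

# Validates the header before trusting the rest of the file.
def read_custom_file(path)
  File.open(path, "rb") do |f|
    magic = f.read(4)
    raise ArgumentError, "not a MYFM file" unless magic == MAGIC

    version_bytes = f.read(2)
    unless version_bytes && version_bytes.bytesize == 2
      raise ArgumentError, "truncated header"
    end

    version = version_bytes.unpack1("n")
    raise ArgumentError, "unsupported version #{version}" unless version == FORMAT_VERSION

    f.read # remaining bytes are the payload
  end
end
```

With this check in place, a file that was merely renamed to the custom extension is rejected at the header instead of being misread.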

Is the ReplaceFile Windows API a convenience function only?

Is the ReplaceFile Windows API a convenience function only, or does it achieve anything beyond what could be coded using multiple calls to MoveFileEx?
I'm currently in the situation where I need to
write a temporary file and then
rename this temporary file to the original filename, possibly replacing the original file.
I thought about using MoveFileEx with MOVEFILE_REPLACE_EXISTING (since I don't need a backup or anything), but there is also the ReplaceFile API, which is mentioned under Alternatives to TxF.
This got me thinking: Does ReplaceFile actually do anything special, or is it just a convenience wrapper for MoveFile(Ex)?
I think the key to this can be found in this line from the documentation (my emphasis):
The replacement file assumes the name of the replaced file and its identity.
When you use MoveFileEx, the replacement file has a different identity. Its creation date is not preserved, the creator is not preserved, any ACLs are not preserved and so on. Using ReplaceFile allows you to make it look as though you opened the file, and modified its contents.
The documentation says it like this:
Another advantage is that ReplaceFile not only copies the new file data, but also preserves the following attributes of the original file:
Creation time
Short file name
Object identifier
DACLs
Security resource attributes
Encryption
Compression
Named streams not already in the replacement file
For example, if the replacement file is encrypted, but the
replaced file is not encrypted, the resulting file is not
encrypted.
Any app that wants to update a file by writing to a temp and doing the rename/rename/delete dance (handling all the various failure scenarios correctly), would have to change each time a new non-data attribute was added to the system. Rather than forcing all apps to change, they put in an API that is supposed to do this for you.
So you could "just do it yourself", but why? Do you correctly cover all the failure scenarios? Yes, MS may have a bug, but why try to reinvent the wheel?
NB, I have a number of issues with the programming model (better to do a "CreateUsingTemplate") but it's better than nothing.
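For comparison, the plain write-then-rename pattern from the question can be sketched in Ruby as follows. This is only the "do it yourself" variant: like the MoveFileEx approach, a rename gives the target a new identity and does not carry over the attributes listed above. The helper name and the `.tmp.<pid>` naming scheme are hypothetical.

```ruby
# Writes content to a temporary file in the same directory, then renames it
# over the target so readers never observe a half-written file.
def replace_file_contents(path, content)
  tmp_path = "#{path}.tmp.#{Process.pid}" # hypothetical naming scheme
  created = false

  # EXCL makes the open fail instead of clobbering an existing temp file.
  File.open(tmp_path, File::WRONLY | File::CREAT | File::EXCL) do |f|
    created = true
    f.write(content)
    f.fsync # push the data to disk before the rename
  end

  File.rename(tmp_path, path)
ensure
  # Clean up our own temp file if the write or the rename failed.
  File.delete(tmp_path) if created && File.exist?(tmp_path)
end
```

Keeping the temporary file in the same directory as the target matters, because a rename across filesystems is not a simple rename.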

Line breaks don't show up right in YAML?

I am looking at a YAML config file for a database, and all I see is a big jumble of text. However, when I use my keyboard's arrow keys to navigate around, I notice that there is occasionally a spot where the cursor gets stuck and requires me to press the arrow key twice instead of once, as if a character were missing. I am currently assuming that this is a line break that only YAML parsers can read. When I force a line break by pressing ENTER, the YAML parser does not understand the config file anymore. How can I get past this limitation without using a non-Windows program? This line break has a hex value of 0A.
As requested, a snippet of what the current YAML text looks like and what I would like it to look like can be found at the links below (due to StackExchange's limited use of indents). Note that these are two different files for a game's configuration. The API for the parser is here.
What I would like the config to look like
What the config currently looks like
It has also come to my attention that the second link might show it as a YAML file since it registers the line-break as a line break. However, the chunk below might give you an idea of what it looks like to me.
RWtorchLight: Version 1.2 made by MYCRAFTisbest
indent1: ''
NOTE: 'The Meta data valuse is the number after the :'
For Example: Black wool, put 35 in Light_Block and 15 in Meta Data
Light_Block: 89
Meta_Deta_LB: 0
IMPORTANT: The torch and boots are not compatable with Meta Data yet
Torch_Item: 50
Helmet_Item: 314
Boot_Item: 317
indent2: ''
Torch_Use: true
Helmet_Use: true
Boot_Use: true
T-or-T Mode: Will create dim light when wearing pumpkin and all below features
Trick-or-Treat Mode: true
C of C: Chance of Cookie is the chance of how often trick-or-treaters get candy
Set to: '"0" for no chance'
Chance of Cookie: 5000
N of C: 'Will randomly chose a number between 1 and # when Cookies are received'
Number of Cookies: 5
BACKGROUND
After reviewing your question and the associated discussion in comments, a likely case is your YAML file is being corrupted either by:
notepad.exe;
your FTP/SFTP/Web page/whatever used for uploading the text; OR
a combination of both of the above
PROBLEM
YAML syntax is whitespace and indentation sensitive, and using MSFT notepad.exe is not recommended because it may not support the encoding specified in your YAML file.
Since YAML uses whitespace to delimit the data, any kind of modification to the text that is not consistent with the original encoding and whitespace of the original YAML will potentially render the file unusable.
This is one of the aspects of YAML that makes it potentially more brittle than alternative formats, such as JSON or XML.
SOLUTION
Use another editor such as Notepad++ (as recommended in the comments) or, if you do not have sufficient privileges to install another text editor, use an online text editor such as editpad (http://www.editpad.org/) to edit and save the YAML to a local file on your machine.
After saving the file to your local machine using a text editor besides notepad.exe, upload your file using an option that does not apply any kind of text filter to the text.
For example, some websites strip out characters from user-uploaded text to prevent things like data corruption and security risks.
STEP BY STEP
start with a known well-formed YAML file, such as the one you specified in "What I would like the config to look like"
paste it into Notepad++ (local machine) or editpad (web-based editor)
modify the YAML file so it matches the settings you want
save your modifications to the original file
upload the file via SFTP or other means that preserves the original encoding
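To check what an editor or upload step actually did to a file, a small Ruby sketch like the following can count the line-ending styles and confirm that the text still parses. The helper names are hypothetical, and the sample keys are borrowed from the config in the question; read real files with `File.binread` so the bytes are not translated on the way in.

```ruby
require "yaml"

# Counts Windows (CRLF, hex 0D 0A) and Unix (LF only, hex 0A) line endings.
def line_ending_counts(text)
  crlf = text.scan("\r\n").size
  lf_only = text.count("\n") - crlf
  { crlf: crlf, lf_only: lf_only }
end

# Returns the parsed config, or nil if the text is not well-formed YAML.
def parse_config(text)
  YAML.safe_load(text)
rescue Psych::SyntaxError
  nil
end

# LF-only endings, like the file described in the question.
sample = "Light_Block: 89\nTorch_Use: true\n"
counts = line_ending_counts(sample)
config = parse_config(sample)
```

If `lf_only` is non-zero and the editor shows everything on one line, the editor is the problem rather than the file.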

Insert a hyperlink to another file (Word) into Visual Studio code file

I am currently developing some functionality that implements some complex calculations. The calculations themselves are explained and defined in Word documents.
What I would like to do is create a hyperlink in each code file that references the associated Word document - just as you can in Word itself. Ideally this link would be placed in or near the XML comments for each class.
The files reside on a network share and there are no permissions to worry about.
So far I have the following but it always comes up with a file not found error.
file:///\\165.195.209.3\engdisk1\My Tool\Calculations\111-07 MyToolCalcOne.docx
I've worked out the problem is due to the spaces in the folder and filenames.
My Tool
111-07 MyToolCalcOne.docx
I tried replacing the spaces with %20, thus:
file:///\\165.195.209.3\engdisk1\My%20Tool\Calculations\111-07%20MyToolCalcOne.docx
but with no success.
So the question is: what can I use in place of the spaces?
Or, is there a better way?
One way that works beautifully is to write your own URL handler. It's absolutely trivial to do, but so very powerful and useful.
A registry key can be set to make the OS execute a program of your choice when the registered URL is launched, with the URL text being passed in as a command-line argument. It takes just a few lines of code to parse the URL in any way you see fit in order to locate and launch the documentation.
The advantages of this:
You can use a much more compact and readable form, e.g. mydocs://MyToolCalcOne.docx
A simplified format means no trouble trying to encode tricky file paths
Your program can search anywhere you like for the file, making the document storage totally portable and relocatable (e.g. you could move your docs into source control or onto a website and just tweak your URL handler to locate the files)
Your URL is unique, so you can differentiate files, web URLs, and documentation URLs
You can register many URLs, so can use different ones for specs, designs, API documentation, etc.
You have complete control over how the document is presented (does it launch Word, an Internet Explorer, or a custom viewer to display the docs, for example?)
I would advise against using spaces in filenames and URLs - spaces have never worked properly under Windows, and always cause problems (or require ugliness like %20) sooner or later. The easiest and cleanest solution is simply to remove the spaces or replace them with something like underscores, dashes or periods.
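The parsing half of such a handler might look like the Ruby sketch below. It covers only the step of turning a `mydocs://` URL into a file path; the registry setup and launching the document are not shown. The function name and the list of search roots are assumptions for illustration.

```ruby
require "uri"

SCHEME_PREFIX = "mydocs://"

# Maps a mydocs:// URL to the first matching file under the given roots,
# or returns nil if the URL is not ours or no file is found.
def resolve_doc_url(url, roots)
  return nil unless url.start_with?(SCHEME_PREFIX)

  # Drop the scheme and any trailing slash added by the caller.
  encoded = url.delete_prefix(SCHEME_PREFIX).chomp("/")

  # Decode %20 and friends (this call also turns "+" into a space), then keep
  # only the file name so a URL cannot point outside the search roots.
  name = File.basename(URI.decode_www_form_component(encoded))

  roots.map { |root| File.join(root, name) }
       .find { |candidate| File.file?(candidate) }
end
```

Because the lookup goes through a list of roots, moving the documents only requires changing that list, not the links in the code files.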

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited etc). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was any more open or standards based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however.‡ If you plan on writing a lexer/parser yourself, I can advise PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
  †: Yes, that may be an understatement.
  ‡: Even properly describing the email address format is not trivial.
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
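To make the idea concrete, here is a small Ruby sketch of a JSON descriptor for a fixed-width record and a parser driven by it. The descriptor layout (`fields` with `name`, `start`, `length`) is invented for this example and is not the sMash format.

```ruby
require "json"

# Hypothetical descriptor: each field has a name, a zero-based start
# column, and a width in characters.
LAYOUT_JSON = <<~JSON
  {
    "fields": [
      { "name": "id",     "start": 0,  "length": 5 },
      { "name": "name",   "start": 5,  "length": 10 },
      { "name": "amount", "start": 15, "length": 8 }
    ]
  }
JSON

# Slices one fixed-width line into a hash according to the layout.
def parse_fixed_width(line, layout)
  layout.fetch("fields").each_with_object({}) do |field, record|
    raw = line[field.fetch("start"), field.fetch("length")] || ""
    record[field.fetch("name")] = raw.strip
  end
end

layout = JSON.parse(LAYOUT_JSON)
record = parse_fixed_width("00042Widget    00012.50", layout)
```

Supporting another customer's format then means writing another descriptor rather than another parser.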
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using XML, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain metadata such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, XML might be the best solution as you can describe the formats using XML schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions are indeed communicating using standardized messages over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure, but maybe you could look at the SWIFT Formatting Guide; it may give you some ideas.
That said, check out Flatworm, a humble flat file parser. I've used it to parse positional and/or CSV files and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
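The quoting rules described above can be seen with Ruby's `csv` library (bundled with Ruby; recent versions ship it as a gem). This is only a small illustration with made-up field values.

```ruby
require "csv"

fields = ["plain", "has,comma", 'has "quotes"', "two\nlines"]

# Fields containing a comma, a double quote, or a newline are written
# enclosed in double quotes, and embedded double quotes are doubled.
line = CSV.generate_line(fields)

# Parsing the generated text gives back the original values.
parsed = CSV.parse_line(line)
```

Here `line` contains the first field bare and the other three quoted, and `parsed` equals `fields`.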
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).
