Most non-confrontational delimiter for my text files? - text-files

I am saving all my notes in a log file. Each line is a note, suffixed by tags, and prefixed by a date and time marker, which currently looks like this: [12.20.09:22.22] ([date:time].
I am planning to have this a long-living format. Notes will be logged willy-nilly with this format 20-30 times a day for years to come. I foresee numerous kinds of parsing for analytics, filtering, searching ...
I am worried about the [ ]s though. Could they possibly trip some parsing code (someone else's if not mine)? What would be the most non-confrontational marker?

If you end up going with your own format, can I recommend ISO 8601 for your date and time format.
In summary, the basic format is:
yyyy-mm-dd hh:mm:ss
You can extend this with timezone and microsecond info if you wish. Timezone is recommended or assume UTC.
With the date/time in this format there's no confusion over which is the month and the day. And it has the bonus of sorting using a basic string sort.

I'd consider using either XML or JSON as the format for the file.
In particular your date/time marker is ambiguous. Is it mm/dd/yy or dd/mm/yy? Or even yy/mm/dd? And in what timezone is the date and time?
Both XML and JSON define a way to have dates that are culture and timezone independent, and (best of all) there's masses of tooling available for both formats.
XML datetime format is defined here: for example, 2000-01-12T12:13:14Z.
JSON datetime format is defined as the number of seconds since Jan 1, 1970, so it's a bit uglier: { currentDate: "#1163531522089#" }

If you want everything to last in a long-lived format, then the metadata needs to be as explicit as possible. If it's intended to be long-lived, then many others will need to read it, as easily as possible.
I agree with Jeremy McGee: XML is an excellent choice. Even if no other data lives, then having it be in the format:
<note>
<datetime>
<year>
2009
</year>
<month>
12
</month>
. . .
</datetime>
<message>
Foo bar baz quox
</message>
<note>
cannot be misunderstood.

This depends on your data. However, if you escape them with a special character of some sort, (i.e. \]) and code accordingly to look at the previous character when finding a "[" or "]", you should have no problem.
Also, if you're open to a new format, I'm a fan of JSON as it's light weight and very useful.

Using '[]' as the markers would be ok provided that you allow the DSL the ability to escape the characters. This is typical of operations on text which need parsing.
As an example check out the typical regular expression syntax which enables '/' as the seperator, whilst letting the user specify an escape character such as '\'. You may get some more ideas from the likes of such Unix tools as; awk, sed and grep

I would tend to think a standardized format is the way to go, with JSON being my personal choice because of it's simplicity. Not only does that help to avoid parsing issues since others have already though about it, you are also given a lot more tools to work with over the life of the project.

Related

RFC3339 - Date when no time is specified

If a date is specified in the format
start=2021-04-05&end=2021-05-05
does that mean that 2021-05-05 is excluded from the results?
and it's returning up to 11:59:59 on the 4th?
In an API I'm using, it seems to be behaving the same as
start=2021-04-05T00:00:00Z&end=2021-05-05T00:00:00Z when no time is specified.
A few things:
I think you are asking about the full-date format from RFC 3339, which is the same as the ISO 8601 extended date format: YYYY-MM-DD
Neither specification says anything about inclusivity or exclusivity of date-only ranges.
ISO 8601 does talk a bit about ranges (they call them intervals), but they are defined as a pair of instants, not whole dates.
The typical best practice (in my experience) would be to use a fully inclusive range for date-only values, or a half-open range for date-time values. For example:
[2021-04-05, 2021-05-05]
[2021-04-05T00:00:00, 2021-04-06T00:00:00)
However, this is not a hard rule. The actual details would be highly specific to the particular API you are using and how the authors of that API designed it to function.
A whole date like 2021-04-05 is not necessarily the same thing as 2021-04-05T00:00:00. In many cases, the reason to use a whole date is to not associate a time or a time zone at all. But again, this is highly implementation specific.
Nothing you've shown would imply that UTC (Z) is being used. If that's how the API is behaving, that's another implementation detail of that API.

PDF date field - one format, multiple valid inputs

I need to create a PDF with a date field that displays dates in the format dd.mm.yyyy. This field must be validated and should also accept inputs in other formats.
For example, users should be able to input dates in the form dd.mm.yy, which will then be expanded to dd.mm.20yy (MS Office apps like Excel do this).
Alternatively, selecting multiple valid formats would be an acceptable solution.
What currently happens:
If the date format of the field is set to dd.mm.yyyy, dd.mm.yy is rejected.
If the date format of the field is set to dd.mm.yy, dd.mm.yyyy is accepted (but formatted to dd.mm.yy).
The last behaviour is almost what i need, just with the wrong format.
Is there a way to do this without custom Javascript? If not, is there a way to still use the built-in formatting or do i have to rewrite everything in JS?
Unfortunately this is not achievable with Acrobat's built-in formatting.
One thing to add to your second bullet point: it trims the date down to .yy when you exit the field, but it still retains all four digits. When you click into the field, it will revert back to being .yyyy. That may or may not matter depending on how you're using it.
Regarding a custom validation, a quick Google search will yield an abundance of Javascript date validation scripts. Something like this could probably be quickly repurposed for your application.

How to change long gene names to abbreviated in some automatic way (microarray data processing)?

Is there any automatic way to convert a list of long gene names (like Cadherin_3453) to its abbreviations, like CDHRN_3453?
Are there any abbreviation name convention in Genomics, Bioinformatics?
Sorry, no code herein
There is the HUGO database which tries to standardize gene names. Depending on your use case you can either try to access their online search every time or download the data and use your own database.
Since you did not post a programming language that you want, I am guessing that this is just a simple, one-time exercise that you would like to do.
Though it is not a true abbreviation, you could just remove all of the vowels in the gene name (as you may have done by accident in your example).
You should use:
http://www.togglecase.com/convert_to_disemvowelled_text.php
It was able to change Cadherin_3453 to Cdhrn_3453.
If you a looking to be able to do this with a program that you can tailor to your specific needs, you can look into this SO question: String replace vowels in Python?

International notation for input time format

I have a page that tells the user to input a time in the format hh:mm:ss. This is obviously fine for English speaking audiences, but is this an acceptable international notation?
If not, could you give me some examples of how other countries display this kind of information to the user.
In terms of time, it is usually either in format hh:mm:ss or h:mm:ss, that is you either do zero-padding or no zero padding for single digits hours.
Another thing is, the format - distinction between 12 hours and 24 hours, that also varies from Locale to Locale.
Lastly, you should take into account local user's time zone - it is natural to enter the time in relation to current time zone.
How would you go about these options? Basically, it depends on what your UI look like. If you have just one text box where user can enter free-form text, you should actually parse the text according to valid Locale format. In that case you can give an example of the format, by formatting current day, so the label would say something like "Enter the time (for example: 16:36:11):".
Other approach is to have two (hour/minute) or more text fields (seconds) separated by time separator (":") and possibly conditional AM/PM radio button group (or drop down). In that case it would be a bit harder, because to do that correctly, you should actually analyze the localized pattern (on what is and what is not supported) and dynamically create UI elements (I don't want to see AM/PM stuff as I am using 24 hours time naturally) as well as validation rules...
I know dates are different between different country, but time is pretty standard. you won't have problem with French, spanish, german, UK... i don't know for asian country tho...

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formating of a flat file. My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited etc). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was any more open or standards based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be to simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves for that however.‡ If you plan on writing a lexer/parser yourself, I can advise PLY (Python Lex-Yacc). But many other solutions exists, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
  †: Yes, that may be an understatement.
  ‡: Even properly describing the email address format is not trivial.
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using xml, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain meta-data such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, xml might be the best solution as you can describe the formats using xml schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions are indeed communicating using standardized message over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure but maybe you could look at the SWIFT Formatting Guide, it may gives you some ideas.
Having that said, check out Flatworm, an humble flat file parser. I've used it to parse positional and/or CSV file and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).

Resources