What are valid date-time separators in RFC3339 strings? - rfc3339

I'm quite confused as to what's allowed as the time separator/designator in the RFC3339 standard. By time separator I mean the sequence of characters that draw the line between date and time.
The standard states in section 5.6 different things that are unclear or conflicting. First of all, it says that the production rule for a full datetime is this:
date-time = full-date "T" full-time
Meaning that the delimiter between the date and the time is an uppercase T. Right after comes this:
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, but OK, it stills sounds to me within the realm of reasonable. Then the following is stated
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character but anything? which is what this say implies. Or does it by this syntax refer to ISO8601 and unnecessarily describes a detail of that other standard?
In other words, are the following valid RFC3339 strings?
2020-09-07 20:26:03.623359300+02:00
2020-09-07hey johnny20:26:03.623359300+02:00
2020-09-07πŸ’©20:26:03.623359300+02:00

Meaning the upper case T may be a lower case t. It conflicts with the ABNF, [...]
It does not. See 2.3 Terminal Values of RFC 2234:
Literal text strings are interpreted as a concatenated set of
printable characters.
NOTE: ABNF strings are case-insensitive and
the character set for these strings is us-ascii.
So it is allowed to use t here.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character
but anything?
This "deviation" is used for readability to the user when displayed. So when the value is displayed to the user in some kind, it can be displayed as:
2020-09-07 20:26:03.623359300+02:00
2020-09-07, 20:26:03.623359300+02:00
That way it might be easier for the user to see the clear space between the date and time, so they don't have to look for the T or t character to find the separation. It is indeed a vague sentence as it basically mean the application can do anything.
To answer your question: These listed date formats are not valid according to RFC 3339.

Short answer: T (or t as discouraged alternative).
After reading on this as much as I could, it turns out the time separator must be a T or t. What has made think this way is first of all this thread in the GNU lists where F. Alexander Njemz contacted the authors of RFC3339 Graham Klyne and Chris Newman asking if T is mandatory and got this response from Mr. Klyne:
In short: "yes"
Per section 5.5, the intent in this draft was to specify a timestamp format using
elements from and compatible with 8601, but eliminating as far as
reasonable any variations that could make timestamp data harder to
process. This includes making the 'T' mandatory in date+time values.
#g
Just for clarity's sake, this is stated in the section 5.5:
Simplicity is achieved by making most fields and punctuation
mandatory.
This clearly clashes with a non-mandatory T and strongly makes me think that the this syntax in that problematic passage refers to ISO8601 and not RFC3339.
For those who want to read more, here are some links regarding the confusion created by this specific point:
https://lists.gnu.org/archive/html/bug-coreutils/2006-05/msg00014.html
http://validator.w3.org/feed/docs/error/InvalidRFC3339Date.html
https://www.rfc-editor.org/errata/eid5783
Plus of course divergent implementations. For instance, the developers of GNU Date chose to use a space character:
$ date --rfc-3339=seconds
2020-09-14 14:53:51+02:00

Related

What are valid identifiers in R7RS-small?

R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1.
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
|..ident..| are delimiters for symbols in R7RS, to allow any character that you cannot insert in an old style symbol (| is the delimiter).
However, in R6RS the "official" grammar was incorrect, as it did not allow to define symbols such that 1+, which led all implementations define their own rules to overcome this illness of the official grammar.
Unless you need to read the source code of a given implementation and see how it defines the symbols, you should not care too much about these rules and use classical symbols.
In the section 7.1.1 you find the backus-naur form that defines the lexical structure of R7RS identifiers but I doubt the implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom and after that it will classify an atom by backtracking with read-number and if number? fails it will be a symbol.

Semantic meaning of '36_864_7_345ms' as a time literal

Reading the spec for verilog, it appears that
36_864_7_345ms
Is a valid time literal: http://www.ece.uah.edu/~gaede/cpe526/SystemVerilog_3.1a.pdf (see section 2)
Note: decimal_digit is defined as [0-9] in the full IEEE spec.
What is the semantic meaning (if any) of this time literal? Or am I misreading the spec?
Edit:
Looking elsewhere in the spec (section 3.7.9), it appears that the underscore characters are silently discarded. Does the underscore act as an arbitrary seperating character in a similar way as numbers in English (ex. 43,251) have commas to visually separate the numbers? Or is there another meaning altogether?
The spec you quoted from is long since obsolete. Please get the latest from the IEEE where it says in section 5.7.1 Integer literal constants:
The underscore character (_) shall be legal anywhere in a number
except as the first character. The underscore character is ignored.
This feature can be used to break up long numbers for readability
purposes.

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japaneses. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待け時間は{0}εˆ†γ§γ™γ€‚' where {0} is a number that can change over the course of a test. Both of these strings are coming from there respective property files. How would I be able to check for the presence of the string as well as the number that can change depending on the test that's running.
I should have added the fact that I'm checking these strings on a web page which will display in the relevant language depending on the location of where they are been viewed. And I'm using watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested. Another approach is an array or hash for each value, enumerated by a variable with a name like 'language', which lets the tests change the language on the fly. So validations would look something like this
b.div(:id => "wait-time-message).text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but fall within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in ruby, especially using unicode regular expressions in ruby, as well as some experimenting with a tool like Rubular to test regexes
In the case above a regex such as:
/Waiting time is \d+ minutes./ or /待け時間は\d+εˆ†γ§γ™γ€‚/
would match the messages above and expect one or more digits in the middle (note that it would fail if no digits appear, if you want zero or more digits, then you would need a * in place of the +
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

Are VHDL character substitutions ever used in real life?

VHDL allows the following substitutions, presumably because some computers might not support the vertical bar (or pipe symbol) (|) or the hash (or pound sign / number sign) (#):
case A|B can be written as case A!B
16#fff# can be written as 16:fff:
Any computer nowadays supports the vertical bar and the hash symbol, so I figured nobody uses these substitutions... Until somebody requested support for the exclamation mark.
My question: is this a lone case or are other people also using the exclamation mark as substitute for the vertical bar? Anybody using the colon?
Data point 1: Not me :)
And I've never seen it as far as I recall in any code - nor was I taught it at any point (in fact, this is the first I knew of those substitutions).
I had a quick look in Ashenden's Designer's Guide to VHDL, and the ! alternative is not even mentioned when the | is introduced for case statements.
These are inherited from Ada (in which they are obsolescent since Ada95). The Ada83 Rationale says "For portability reasons, it is possible to write any program in a 56 character subset of the ISO character set." in which ISO character set must be understood as ISO-646, aka ASCII (well ISO-646 has provision for replacing some characters for national variant, ASCII can be understood as the US national variant of ISO-646)
There is a third replacement: % can be use instead of " as string delimitor (both must be replaced).
I seem to remember that EBCDIC is using | or ! for the same code point depending on the variant.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: Β¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor

Resources