Allowed characters in map key identifier in YAML? - validation

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?

According to the YAML 1.2 specification simply advises using printable characters with explicit control characters being excluded (see here):
In constructing key names, characters the YAML spec. uses to denote syntax or special meaning need to be avoided (e.g. # denotes comment, > denotes folding, - denotes list, etc.).
Essentially, you are left to the relative coding conventions (restrictions) by whatever code (parser/tool implementation) that needs to consume your YAML document. The more you stick with alphanumerics the better; it has simply been our experience that the underscore has worked with most tooling we have encountered.
It has been a shared practice with others we work with to convert the period character . to an underscore character _ when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen it misconstrued in some implementations.

Any character (if properly quoted by either single quotes 'example' or double quotes "example"). Please be aware that the key does not have to be a scalar ('example'). It can be a list or a map.

Related

Helm-Charts(yaml): Regex expression broken

I am working with https://github.com/prometheus-community/helm-charts and am running into some issues with a couple of regex queries are a part of our basic yaml deployments. The issue I'm having is specifically with the Node exporter part of the prometheus chart. I have configured this:
nodeExporter:
extraArgs: {
collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$",
collector.filesystem.ignored-mount-points="^/etc/.+$",
collector.netstat.fields="*",
collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$", # BROKEN
collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$, # BROKEN
collector.nfs
}
tolerations:
- operator: Exists
As noted above, these two lines with regex are broken:
collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$", # BROKEN
collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$, # BROKEN
There seems to be a problem with the | character right be fore "nvme" in the first one, and with the ?: in the second. I believe it's something to do with regex/yaml format, but I'm not sure how to correct this.
With {, you are beginning a YAML flow mapping. It typically contains comma-separated key-value pairs, though you can also, like in this example, give single values instead, which will make them a key with null value.
In YAML, as soon as you enter a flow-style collection, all special flow-indicators cannot be used in plain scalars anymore. Special flow indicators are {}[],. A plain scalar is a non-quoted textual value.
The first broken value is illegal because it contains [ and ]. The second broken value is actually legal according to the specification, but quite some YAML implementations choke on it because ? is also used as indicator for a mapping key.
You have several options:
Quote the scalars. since none of them contain single quotes, enclosing each with single quotes will do the trick. Generally you can also double-quote them, but then you need to escape all double-quote characters and all backslashes in there which does not help readability.
nodeExporter:
extraArgs: {
collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$",
collector.filesystem.ignored-mount-points="^/etc/.+$",
collector.netstat.fields="*",
'collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$"',
'collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$',
collector.nfs
}
tolerations:
- operator: Exists
Use block scalars. Block scalars are generally the best way to enter scalars with lots of special characters because they are ended via indentation and therefore can contain any special character. Block scalars can only occur in other block structures, so you'd need to make extraArgs a block mapping:
nodeExporter:
extraArgs:
? collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$"
? collector.filesystem.ignored-mount-points="^/etc/.+$"
? collector.netstat.fields="*"
? |-
collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$"
? |-
collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$
? collector.nfs
tolerations:
- operator: Exists
As you can see, this is now using the previously mentioned ? as key indicator.
Since it is a block sequence, you don't need the commas anymore.
|- starts a literal block scalar from which the final linebreak is stripped.

Semantic meaning of '36_864_7_345ms' as a time literal

Reading the spec for verilog, it appears that
36_864_7_345ms
Is a valid time literal: http://www.ece.uah.edu/~gaede/cpe526/SystemVerilog_3.1a.pdf (see section 2)
Note: decimal_digit is defined as [0-9] in the full IEEE spec.
What is the semantic meaning (if any) of this time literal? Or am I misreading the spec?
Edit:
Looking elsewhere in the spec (section 3.7.9), it appears that the underscore characters are silently discarded. Does the underscore act as an arbitrary seperating character in a similar way as numbers in English (ex. 43,251) have commas to visually separate the numbers? Or is there another meaning altogether?
The spec you quoted from is long since obsolete. Please get the latest from the IEEE where it says in section 5.7.1 Integer literal constants:
The underscore character (_) shall be legal anywhere in a number
except as the first character. The underscore character is ignored.
This feature can be used to break up long numbers for readability
purposes.

What constitutes a valid URI query parameter key?

I'm looking over Section 3.4 of RFC 3986 trying to understand what constitutes a valid URI query parameter key, but I'm not seeing a clear answer.
The reason I'm asking is because I'm writing a Ruby class that composes a URI with query parameters. When a new parameter is added I want to validate the key. Based on experience, it seems like the key will be invalid if it requires any escaping.
I should also say that I plan to validate the key. I'm not sure how to go about validating this data either, but I do know that in all cases I should escape this value.
Advice is appreciated. Advice in the context of how validation might already be possible through say a Ruby Gem would also be a plus.
I could well be wrong, but that spec seems to say that anything following '?' or '#' is valid as long. I wonder if you should be looking more at the spec for 'application/x-www-form-urlencoded' (ie. the key/value pairs we're all used to)?
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
This is the default content type. Forms submitted with this content
type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by +', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by =' and name/value pairs are separated from each other by &'.
I don't believe key=value is part of the RFC, it's a convention that has emerged. Wikipedia suggests this is an 'W3C recommendation'.
Seems like some good stuff to be found searching on the application/x-www-form-urlencoded content type.
http://www.w3.org/TR/REC-html40/interact/forms.html#form-data-set

How to use the `\p{Assigned}` regexp selector?

I just read about the \p{Assigned} selector in the Ruby Regex documentation, which sounds like a good possibility to easily generate an array of characters that it is supposed to match? The only thing it says in the documentation though is 'An assigned character'.
How do I assign one/multiple characters to this selector?
Unicode assigned characters are the whole of graphic, format, control, and private-use characters, i.e. any character that is not reserved for future assignment. You can't assign arbitrary characters to the class \p{Assigned}.
See https://bugs.ruby-lang.org/issues/3838
You need to specify that the encoding is UTF-8 by adding 'u' after the expression
/\p{Assigned}/u

User-defined Literals suffix, with *_digit..."?

A user-defined literal suffix in C++0x should be an identifier that
starts with _ (underscore) (17.6.4.3.5)
should not begin with _ followed by uppercase letter (17.6.4.3.2)
Each name that [...] begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
Is there any reason, why such a suffix may not start _ followed by a digit? I.E. _4 or _3musketeers?
Musketeer dartagnan = "d'Artagnan"_3musketeers;
int num = 123123_4; // to be interpreted in base4 system?
string s = "gdDadndJdOhsl2"_64; // base64decoder
The precedent for identifiers of the form _<number> is the function argument placeholder object mechanism in std::placeholders (§20.8.9.1.3), which defines an implementation-defined number of such symbols.
This is a good thing, because it means the user cannot #define any identifier of that form. §17.6.4.3.1/1:
A translation unit that includes a standard library header shall not #define or #undef names declared in any standard library header.
The name of the user-defined literal function is operator "" _123, not simply _123, so there is no direct conflict between your name and the library name if presence of the using namespace std::placeholders;.
My 2¢, though, is that you would be better off with an operator "" _baseconv and encoding the base within the literal, "123123_4"_baseconv.
Edit: Looking at Johannes' (deleted) answer, there is There may be concern that _123 could be used as a macro by the implementation. This is certainly the realm of theory, as the implementation would have little to gain by such preprocessor use. Furthermore, if I'm not mistaken, the reason for hiding these symbols in std::placeholders, not std itself, is that such names are more likely to be used by the user, such as by inclusion of Boost Bind (which does not hide them inside a named namespace).
The tokens are not reserved for use by the implementation globally (17.6.4.3.2), and there is precedent for their use, so they are at least as safe as, say, forward.
"can" vs "may".
can denotes ability where may denotes permission.
Is there a reason why you would not have permission to the start a user-defined literal suffix with _ followed by a digit?
Permission implies coding standards or best-practices. The examples you provides seem to show that _\d would fine suffixes if used correctly (to denote numeric base). Unfortunately your question can't have a well thought out answer as no one has experience with this new language feature yet.
Just to be clear user-defined literal suffixes can start with _\d.
An underscore followed by a digit is a legal user-defined literal suffix.
The function signature would be:
operator"" _4();
so it couldn;t get eaten by a placeholder.
The literal would be a single preprocessor token:
123123_4;
so the _4 would not get clobbered by a placeholder or a preprocessor symbol.
My reading of 17.6.4.3.5 is that suffixes not containing a leading underscore risk collision with the implementation or future library additions. They also collide with existing suffixes: F, L, ULL, etc. One of the rationales for user-defined literals is that a new type (such as decimals for example) could be defined as a pure library extension including literals with suffuxes d, df, dl.
Then there's the question of style and readability. Personally, I think I would loose sight of the suffix 1234_3; Maybe, maybe not.
Finally, there was some idea that didn't make it into the standard (but I kind of like) to have _ be a literal separator for numbers like in Ada and Ruby. So you could have 123_456_789 to visually separate thousands for example. Your suffix would break if that ever went through.
I knew I had some papers on this subject:
Digital Separators describes a proposal to use _ as a digit separator in numeric literals
Ambiguity and Insecurity with User-Defined literals Describes the evolution of ideas about literal suffix naming and namespace reservation and efforts to deconflict user-defined literals against a future digit separator.
It just doesn't look that good for the _ digit separator.
I had an idea though: how about either a backslash or a backtick for digit separator? It isn't as nice as _ but I don't think there would be any collision as long as the backslash was inside the stream of digits. The backtick has no lexical use currently that I know of.
i = 123\456\789;
j = 0xface\beef;
or
i = 123`456`789;
j = 0xface`beef;
This would leave _123 as a literal suffix.

Resources