ANTLR4 handling of YAML unquoted multi-line strings

I am trying to build a parser for a limited subset of YAML syntax, similar to what is shown below, using ANTLR 4.7:
name:
  last: Smith
  first: John
address:
  street: 123 Main St
          Suite 100
  city: Boston
  state: MA
  zip: 12345
I have a grammar (derived from the Python 3 grammar) that works correctly if I put quotes around the "value" strings, but it fails if I remove them. The trick seems to be defining the "value" string so that matching stops before the "tag:" line that starts the next block or the "tag: " portion of the next assignment.
Does anyone have any ideas or working samples that handle this use case?

It is the indentation of a non-empty line that should end the matching of a plain scalar. If that indentation is not more than the indentation of the current mapping, the scalar ends there.
For example:
mapping:
  key: value with
    multiple lines
  key2:
    other value
Here, the value with multiple lines ends at the line with key2:, because that line is not indented more than the current mapping (i.e. the value of mapping: above). Of course, the final newline character and the indentation of key2: are not part of that scalar's content.
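If you want to sanity-check where that plain scalar ends, a quick way (outside of ANTLR) is to load the example with an existing YAML library. A minimal sketch with PyYAML, assuming the pyyaml package is available:

import yaml

doc = """\
mapping:
  key: value with
    multiple lines
  key2:
    other value
"""

data = yaml.safe_load(doc)
# Line folding joins the continuation line with a single space,
# and the scalar stops before the less-indented 'key2:' line.
assert data["mapping"]["key"] == "value with multiple lines"
assert data["mapping"]["key2"] == "other value"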
In the YAML specification, this is handled by a production
s-indent(n) ::= s-space × n
Now in our case, the inner mapping has an indentation of n=2, so your scalar would be matched by something like
plain-scalar-part (s-indent(3) s-white* plain-scalar-part)*
(I don't know Antlr syntax, just assume these are all non-terminals). After the (possibly empty) first line, you match an indentation of more than the parent mapping (so 3 spaces in this case), then there might be even more whitespace (which is not part of the content), and then more content follows. For simplicity, I ignored possible empty lines.
This will not match the line key2: because it is not indented far enough, which is how the matching of the scalar ends.
Now I do not know how to do something like s-indent(n) in Antlr, but the Python grammar should give you the right pointers.
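Not an ANTLR answer, but here is a rough Python sketch of the s-indent(n) idea, just to make the termination rule concrete; read_plain_scalar is a hypothetical helper, not part of any library:

def read_plain_scalar(lines, start, n):
    # Hypothetical helper: collect a plain scalar that starts at lines[start];
    # n is the indentation of the enclosing block mapping.
    parts = [lines[start].strip()]
    i = start + 1
    while i < len(lines):
        stripped = lines[i].strip()
        indent = len(lines[i]) - len(lines[i].lstrip(" "))
        if stripped and indent <= n:   # not indented more than n -> scalar ends
            break
        if stripped:                   # continuation line, folded with a space
            parts.append(stripped)
        i += 1
    return " ".join(parts), i

lines = [
    "value with",          # first scalar line (the text after 'key: ')
    "    multiple lines",  # indented more than n=2, so still part of the scalar
    "  key2:",             # indented exactly n=2, so the scalar ends here
    "    other value",
]
text, stop = read_plain_scalar(lines, 0, 2)
print(text)  # value with multiple lines
print(stop)  # 2  (matching stopped at the 'key2:' line)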

Related

Helm-Charts(yaml): Regex expression broken

I am working with https://github.com/prometheus-community/helm-charts and am running into some issues with a couple of regex queries that are part of our basic yaml deployments. The issue I'm having is specifically with the Node Exporter part of the prometheus chart. I have configured this:
nodeExporter:
  extraArgs: {
    collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$",
    collector.filesystem.ignored-mount-points="^/etc/.+$",
    collector.netstat.fields="*",
    collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$", # BROKEN
    collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$, # BROKEN
    collector.nfs
  }
  tolerations:
    - operator: Exists
As noted above, these two lines with regex are broken:
collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$", # BROKEN
collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$, # BROKEN
There seems to be a problem with the | character right before "nvme" in the first one, and with the ?: in the second. I believe it's something to do with the regex/YAML format, but I'm not sure how to correct it.
With {, you are beginning a YAML flow mapping. It typically contains comma-separated key-value pairs, though you can also, as in this example, give single values instead, which makes them keys with null values.
In YAML, as soon as you enter a flow-style collection, the special flow indicators cannot be used in plain scalars anymore. The flow indicators are {, }, [, ], and ,. A plain scalar is an unquoted textual value.
The first broken value is illegal because it contains [ and ]. The second broken value is actually legal according to the specification, but quite a few YAML implementations choke on it because ? is also used as the indicator for a mapping key.
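You can reproduce both failures with any YAML loader. Here is a hedged sketch using PyYAML and a shortened stand-in value instead of the full regex:

import yaml

# Unquoted: the '[' starts a flow sequence inside the flow mapping -> parse error.
broken = '{ ignored-devices="^(h|s)d[a-z]$", other }'
try:
    yaml.safe_load(broken)
except yaml.YAMLError as e:
    print("broken:", type(e).__name__)

# Single-quoted: the whole argument becomes one key (with a null value).
fixed = "{ 'ignored-devices=\"^(h|s)d[a-z]$\"', other }"
data = yaml.safe_load(fixed)
print(sorted(data))   # ['ignored-devices="^(h|s)d[a-z]$"', 'other']
print(data['other'])  # None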
You have several options:
Quote the scalars. Since none of them contain single quotes, enclosing each in single quotes will do the trick. Generally you could also double-quote them, but then you would need to escape all double-quote characters and backslashes inside, which does not help readability.
nodeExporter:
  extraArgs: {
    collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$",
    collector.filesystem.ignored-mount-points="^/etc/.+$",
    collector.netstat.fields="*",
    'collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$"',
    'collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$',
    collector.nfs
  }
  tolerations:
    - operator: Exists
Use block scalars. Block scalars are generally the best way to enter scalars with lots of special characters because they are ended via indentation and therefore can contain any special character. Block scalars can only occur in other block structures, so you'd need to make extraArgs a block mapping:
nodeExporter:
  extraArgs:
    ? collector.filesystem.ignored-fs-types="^(devpts|devtmpfs|mqueue|proc|securityfs|binfmt_misc|debugfs|overlay|pstore|selinuxfs|tmpfs|hugetlbfs|nfsd|cgroup|configfs|rpc_pipefs|sysfs|autofs|rootfs)$"
    ? collector.filesystem.ignored-mount-points="^/etc/.+$"
    ? collector.netstat.fields="*"
    ? |-
      collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-)\d+$"
    ? |-
      collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$
    ? collector.nfs
  tolerations:
    - operator: Exists
As you can see, this is now using the previously mentioned ? as key indicator.
Since it is a block mapping, you don't need the commas anymore.
|- starts a literal block scalar from which the final linebreak is stripped.
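If you want to verify that this block-mapping form is equivalent to the single values in the flow mapping, a short PyYAML check (shortened to two representative entries) might look like this:

import yaml

doc = """\
extraArgs:
  ? collector.netstat.fields="*"
  ? |-
    collector.netclass.ignored-devices=^(?:tun|kube|veth|dummy|docker).+$
"""
args = yaml.safe_load(doc)["extraArgs"]
# Both entries are keys with null (None) values, just like the single
# values inside the original flow mapping.
for key, value in args.items():
    print(repr(key), value)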

writing an antlr grammar where whitespace is sometimes significant

This is a dummy example, my actual language is more complicated:
grammar wordasnumber;
WS: [ \t\n] -> skip;
AS: [Aa] [Ss];
ID: [A-Za-z]+;
NUMBER: [0-9]+;
wordAsNumber: (ID AS NUMBER)* EOF;
In this language, these two strings are legal:
seven as 7 eight as 8
seven as 7eight as8
Which is exactly what I told it to do, but not what I want. Because ID and AS are both strings of letters, whitespace is required between them; I would like that second phrase to be a syntax error. I could add some other rule to try to match these mashed-up things ...
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
BAD_THING: ( LETTER+ DIGIT (LETTER|DIGIT)* ) | ( DIGIT+ LETTER (LETTER|DIGIT)* );
ID: LETTER+;
NUMBER: DIGIT+;
... to make the lexer return a different token for these mashed-up things, but this feels like a weird band-aid whose need I only stumbled on by accident; maybe there are more such cases if I stare at my lexer carefully enough.
Is there a better way to do this? My actual grammar is much larger, so making WS not be skipped and placing it explicitly between the tokens where it is required is a non-starter.
There was an older question on this list, which I could not find, that I think asks the same thing: someone parsing whitespace-separated numbers was surprised that 1.2.3 parsed as 1.2 and .3 rather than as a syntax error.
Add another rule for the wrong input, but don't use that in your parser. It will then cause a syntax error when matched:
INVALID: (ID | NUMBER)+;
This additional rule changes the result for the input in the question: the mashed-up runs like 7eight and as8 are now matched as single INVALID tokens, which the parser does not recognize, so it reports syntax errors.
This trick works because ANTLR4's lexer tries to match the longest possible input in one go, and the INVALID rule matches more than ID or NUMBER alone. But you have to place it after those two rules, to take advantage of another lexing rule: "If two lexer rules would match the same input, pick the one defined first." This way, single appearances of ID and NUMBER still get the correct tokens.
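To make the "longest match, first rule wins" behaviour concrete, here is a small Python sketch of that lexing strategy; it is not ANTLR's actual machinery, just the same two rules applied by hand:

import re

# Token rules in grammar order: on equal-length matches, the earlier rule wins.
RULES = [
    ("WS", re.compile(r"[ \t\n]+")),
    ("AS", re.compile(r"[Aa][Ss]")),
    ("ID", re.compile(r"[A-Za-z]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("INVALID", re.compile(r"([A-Za-z]|[0-9])+")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        # maximal munch: longest match wins; ties go to the earlier rule
        name, end = max(
            ((name, m.end()) for name, rx in RULES
             if (m := rx.match(text, pos))),
            key=lambda t: t[1],
        )
        if name != "WS":
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

print(tokenize("seven as 7 eight as 8"))
# [('ID','seven'), ('AS','as'), ('NUMBER','7'), ('ID','eight'), ('AS','as'), ('NUMBER','8')]
print(tokenize("seven as 7eight as8"))
# '7eight' and 'as8' now come out as INVALID tokens, which the parser rejects.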

Reading and writing back yaml files with multi-line strings

I have to read a yaml file, modify it and write it back using PyYAML. Everything works fine except when there are multi-line string values in single quotes, e.g. if the input yaml file looks like
FOO:
  - Bar: '{"HELLO":
      "WORLD"}'
then reading it as data=yaml.load(open("foo.yaml")) and writing it with yaml.dump(data, fref, default_flow_style=False) generates something like
FOO:
- Bar: '{"HELLO": "WORLD"}'
i.e. without the extra line in the Bar value. The strange thing is that if the input file has something like
FOO:
  - Bar: '{"HELLO":

      "WORLD"}'
i.e. one extra newline in the Bar value, then writing it back generates the correct number of newlines. Any idea what I am doing wrong?
You are not doing anything wrong, but you probably should have read more of the YAML specification.
According to the (outdated) 1.1 spec that PyYAML implements, within single-quoted scalars:
In a multi-line single-quoted scalar, line breaks are subject to (flow) line folding, and any trailing white space is excluded from the content.
And line-folding:
Line folding allows long lines to be broken for readability, while retaining the original semantics of a single long line. When folding is done, any line break ending an empty line is preserved. In addition, any specific line breaks are also preserved, even when ending a non-empty line.
This means that your first two examples are the same, as the line break is read as if there were a space.
The third example is different, because it actually contains a newline after loading ("any line break ending an empty line is preserved").
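You can see this directly by loading both variants; a minimal check with PyYAML (using safe_load here):

import yaml

folded = """\
FOO:
  - Bar: '{"HELLO":
      "WORLD"}'
"""
with_blank_line = """\
FOO:
  - Bar: '{"HELLO":

      "WORLD"}'
"""

print(repr(yaml.safe_load(folded)["FOO"][0]["Bar"]))
# '{"HELLO": "WORLD"}'      (the line break folded into a space)
print(repr(yaml.safe_load(with_blank_line)["FOO"][0]["Bar"]))
# '{"HELLO":\n"WORLD"}'     (the empty line kept a real newline)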
In order to understand why that dumps back as it was loaded, you have to know that PyYAML doesn't maintain any information about the quoting (nor about the single newline in the first example); it just loads that scalar into a Python string. During dumping, PyYAML evaluates how that string can best be written, and the options it considers (unless you force things using the default_style argument to dump()) are: plain style, single-quoted style, double-quoted style.
PyYAML will use plain style (without quotes) when possible, but since the string starts with {, this collides with that character's use as the start of a flow-style mapping. So quoting is necessary. Since there are also double quotes in the string, and there are no characters that need backslash escaping, the "cleanest" representation PyYAML can choose is single-quoted style, and in that style it has to represent a line break by including an empty line within the single-quoted scalar.
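Dumping the two loaded strings shows that style choice; a small sketch (the exact line wrapping in the second output may differ slightly):

import yaml

one_line = '{"HELLO": "WORLD"}'       # what the first two examples load to
with_newline = '{"HELLO":\n"WORLD"}'  # what the third example loads to

print(yaml.dump({"FOO": [{"Bar": one_line}]}, default_flow_style=False), end="")
# FOO:
# - Bar: '{"HELLO": "WORLD"}'
print(yaml.dump({"FOO": [{"Bar": with_newline}]}, default_flow_style=False), end="")
# the newline comes back as an empty line inside the single-quoted scalar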
I would personally prefer using a block style literal scalar to represent your last example:
FOO:
- Bar: |
    {"HELLO":
    "WORLD"}
but if you load, then dump that using PyYAML its readability would be lost.
Although worded differently in the YAML 1.2 specification (released almost 10 years ago), the line folding works the same, so this would "work" in a similar way with a more up-to-date YAML loader/dumper. My package ruamel.yaml, for loading/dumping YAML 1.2, will properly maintain the block style if you set the attribute preserve_quotes = True on the YAML() instance, but it will still get rid of the newline in your first example. This could be implemented (as is shown by ruamel.yaml preserving appropriate newline positions in folded-style block scalars), but nobody has ever asked for it, probably because people who want that kind of control over wrapping use a block style to start with.
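For completeness, a hedged ruamel.yaml round-trip sketch along the lines described above (YAML(), preserve_quotes, dump to stdout):

import sys
from ruamel.yaml import YAML

yaml = YAML()                  # round-trip mode by default
yaml.preserve_quotes = True    # as mentioned above
doc = """\
FOO:
- Bar: |
    {"HELLO":
    "WORLD"}
"""
data = yaml.load(doc)
yaml.dump(data, sys.stdout)    # the literal block scalar should come back as-is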

how to test my Grammar antlr4 successfully? [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH matches more than I want it to, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
Language processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with an uppercase first character are lexer rules
the lexer starts at the beginning of the input and tries to find the rule that best matches the current input
a best match is a match of maximum length, i.e. the token that would result from appending the next input character to the maximum-length match is not matched by any lexer rule
tokens are generated from matches:
if exactly one rule matches the maximum-length match, the corresponding token is pushed into the token stream
if multiple rules match the maximum-length match, the first token defined in the grammar is pushed into the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match that is matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE: so each token that you expect to be a TITLE will be a FILEPATH.
Some hints for dealing with that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them in the right order (in your case this will be sufficient).
if you need a parser-driven lexer, you have to switch to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague Mismatched input 'x' expecting 'x' error message when I introduced a new keyword. The reason, in my case, was that I had placed the new keyword rule after my VARNAME lexer rule, so the lexer classified it as a variable name instead of as the new keyword. I fixed it by putting the keyword rules before the VARNAME rule.

YAML comments in multi-line strings

Does YAML support comments in multi-line strings?
I'm trying to do things like this, but the validator is throwing errors:
key:
  #comment
  value
  #comment
  value
  value #comments here don't work either
No. Per the YAML 1.2 spec "Comments must not appear inside scalars". Which is exactly the case here. There's no way in YAML to escape the octothorpe symbol (#) so within a multi-line string there's no way to disambiguate the comment from the raw string value.
You can, however, interleave comments within a collection. For example, if you really needed to, you could break your string into a sequence of strings, one per line:
key: #comment
  - value line 1
  #comment
  - value line 2
  #comment
  - value line 3
Should work...
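If it helps, here is a small PyYAML sketch showing that the interleaved comments are dropped on load and that joining the sequence items recovers the multi-line text:

import yaml

doc = """\
key: #comment
  - value line 1
  #comment
  - value line 2
  #comment
  - value line 3
"""
data = yaml.safe_load(doc)
# The comments disappear on load; joining the items recovers the multi-line text.
print("\n".join(data["key"]))
# value line 1
# value line 2
# value line 3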
