Reusing assigned labels in rule definition in ANTLR - antlr3

I have a rule to match 'FOR "hi" FOR'
rule : id1=ELEMENT STRING id1
{
// action
}
-> ^(Tree rule)
but it fails with "reference to undefined rule: id1".
How can I reuse a label to ensure the start and end of the rule are the same identifier?

The recommended way to handle this is to assume the values match while parsing, and then examine the AST after parsing is complete, issuing error messages at that time for any mismatched elements.
This approach results in a more robust parser and much more understandable error messages in the case of an error.
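For example, the post-parse check could look something like this (a minimal sketch, assuming the rule builds the ^(Tree ...) node from the question and that its first and last children are the two ELEMENT tokens):

import org.antlr.runtime.tree.CommonTree;

// Post-parse validation sketch: compare the text of the first and last
// children of the node the rule built, assumed to be the ELEMENT tokens.
static void checkMatchingIdentifiers(CommonTree t) {
    String first = t.getChild(0).getText();
    String last  = t.getChild(t.getChildCount() - 1).getText();
    if (!first.equals(last)) {
        // far clearer than a raw syntax error at parse time
        System.err.printf("mismatched identifiers: '%s' at start, '%s' at end%n",
                          first, last);
    }
}

After parsing, this would be called with something like (CommonTree) parser.rule().getTree().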

Related

Algorithm to tell when we've processed a complex variable path expression while parsing?

I am working on a compiler for a homemade programming language and I am stuck on how to convert the lexical token stream into a tree of commands for constructing a DOM-like tree. The "tree of commands" will still be a list, essentially emitting events that describe how to create a tree from the partial information provided by the lexer. (The language is indentation-based, a bit like CoffeeScript, or like XML with an indentation focus.)
I am stuck on how to tell when a variable path has been discovered. A variable path can be simple, or complex, as these examples demonstrate:
foo
foo.bar
foo.bar[baz].hello[and][goodday].there
this[is[even[more.complicated].wouldnt.you[say]]]
They could get more complicated still, if we handled dynamic interpolation of strings, such as:
foo[`bar${x}abc`].baz
But in my simple lang, there are two relevant things: "paths" and "terms". Terms are anything /a-z/ for now, and paths chain together and nest, like the first examples.
For demonstration purposes, everything else is a simple "term" of 1 word, so you might have this:
abc foo.bar[baz].hello[and][goodday].there, one foo.bar
It forms a simple tree.
Right now I have a lexer which spits out the tokens, so basically:
abc
[SPACE]
foo
.
bar
[
baz
]
.
hello
[
and
]
[
goodday
]
.
there
,
[SPACE]
one
[SPACE]
foo
.
bar
That is at least how I broke it up initially.
So given that sequence of strings, how can you generate messages to tell the parser how to build a tree?
term
nest-down
term
period
term
open-square
and
close-square
...
That is the stream of tokens with a name now, but it is not a tree yet. I would like this:
term-start
term # value: abc
term-end
nest-down
term-path-start
term-start
term # value: foo
term-end
period
term-start
term # value: bar
term-end
term-nest-start
term-start
term # value: and
term-end
term-nest-end
...
I have been struggling with this example for several days now (boiled down from a complex real-world scenario). I can't seem to figure out how to keep track of all the information you need to decide when to say "this structure is done now, close it out". Wondering if you know how to get past this.
Note, I don't need the last tree to actually be a tree structure visually, I just need it to generate those messages which can be interpreted on the other end and used to construct a tree at runtime.
There is no way to construct a tree from a list without having the description of the tree in some form. Often, in relation to parsing, the description of this tree is given by a context-free grammar (CFG).
Then you create a parser on the basis of this given CFG. The lexical token stream is given as an input to the parser. The parser organizes the lexical tokens into a tree by using some parsing algorithm.
The parser emits commands for syntax tree construction based on the rules it uses during parsing. On entering a rule, a "rule X enter" command is emitted; on exiting a rule, an "exit X rule" command is emitted. When a lexical token is accepted, a "token forward" command is emitted with its lexeme characters. Some grammars, namely those in ABNF format, support repetitions of elements. Depending on these repetitions, parts of the syntax tree might be represented as lists or arrays.
Then a builder module receives these commands and builds a tree, or uses the commands directly for a specific task via the listener pattern.
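To make the enter/exit idea concrete, here is a minimal sketch in Java (not tied to any particular generator; the grammar, event names, and class are all assumptions) of a recursive-descent parser for the question's path syntax that only emits commands:

import java.util.List;

// Event-emitting recursive-descent parser for the assumed grammar:
//   path : term ('.' term | '[' path ']')* ;
//   term : /[a-z]+/ ;
public class PathEventParser {
    private final List<String> tokens;   // e.g. ["foo", ".", "bar", "[", "baz", "]"]
    private int pos = 0;

    PathEventParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
    private String next() { return tokens.get(pos++); }
    private void emit(String event) { System.out.println(event); }

    // "rule enter"/"rule exit" commands frame each grammar rule
    void path() {
        emit("path-enter");
        term();
        while (".".equals(peek()) || "[".equals(peek())) {
            if (".".equals(next())) {        // consume "." or "["
                term();
            } else {                         // it was "[" -> nested path
                emit("nest-enter");
                path();
                next();                      // consume "]"
                emit("nest-exit");
            }
        }
        emit("path-exit");
    }

    // "token forward" command carries the lexeme
    void term() { emit("term " + next()); }

    public static void main(String[] args) {
        new PathEventParser(List.of("foo", ".", "bar", "[", "baz", "]")).path();
    }
}

Running it prints path-enter, term foo, term bar, nest-enter, path-enter, term baz, path-exit, nest-exit, path-exit, which is exactly the kind of command stream a builder on the other end can replay to construct the tree at runtime.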
I have co-authored (2021) a paper describing a list of commands for building concrete/abstract syntax trees, depending on the CFG's structure, that are used in the parsers generated by the parser generator Tunnel Grammar Studio.
The paper is named "The Expressive Power of the Statically Typed Concrete Syntax Trees" and is in an open-access journal (intentionally). The commands are in section "4.3 Syntax Structure Construction Commands". The article is a bit "compressed" due to space limitations, and it is not really intended as a software development guide but as a note on the approach taken. It might give you some ideas.
Another co-authored paper of mine, from 2021, named "A Parsing Machine Architecture Encapsulating Different Parsing Approaches" (also in an open-access journal), describes a general form of parsing machine and its modules. There, Fig. 1 (p. 33) gives a quick overview.
Disclaimer: I have made the parser generator.

Rascal: Grammar Stack Trace

When parsing a file with a specific grammar fails, I get a corresponding error message with the location in the source file that offended the grammar.
What I would like to look at in these situations would be the list of grammar rules that were active at this moment, something like a grammar rule "stack trace", or the rules that have matched so far.
Is this possible in Rascal?
So, for a very simple example, in the EXP language from the documentation, if I tried to parse "2 + foo" I could get something like
Exp
=> left Exp "+" Exp
=> left IntegerLiteral "+" Exp
=> left IntegerLiteral "+" <?>
No derivation of "foo" from rule 'Exp'
Another way of saying this is looking at an incomplete parse tree, as it was the moment the parse error occurred. Does that make sense?
It makes total sense, but I'm afraid this "incomplete parse tree" feature is on our TODO list.
Note that with the non-deterministic parsing algorithm it would probably return a set of current parse contexts, so a "parse forest" rather than a single stack trace. Still I think that would be a very useful debugging feature.
The only suggestion I can offer right now is "delta debugging": remove half the input and check whether the parse error is still there, then try the other half; rinse, lather, repeat.
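In code, that loop might look something like this (a hedged sketch in Java; parsesOk is a stand-in for a call into the real parser):

import java.util.function.Predicate;

class DeltaDebug {
    // Shrinks a failing input: keep halving as long as one half still fails.
    static String shrink(String input, Predicate<String> parsesOk) {
        while (input.length() > 1) {
            String left  = input.substring(0, input.length() / 2);
            String right = input.substring(input.length() / 2);
            if (!parsesOk.test(left))       input = left;   // error is in the left half
            else if (!parsesOk.test(right)) input = right;  // error is in the right half
            else break;                                     // error needs both halves
        }
        return input; // a much smaller input that still triggers the parse error
    }
}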

How to ensure exhaustivity of enum-based case statements in Ruby like with switch statements in Java

First of all, this question is related to but not solved by:
Advanced Java-like enums in Ruby
Static analysis of exhaustive switch statements of enums
Then, about the question itself: when programming in Java with IDEs like Eclipse, it is possible to get a warning when a switch statement over an enum misses some cases (very useful after adding a new value to the enum, when it is easy to forget to update every switch based on it).
Is it possible to have the same kind of static analysis in Ruby? Is there a way to implement enums so that we would get a warning (maybe after running rubocop or something) if we forget to handle a case?
EDIT
This "enum" I'm talking about could be any type of Set like object with a finite number of values, the most simplest form being an array of symbols, but maybe it is not enough/convenient to perform analysis with it hence why I am starting this question
On of my use case involve checking all possible errors after performing Policy checks
class CanShowArticlePolicy
  attr_reader :error

  def initialize(article)
    @article = article
  end

  def call
    list_of_exceptions = [:unpublished, :deleted, :offensive_content_detected]
    # business logic that returns either true or false and records the
    # failure reason; can be mocked as:
    # @error = list_of_exceptions.sample
    false
  end
end
# in another file, e.g. a controller or service
article = Article.find(id)
policy = CanShowArticlePolicy.new(article)
if policy.call
  render_article
else
  # Where I'm trying to be exhaustive
  case policy.error # <== Goal: detect here that we are switching on an "enum" with finite values and should be exhaustive
  when :unpublished
    render_unpublished_error
  when :deleted
    render_gone
  # <<= Here I would like to get a rubocop error because we've forgotten to handle the `:offensive_content_detected` case
  end
end
Maybe a solution would instead be something like an annotation:
case enum_value # #exhaustive-case with ::CanShowArticlePolicy::ErrorEnum
and the annotation would cause the static analysis to look for a ::CanShowArticlePolicy::ErrorEnum array containing the symbols and to check that there are as many when branches as there are items in the frozen ErrorEnum array.
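For illustration, the shape such an analysis would need, plus a runtime approximation of the check (this is not the static rubocop analysis being asked for; ErrorEnum follows the annotation idea above, and render_blocked is a made-up handler):

# The frozen constant the hypothetical analysis would look for:
class CanShowArticlePolicy
  ErrorEnum = %i[unpublished deleted offensive_content_detected].freeze
end

# Until a static check exists, an `else` clause can at least fail loudly
# at runtime when a value is not handled:
case policy.error
when :unpublished then render_unpublished_error
when :deleted then render_gone
when :offensive_content_detected then render_blocked
else raise ArgumentError, "unhandled policy error: #{policy.error.inspect}"
end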

Java Grammar To AST

In a Java grammar I have a parser rule:
name
: Identifier ('.' Identifier)* ';'
;
How can I get all the identifiers under a single AST node?
It seems impossible to me with just the lexer and parser.
For this you will need what is called a tree walker. This third part of the parsing process lets you walk the generated AST and, for example, count occurrences with a counter.
I'll leave a reference here in case you decide to implement it:
https://theantlrguy.atlassian.net/wiki/display/ANTLR3/Tree+construction
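For what it's worth, if you are using ANTLR 3 with AST output, the tree-rewrite syntax documented at that link can also gather the identifiers under one imaginary root node; a sketch (the imaginary NAME token is an assumption):

tokens { NAME; }

name
    : Identifier ('.' Identifier)* ';' -> ^(NAME Identifier+)
    ;

The rewrite makes every Identifier a child of a single NAME node, which a tree walker can then visit.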
I hope this helps!

Erlang pattern matching bitstrings

I'm writing code to decode messages from a binary protocol. Each message type is assigned a 1-byte type identifier, and each message carries this type id. Messages all start with a common header consisting of 5 fields. My API is simple:
decoder:decode(Bin :: binary()) -> my_message_type() | {error, binary()}
My first instinct is to lean heavily on pattern matching by writing one decode function clause per message type and decoding that message type completely in the function head:
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_X:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeXField1:32, TypeXFld2:32, TypeXFld3:32>>) ->
    #message_x{hdr1=Hdr1, hdr3=Hdr3 ... fld4=TypeXFld3};
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_Y:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeYField1:32, TypeYFld2:16, TypeYFld3:4, TypeYFld4:32,
         TypeYFld5:64>>) ->
    #message_y{hdr1=Hdr1, hdr3=Hdr3 ... fld5=TypeYFld5}.
Note that while the first 5 fields of the messages are structurally identical, the fields after that vary for each message type.
I have roughly 20 message types and thus 20 function clauses similar to the above. Am I decoding the full message multiple times with this structure? Is it idiomatic? Would I be better off decoding just the message type field in the function head and then decoding the full message in the body of the function?
Just to agree that your style is very idiomatic Erlang. Don't split the decoding into separate parts unless you feel it makes your code clearer. Sometimes it can be more logical to do that type of grouping.
The compiler is smart and compiles pattern matching in such a way that it will not decode the message more than once. It will first decode the first two fields (bytes) and then use the value of the second field, the message type, to determine how it is going to handle the rest of the message. This works irrespective of how long the common part of the binary is.
So there is no need to try and "help" the compiler by splitting the decoding into separate parts; it will not make it more efficient. Again, only do it if it makes your code clearer.
Your current approach is idiomatic Erlang, so keep going in this direction. Don't worry about performance; the Erlang compiler does good work here. If your messages really share exactly the same format you could write a macro, but it would generate the same code under the hood, and macros usually hurt maintainability. Just out of curiosity: why are you generating different record types when they all have exactly the same fields? An alternative approach is to translate the message type from a constant into an Erlang atom and store it in a single record type.
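For illustration, that single-record alternative might look like this (record and field names are hypothetical):

%% One record for all messages; `type` is an atom derived from the type id,
%% and the variable part goes into a single `fields` element.
-record(msg, {type, hdr1, hdr3, hdr4, hdr5, fields}).

decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_X:8, Hdr3:8, Hdr4:8, Hdr5:32,
         Fld1:32, Fld2:32, Fld3:32>>) ->
    #msg{type = message_x, hdr1 = Hdr1, hdr3 = Hdr3, hdr4 = Hdr4,
         hdr5 = Hdr5, fields = {Fld1, Fld2, Fld3}}.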