When parsing a file with a specific grammar and the parse fails, I get a corresponding error message with the location in the source file that offended the grammar.
What I would like to look at in these situations would be the list of grammar rules that were active at this moment, something like a grammar rule "stack trace", or the rules that have matched so far.
Is this possible in Rascal?
So, for a very simple example, in the EXP language from the documentation, if I tried to parse "2 + foo" I could get something like
Exp
=> left Exp "+" Exp
=> left IntegerLiteral "+" Exp
=> left IntegerLiteral "+" <?>
No derivation of "foo" from rule 'Exp'
Another way of saying this is that I want to look at the incomplete parse tree as it was at the moment the parse error occurred. Does that make sense?
It makes total sense, but I'm afraid this "incomplete parse tree" feature is on our TODO list.
Note that with the non-deterministic parsing algorithm it would probably return a set of current parse contexts, so a "parse forest" rather than a single stack trace. Still I think that would be a very useful debugging feature.
The only suggestion I can offer right now is "delta debugging": remove half the input and check whether the parse error is still there, then try the other half; rinse/lather/repeat.
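That loop can be sketched in a few lines of Python, assuming a hypothetical parse(src) function that raises an exception on a parse error (all names here are illustrative):

```python
# A minimal sketch of the delta-debugging loop, assuming a hypothetical
# parse(src) function that raises an exception on a parse error.
def still_fails(parse, src):
    try:
        parse(src)
        return False
    except Exception:
        return True

def shrink(parse, src):
    # Repeatedly try dropping half of the input, keeping any half
    # that still reproduces the parse error.
    changed = True
    while changed and len(src) > 1:
        changed = False
        mid = len(src) // 2
        for half in (src[mid:], src[:mid]):
            if still_fails(parse, half):
                src = half
                changed = True
                break
    return src
```

This only narrows the failure down to a contiguous region of the input, but that is often enough to spot which grammar rule is being violated.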
I am working on a compiler for a homemade programming language and I am stuck on how to convert the lexical token stream into a tree of commands for constructing a DOM-like tree. The "tree of commands" will still be a list, essentially emitting events that describe how to create a tree from the partial information provided by the lexer. (This language is like CoffeeScript in a way, indentation based, or like XML with indentation focus.)
I am stuck on how to tell when a variable path has been discovered. A variable path can be simple, or complex, as these examples demonstrate:
foo
foo.bar
foo.bar[baz].hello[and][goodday].there
this[is[even[more.complicated].wouldnt.you[say]]]
They could get more complicated still, if we handled dynamic interpolation of strings, such as:
foo[`bar${x}abc`].baz
But in my simple lang, there are two relevant things, "paths" and "terms". Terms are anything matching /a-z/ for now, and paths chain together and nest, like the first examples.
For demonstration purposes, everything else is a simple "term" of 1 word, so you might have this:
abc foo.bar[baz].hello[and][goodday].there, one foo.bar
It forms a simple tree.
Right now I have a lexer which spits out the tokens, so basically:
abc
[SPACE]
foo
.
bar
[
baz
]
.
hello
[
and
]
[
goodday
]
.
there
,
[SPACE]
one
[SPACE]
foo
.
bar
That is at least how I broke it up initially.
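For what it's worth, a lexer producing exactly that kind of stream can be sketched in a few lines of Python (the token classes and the [SPACE] marker are my guesses from the sample above):

```python
import re

# Terms are /a-z/ words; everything else of interest is '.', '[', ']',
# ',' or a space (which we surface as the [SPACE] marker).
TOKEN_RE = re.compile(r"[a-z]+|[.\[\],]| ")

def lex(source):
    tokens = []
    for match in TOKEN_RE.finditer(source):
        text = match.group()
        tokens.append("[SPACE]" if text == " " else text)
    return tokens
```

Running lex("abc foo.bar[baz]") yields the same flat sequence shown above; the hard part, as the question says, is turning that flat sequence into open/close messages.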
So given that sequence of strings, how can you generate messages to tell the parser how to build a tree?
term
nest-down
term
period
term
open-square
and
close-square
...
That is the stream of tokens, now with names, but it is not a tree yet. I would like this:
term-start
term # value: abc
term-end
nest-down
term-path-start
term-start
term # value: foo
term-end
period
term-start
term # value: bar
term-end
term-nest-start
term-start
term # value: and
term-and
term-nest-end
...
I have been struggling with this example for several days now (boiled down from a complex real-world scenario). I can't seem to figure out how to keep track of all the information needed to decide when to say "this structure is done now, close it out". Wondering if you know how to get past this.
Note, I don't need the last tree to actually be a tree structure visually, I just need it to generate those messages which can be interpreted on the other end and used to construct a tree at runtime.
There is no way to construct a tree from a list without having the description of the tree in some form. Often, in relation to parsing, the description of this tree is given by a context-free grammar (CFG).
Then you create a parser based on this CFG. The lexical token stream is given as input to the parser, and the parser organizes the lexical tokens into a tree using some parsing algorithm.
The parser emits commands for syntax tree construction based on the rules it uses during parsing. On entering a rule, a "rule X enter" command is emitted; on exiting a rule, an "exit X rule" command is emitted. When a lexical token is accepted, a "token forward" command is emitted with its lexeme characters. Some grammar notations, such as ABNF, support repetitions of elements; depending on these repetitions, parts of the syntax tree might be represented as lists or arrays.
Then a builder module receives these commands and builds a tree, or uses the commands directly for a specific task via the listener pattern.
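As a concrete illustration of the enter/exit scheme, here is a tiny recursive-descent parser in Python for the question's path grammar that emits such commands (the rule and command names are mine, not from any particular tool):

```python
# Sketch: recursive descent over  path := term ('.' term | '[' path ']')*
# emitting tree-construction commands instead of building a tree.
def parse_path(tokens, pos=0, emit=print):
    emit("rule path enter")
    emit(f"token forward {tokens[pos]}")        # the leading term
    pos += 1
    while pos < len(tokens) and tokens[pos] in (".", "["):
        if tokens[pos] == ".":
            emit("token forward .")
            emit(f"token forward {tokens[pos + 1]}")
            pos += 2
        else:                                   # '[' nested path ']'
            emit("token forward [")
            pos = parse_path(tokens, pos + 1, emit)
            emit("token forward ]")
            pos += 1
    emit("rule path exit")
    return pos
```

Each "rule path enter"/"rule path exit" pair tells the builder exactly when to open and close a nesting level, which is the piece that is hard to recover from the flat token stream alone.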
I have co-authored (2021) a paper describing a list of commands for building concrete/abstract syntax trees, depending on the CFG's structure, that are used in the parsers generated by the parser generator Tunnel Grammar Studio.
The paper is named "The Expressive Power of the Statically Typed Concrete Syntax Trees". It is in an open-access journal (intentionally). The commands are in section "4.3 Syntax Structure Construction Commands". The article is a bit "compressed" due to space limitations, and it is not really intended as a software development guide but as a record of the approach taken. It might give you some ideas.
Another co-authored paper of mine, from 2021, named "A Parsing Machine Architecture Encapsulating Different Parsing Approaches" (also in an open-access journal), describes a general form of parsing machine and its modules; Fig. 1 on p. 33 gives a quick overview.
Disclaimer: I have made the parser generator.
The way that the "fast pipe" operator is compared to the "pipe last" in many places implies that they are drop-in replacements for each other. Want to send a value in as the last parameter to a function? Use pipe last (|>). Want to send it as the first parameter? Use fast pipe (once upon a time |., now deprecated in favour of ->).
So you'd be forgiven for thinking, as I did until earlier today, that the following code would get you the first match out of the regular expression match:
Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id")
|> Belt.Option.getExn
-> Array.get(1)
But you'd be wrong (again, as I was earlier today...)
Instead, the compiler emits the following warning:
We've found a bug for you!
OCaml preview 3:10-27
This has type:
'a option -> 'a
But somewhere wanted:
'b array
See this sandbox. What gives?
Looks like they screwed up the precedence of -> so that it's actually interpreted as
Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id")
|> (Belt.Option.getExn->Array.get(1));
With the operators inlined:
Array.get(Belt.Option.getExn, 1, Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id"));
or with the partial application more explicit, since Reason's syntax is a bit confusing with regards to currying:
let f = Array.get(Belt.Option.getExn, 1);
f(Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id"));
Replacing -> with |. works. As does replacing the |> with |..
I'd consider this a bug in Reason, but would in any case advise against using "fast pipe" at all since it invites a lot of confusion for very little benefit.
Also see this discussion on Github, which contains various workarounds. Leaving #glennsl's as the accepted answer because it describes the nature of the problem.
Update: there is also an article on Medium that goes into a lot of depth about the pros and cons of "data first" and "data last", specifically as it applies to OCaml / Reason with BuckleScript.
I am trying to run this Prolog code in DrRacket: http://www.anselm.edu/homepage/mmalita/culpro/graf1.html
#lang datalog
arc(a,b).
arc(b,c).
arc(a,c).
arc(a,d).
arc(b,e).
arc(e,f).
arc(b,f).
arc(f,g).
pathall(X,X,[]).
pathall(X,Y,[X,Z|L]):- arc(X,Z),pathall(Z,Y,L). % error on this line;
pathall(a,g)?
However, it is giving following error:
read: expected a `]' to close `['
I suspect the '|' symbol is not being read as the head-tail separator of the list. Additionally, [] also gives an error (if the subsequent line is removed):
#%app: missing procedure expression;
probably originally (), which is an illegal empty application in: (#%app)
How can these be corrected so that the code works and searches for paths between a and g ?
The Datalog module in DrRacket is not an implementation of Prolog, and the syntax that you have used is not allowed (see the manual for the syntax allowed).
In particular, terms cannot be data structures like lists ([]). To run a program like the one above you need a Prolog interpreter that supports data structures.
What you can do is define for instance a predicate path, like in the example that you have linked:
path(X,Y):- arc(X,Y).
path(X,Y):- arc(X,Z),path(Z,Y).
and, for instance, ask if a path exists or not, as in:
path(a,g)?
or print all the paths to a certain node with
path(X,g)?
etc.
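Putting it together, a complete program for DrRacket might look like this (arc facts copied from the question; a sketch, untested):

```racket
#lang datalog
arc(a,b). arc(b,c). arc(a,c). arc(a,d).
arc(b,e). arc(e,f). arc(b,f). arc(f,g).

path(X,Y) :- arc(X,Y).
path(X,Y) :- arc(X,Z), path(Z,Y).

path(a,g)?
```

This answers reachability questions, but without list terms it cannot return the path itself, only whether one exists.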
I have a rule to match 'FOR "hi" FOR'
rule : id1=ELEMENT STRING id1
{
// action
}
-> ^(Tree rule)
but it fails saying reference to undefined rule: id1
How can I reuse a label to ensure the start and end of the rule are the same identifier?
The recommended way to handle this is to assume the values match while parsing, and then examine the AST after parsing is complete, issuing error messages at that time for any mismatched elements.
This approach results in a more robust parser and much more understandable error messages when a mismatch occurs.
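For completeness: the error itself comes from reusing the label id1, which ANTLR then treats as a rule reference. If you do want the parser to check the match immediately, one option is a second label plus a validating semantic predicate, roughly like this (ANTLR 3 sketch, untested):

```antlr
rule : id1=ELEMENT STRING id2=ELEMENT
       { $id1.text.equals($id2.text) }?   // reject mismatched start/end
       {
         // action
       }
       -> ^(Tree rule)
     ;
```

The post-parse check described above still tends to give friendlier error messages than a failed predicate, though.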
I'm trying to get a feel for ANTLR 3, and I pasted the Expression evaluator into an ANTLRWorks window (latest version) and compiled it. It compiled successfully and started, but there were two problems:
Attempting to use an input of 1+2*4/3; resulted in the actual input for the parser being 1+2*43.
One of the errors it shows in its graphical parse tree is MissingTokenException(0!=0).
As I'm new to ANTLR, can someone help?
The example you linked to doesn't support division (just look at the code; you'll notice there's no division):
expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
;
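If you want division, the multExpr rule would need an extra alternative for '/', roughly like this (sketched in the tutorial's style; not tested against it):

```antlr
multExpr returns [int value]
    : e=atom {$value = $e.value;}
      ( '*' e=atom {$value *= $e.value;}
      | '/' e=atom {$value /= $e.value;}
      )*
    ;
```

You would also need a lexer rule (or literal) for '/' so the character isn't silently dropped from the input, which is what produced 1+2*43.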
We often get
MissingTokenException(0!=0)
when we make mistakes. I think it means the parser cannot find a token it's looking for, and it can be triggered by an incorrect token. The parser can sometimes "recover", depending on the grammar.
Remember also that the lexer runs before the parser, so you should check what tokens are actually passed to the parser. The ANTLRWorks debugger can be very helpful here.