Getting First and Follow metadata from an ANTLR4 parser

Is it possible to extract the first and follow sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.
I would like to parse user input up to the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered; I want to display all possible following tokens at some point mid-parse.
For example:
sentence:
    subjects verb (adverb)? '.' ;
subjects:
    firstSubject (otherSubjects)* ;
firstSubject:
    'The' (adjective)? noun ;
otherSubjects:
    'and the' (adjective)? noun ;
adjective:
    'small' | 'orange' ;
noun:
    CAT | DOG ;
verb:
    'slept' | 'ate' | 'walked' ;
adverb:
    'quietly' | 'noisily' ;
CAT : 'cat';
DOG : 'dog';
Given the grammar above...
If the user has not typed anything yet, the auto-complete list would be ['The']. (Note that I would have to retrieve the FIRST set and not the FOLLOW set of rule sentence, since the follow of the base rule is always EOF.)
If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].
If the input was "The cat slept, the auto-complete list would be ['quietly', 'noisily', '.'].
ANTLR3 provides a way to get the follow set like this:
BitSet followSet = state.following[state._fsp];
This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follows of that rule and then provides them to the user. However, this does not work as well for nested rules (for instance, the base rule), because the follow set ignores any sub-rule follows, as it should.
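For example, something along these lines is what I have in mind (a minimal sketch against the ANTLR 3 Java runtime; MyLangParser and the suggest hook are hypothetical placeholders):
import org.antlr.runtime.BitSet;
import org.antlr.runtime.TokenStream;

public class CompletingParser extends MyLangParser { // hypothetical generated parser
    public CompletingParser(TokenStream input) {
        super(input);
    }

    // Called from a rule action at the user's cursor position.
    void reportFollows() {
        // The follow set pushed by the generated code for the current rule.
        BitSet follows = state.following[state._fsp];
        for (int tokenType : follows.toArray()) {
            suggest(getTokenNames()[tokenType]); // map token type to display name
        }
    }

    void suggest(String tokenName) {
        // hypothetical hook that feeds the auto-completion list
    }
}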
I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine), as well as the FOLLOW set, to cover all valid options. I also think I will need to structure my grammar such that two tokens are never adjacent at the rule level.
I would have to break the above "firstSubject" rule into some sub-rules...
from
firstSubject:
    'The' (adjective)? CAT | DOG;
to
firstSubject:
    the (adjective)? CAT | DOG;
the:
    'the';
I have yet to find any information on retrieving the FIRST set from a rule.
ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.
Any suggestions would be greatly appreciated.

ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.
1. Create a special token which represents the location of interest for code completion.
In some ways, this token behaves like the EOF. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
In other ways, this token is unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a location where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
2. Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.
3. For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.
4. The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.
The following describes AW2's implementation of the above steps.
1. In AW2, this is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
2. In AW2, this specialized operation is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser> and specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
3. In AW2, this analysis is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
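For illustration, the special token from step 1 could be as simple as the following sketch (my own hedged approximation, not AW2's actual LGPL-licensed code; the sentinel value and field names are assumptions):
import org.antlr.v4.runtime.CommonToken;

public class CaretToken extends CommonToken {
    // Assumed sentinel value; it must not collide with any real token type.
    public static final int CARET_TOKEN_TYPE = -10;

    // Type of the (possibly partial) token under the caret, kept so the
    // interpreter can fall back to "fuzzy" identifier/keyword matching.
    private final int originalType;

    public CaretToken(int originalType, String text) {
        super(CARET_TOKEN_TYPE, text);
        this.originalType = originalType;
    }

    public int getOriginalType() {
        return originalType;
    }
}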


My flow fails for no reason: Invalid Template Language? This is what I do

Team,
Occasionally my flow fails, and running it again manually is enough to make it succeed. However, I want to prevent this error from occurring again, for my own peace of mind.
The error that appears is this:
Unable to process template language expressions in action 'Periodo' inputs at line '0' and column '0': 'The template language function 'split' expects its first parameter to be of type string. The provided value is of type 'Null'. Please see https://aka.ms/logicexpressions#split for usage details.'.
And it appears in 2 of the 4 variables that I create:
Client and Periodo
The variable Client looks like this:
The same goes for "Periodo".
The variables are built in the same way.
Client's formula:
trim(first(split(first(skip(split(outputs('Compos'),'client = '),1)),'indicator')))
Periodo's formula:
trim(first(split(first(skip(split(outputs('Compos'),'period = '),1)),'DATA_REPORT_DELIVERY')))
The same pattern applies to all 4 variables; all 4 are strings (containing numbers).
I have also attached the example email from which I extract the info:
CO NIV ICE REFRESCOS DE SOYA has finished successfully.CO NIV ICE REFRESCOS DE SOYA
User
binary.struggle#mail.com
Parameters
output = 7
country = 170
period = 202204012
DATA_REPORT_DELIVERY = NO
read_persistance = YES
write_persistance = YES
client = 18277
indicator_group = SALES
Could you give me some help? Some attempts succeed, but then it fails again for no apparent reason:
Thank you.
I'm not sure if you're interested, but I'd do it a slightly different way. It's a little more verbose, but it will work and it makes your expressions a lot simpler.
I've just taken two of your desired outputs and provided a solution for those, one being client and the other being country. You can apply it to the other two as need be, given it's the same pattern.
If I take client for example, this is the concept.
Initialize Data
This is your string that you provided in your question.
Initialize Split Lines
This will split up your string for each new line. The expression for this step is ...
split(variables('Data'), '\n')
However, you can't just enter that expression into the editor; you need to add it, then edit it in code view and change it from \\n to \n.
Filter For 'client'
This will filter the array created from the split line step and find the item that contains the word client.
contains(item(), 'client')
On the other parallel branches, you'd change out the word to whatever you're searching for, e.g. country.
This should give us a single item array with a string.
Initialize 'client'
Finally, we want to extract the value on the right hand side of the equals sign. The expression for this is ...
trim(split(body('Filter_For_''client''')[0], '=')[1])
Again, just change out the body name for the other action in each case.
I need to use body('Filter_For_''client''')[0] to select the first item, because the filter step returns an array. We're going to assume the length is always 1.
Result
You can see from all of that, you have the value as need be. Like I said, it's a little more verbose but (I think) easier to follow and troubleshoot if something goes wrong.
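If it helps to see the same logic outside the designer, here is a minimal Java sketch of the pipeline (split into lines, keep the line containing the key, take the value to the right of '='); the names are illustrative, not part of the flow:
import java.util.Arrays;
import java.util.Optional;

public class ExtractParameter {
    // Mirrors the steps: Split Lines -> Filter For 'client' -> first item -> split on '=' -> trim.
    static Optional<String> extract(String data, String key) {
        return Arrays.stream(data.split("\n"))
                .filter(line -> line.contains(key))         // the Filter For 'client' step
                .findFirst()                                // assume exactly one matching line
                .map(line -> line.split("=", 2)[1].trim()); // value to the right of '='
    }

    public static void main(String[] args) {
        String data = "period = 202204012\nclient = 18277\nindicator_group = SALES";
        System.out.println(extract(data, "client").orElse("<not found>")); // prints 18277
    }
}
An empty Optional here corresponds to the failure case in the question: when the key is missing, the original nested-split expression hands split() a Null and the action fails.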

ANTLR4 Parser (Flat parse vs Structured parse) for Language Translator

Over the last couple of months, with the help of members of this site, I have been able to write (Phase 1) a Lexer and Parser to translate Lang X to Java. Because I was new to this topic, I opted for a simple line-by-line parser, and it's now able to parse around 1000 language files (circa 1M lines of code) in 15 minutes with a small number of errors/exceptions, the problems being isolated to the source files rather than the parser. I will refer to this as flat parsing, for want of a better expression.
Now for Phase 2, the translation to Java. Like any language, mine has Data Structures, Procedures, Sub-routines, etc., and I thought it best to alter the parser from the version below (for simplicity I have focused on the Data Structure, called TABLE):
// Main entry point of the program
program
    : executableUnit+ EOF
    ;
// Execution units (line by line)
executableUnit
    : itemBlockStart
    | itemBlockEnd
    | itemStatement
    | tableHeader
    ;
itemBlockStart: BEGIN;
itemBlockEnd: END;
tableHeader: // A TABLE declaration statement
    TABLE atom LETTER (atom)*
    ;
// Item statement
itemStatement: // Tables with Item statements
    ITEM atom+
    ;
// Base atom, lowest of the low
atom
    : MINUS? INT                    #IntegerAtom
    | REAL_FORMAT                   #RealAtom
    | FIX_POINT                     #FixPointAtom
    | (MINUS | EQUALS)? NAME DOT?   #NameAtom
    | LETTER                        #LetterAtom
    | keywords DOT?                 #KeywordAtom
    | DOLLAR atom DOLLAR            #DollarAtom
    | hex_assign                    #HexItem
    ;
to this:
// Execution units (by structure)
executableUnit
    : tableStatement
    | itemStatement
    ;
// Table statement, header and body
tableStatement
    : tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*
    ;
Before we go any further: TABLE and individual ITEM statements can occur anywhere in the code, on their own (Java output would be public) or inside a Procedure (Java output would be private).
Imagine my dismay (if you will) when the parser produced the same number of errors but took 10 times longer to parse the input. I kind of understand the increased time, in terms of selecting the right path. My questions for the group are:
Is there a way to force the parser down the TABLE structure early to reduce the time period?
Whether having this logical tree structure grouping is worth the increased time?
My desire to move in this direction was to have a Listener callback with a mini tree with all the relevant items accessible to walk, i.e. if the mini tree wasn't inside a Procedure statement, it was public in Java.
It's not entirely clear to me what performance difference you are referring to (presumably, the difference between the "line by line" parser and this full-file parser?).
A few things that "jump out" about your grammar, and could have some performance impact:
1 - itemBlockStart: BEGIN; and itemBlockEnd: END;. There's no point in having a rule that is a single token; just use the token in the rule definition.
2 - You are, probably unintentionally, being VERY relaxed in the acceptance of itemBlockStart and itemBlockEnd in this rule (tableStatement: tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*;). This could also have performance implications. I'm assuming in the rest of this response that BEGIN should appear at the beginning of an itemStatement and END should appear at the end (not that the three can appear in any order willy-nilly).
Try this refactoring:
// Main entry point of the program
program
    : executableUnit+ EOF
    ;
// Execution units (line by line)
executableUnit
    : itemStatement # ItemStmt
    | tableHeader   # TableHeader
    ;
tableHeader: // A TABLE declaration statement
    TABLE atom LETTER atom*
    ;
// Item statement
itemStatement: // Tables with Item statements
    BEGIN ITEM atom+ END
    ;
// Base atom, lowest of the low
atom
    : MINUS? INT                    #IntegerAtom
    | REAL_FORMAT                   #RealAtom
    | FIX_POINT                     #FixPointAtom
    | (MINUS | EQUALS)? NAME DOT?   #NameAtom
    | LETTER                        #LetterAtom
    | keywords DOT?                 #KeywordAtom
    | DOLLAR atom DOLLAR            #DollarAtom
    | hex_assign                    #HexItem
    ;
Admittedly, I can't quite make out what your intention is, but this should be a step in the right direction.
As Kaby76 points out, the greedy operator at the end of tableHeader is quite likely to "gobble up" a lot of input. This is partly because of the lack of a terminator token, which would, no doubt, stop the token consumption earlier. Moreover, your atom rule seems to be something of a "kitchen sink" rule that can match all manner of input. Couple that with the use of atom+ and atom*, and there's quite a likelihood of consuming a long stream of tokens. Is it really your intention that any of the atoms can appear one after the other with no structure? They appear to be pieces/parts of expressions. If that's the case, you will want to define your grammar for expressions. This added structure will both help performance and give you a MUCH more useful parse tree to act upon.
Much like the structure for tableStatement in your question's grammar, it doesn't really represent any structure (see my recommendation to change it to BEGIN ITEM atom+ END rather than accepting any combination in any order). The same thought process needs to be applied to atom. Both of these approaches let ANTLR march through your code consuming a LOT of tokens without any clue as to whether the order is actually correct (which is then very expensive to attempt to "back out of" when a problem is encountered).

Shell help text syntax for repeatable group of arguments

I'm writing a help output for a Bash script. Currently it looks like this:
dl [m|r]… (<file>|<URL> [m|r|<index>]…)…
The meaning that I'm trying to convey (and elsewhere describe with words) is that (after a potential "m" and/or "r") there can be an endless list of sets of arguments. The first argument in each set is always a file or URL and the further arguments can each be "m", "r" or a number. After that, it starts over with a file or URL and so on.
In my special case, I could just write this:
dl [m|r]… (<file>|<URL>) (<file>|<URL>|m|r|<index>)…
This works, because listing a URL and then another URL with nothing in between is allowed, as well as listing an arbitrarily long chain of "m"s (it's just useless to do so) and pretty much any other combination.
But what if that wasn't the case? What if I had for example a command like this:
change (<from> <to>)…
…which would be used e.g. like this:
change from1 to1 from2 to2 from3 to3
Would the bracket syntax be correct here? I just guessed it based on the grouping of (a|b), but I wasn't able to find any documentation that uses this for multiple, non-exclusive arguments that belong together. Is there even a standard for this?
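For what it's worth, the pairing semantics that the (<from> <to>)… notation is meant to convey can be made concrete with a sketch (a hypothetical Java stand-in for the script, assuming arguments arrive strictly in from/to pairs):
public class Change {
    public static void main(String[] args) {
        // The repeatable group means argv must hold complete <from> <to> pairs.
        if (args.length == 0 || args.length % 2 != 0) {
            System.err.println("usage: change (<from> <to>)...");
            System.exit(2);
        }
        for (int i = 0; i < args.length; i += 2) {
            System.out.println("replace '" + args[i] + "' with '" + args[i + 1] + "'");
        }
    }
}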

Xtext - Multiline String like in YAML

I'm trying to model a YAML-like DSL in Xtext. In this DSL, I need multiline strings as in YAML.
description: |
  Line 1
  line 2
  ...
My first try was this:
terminal BEGIN:
    'synthetic:BEGIN'; // increase indentation
terminal END:
    'synthetic:END'; // decrease indentation
terminal MULTI_LINE_STRING:
    "|"
    BEGIN ANY_OTHER END;
and my second try was
terminal MULTI_LINE_STRING:
    "|"
    BEGIN
    ((!('\n'))+ '\n')+
    END;
but neither of them succeeded. Is there any way to do this in Xtext?
UPDATE 1:
I've tried this alternative as well.
terminal MULTI_LINE_STRING:
    "|"
    BEGIN -> END;
When I triggered the "Generate Xtext Artifacts" process, I got this error:
3492 [main] INFO nerator.ecore.EMFGeneratorFragment2 - Generating EMF model code
3523 [main] INFO clipse.emf.mwe.utils.GenModelHelper - Registered GenModel 'http://...' from 'platform:/resource/.../model/generated/....genmodel'
error(201): ../.../src-gen/.../parser/antlr/lexer/Internal..Lexer.g:236:71: The following alternatives can never be matched: 1
error(3): cannot find tokens file ../.../src-gen/.../parser/antlr/internal/Internal...Lexer.tokens
error(201): ../....idea/src-gen/.../idea/parser/antlr/internal/PsiInternal....g:4521:71: The following alternatives can never be matched: 1
This slide deck shows how we implemented whitespace block scoping in an Xtext DSL.
We used synthetic tokens called BEGIN corresponding to an indent, and END corresponding to an outdent.
(Note: the language was subsequently renamed to RAPID-ML, included as a feature of RepreZen API Studio.)
I think your main problem is that you have not defined when your multiline token ends. Before you come to a solution, you have to make clear in your mind how an algorithm should determine the end of the token. No tool can take this mental burden from you.
Issue: there is no end-marking character. Either you have to define such a character (unlike YAML) or define the end of the token in another way, for example through some sort of semantic whitespace (I think YAML does it like that).
The first approach would make the thing very easy: just read content until you find the closing character. The second approach would probably be manageable using a custom lexer. Basically, you replace the generated lexer with your own implementation that is able to count blanks or similar, as sketched after the links below.
Here are some starting points for how this could be done (different approaches are thinkable):
Writing a custom Xtext/ANTLR lexer without a grammar file
http://consoliii.blogspot.de/2013/04/xtext-is-incredibly-powerful-framework.html
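To illustrate the custom-lexer idea, here is a rough sketch of indentation counting (plain Java; the token plumbing into Xtext/ANTLR is omitted, it only shows how BEGIN/END markers can be synthesized from leading whitespace):
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IndentScanner {
    // Turns changes in leading whitespace into synthetic BEGIN/END markers.
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        Deque<Integer> indents = new ArrayDeque<>();
        indents.push(0);
        for (String line : source.split("\n")) {
            if (line.isBlank()) continue;                // blank lines do not change scope
            int indent = line.length() - line.stripLeading().length();
            if (indent > indents.peek()) {               // deeper: open a block
                indents.push(indent);
                tokens.add("BEGIN");
            }
            while (indent < indents.peek()) {            // shallower: close blocks
                indents.pop();
                tokens.add("END");
            }
            tokens.add("LINE(" + line.trim() + ")");
        }
        while (indents.peek() > 0) {                     // close anything still open at EOF
            indents.pop();
            tokens.add("END");
        }
        return tokens;
    }
}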

Antlr3 behaves differently in 2 different rules when the rules' intent is effectively the same

While working with an Antlr3 grammar, I have come across a situation where two rules whose intent is effectively the same behave differently.
I have created a small example-
I want to parse a qualified object name which may be 3-part or 2-part or unqualified (Dot is the separator).
Test Input-
1. SCH.LIB.TAB1;
2. LIB.TAB1;
3. TAB1;
I changed the below rule from having optionals to having alternatives (ORed rules).
Before State-
qualified_object_name
:
( identifier ( ( DOT identifier )? DOT identifier )? )
;
After State-
qualified_object_name_new
:
( identifier DOT identifier DOT identifier ) // 3 part name
| ( identifier DOT identifier ) // 2 part name
| ( identifier ) // 1 part name
;
Input 1 is parsed correctly by both the rules, but the new rule gives error while parsing input 2 and 3.
line 1:22 no viable alternative at input ';'
I assumed that Antlr would try to match against alternative 1 of qualified_object_name_new, and, when it did not match alternative 1 fully, would then try to match alternative 2, and so on.
So, for input 'LIB.TAB1', it would finally match against alternative 2 of qualified_object_name_new.
However, it does not work this way, and it gives an error while parsing a 2-part name or an unqualified name.
Interestingly, when I set option k = 1, then all 3 inputs are parsed correctly by the new rule.
But with any other value of k, it gives an error.
I want to understand why Antlr behaves this way and is this correct.
You probably have not increased the lookahead size (which is 1 by default in ANTLR3). With one token of lookahead, the new object name rule cannot resolve the ambiguity (all alts start with the same token). You should have gotten a warning about this too.
You have 3 options to solve this problem with ANTLR3 (though I also recommend switching to version 4):
Enable backtracking (see the backtrack option), though I'm not 100% sure if that really helps.
Increase lookahead size (see the k option).
Use syntactic predicates to force ANTLR to look ahead the entire alt.
For more details read the ANTLR3 documentation.
