Does antlr4 memoize tokens?

Let's say I have the following expression alternation:
expr
: expr BitwiseAnd expr
| expr BitwiseXor expr
// ...
;
Just for argument's sake, let's say that the expr on the left-hand side turns out to be 1MB. Will ANTLR be able to 'save' that expression so it doesn't have to start from zero on each alternative, or how far does it have to backtrack when it fails to match an alternative?

ANTLR will recognize the 1st expr and then if it doesn't find a BitwiseAnd, it will look for a BitwiseXor to try to match the second alternative. It won't backtrack all the way to trying to recognize the 1st expr again. It's not exactly memoization, but you get the same benefit (arguably even better).
You may find it useful to have ANTLR generate the ATN for your grammar. Use the -atn option when running the antlr4 command; this will generate *.dot files for each of your rules (both lexer and parser). You can then use Graphviz to render them to SVG, PDF, etc. They may look a bit intimidating at first glance, but take a moment with them and you'll get a LOT of insight into how ANTLR goes about parsing your input.
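For example (a sketch, assuming the usual antlr4 alias for the ANTLR tool jar and a Graphviz installation; the exact .dot file names depend on your grammar):
antlr4 -atn Expr.g4              # writes a .dot file per rule
dot -Tsvg expr.dot -o expr.svg   # render one rule's ATN with Graphviz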
The second place to look is the generated parser code. It too is much more understandable than you might expect (especially if reading it with the ATN graph handy).

Related

Macro contains a cycle

So I'm trying to make a lexical analyzer for Scheme, and when I run JFlex on the lexer.flex file I get an error like this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
the macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
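For contrast, here is a minimal sketch of how the split might look (the token names and the symbol(...) helper are illustrative, not from the original spec): the lexer matches only flat tokens, and the recursive structure of definition moves into the parser grammar, where recursion is allowed.
LPAREN = "("
RPAREN = ")"
IDENT  = [a-zA-Z!$%&*/:<=>?^_~+-][a-zA-Z0-9!$%&*/:<=>?^_~+.-]*

%%

{LPAREN}        { return symbol(sym.LPAREN); }
{RPAREN}        { return symbol(sym.RPAREN); }
"begin"         { return symbol(sym.BEGIN); }
"let-syntax"    { return symbol(sym.LET_SYNTAX); }
"letrec-syntax" { return symbol(sym.LETREC_SYNTAX); }
{IDENT}         { return symbol(sym.IDENT); }
[ \t\r\n]+      { /* skip whitespace; no token returned */ }
A rule like \(begin {definition}*\) then becomes a production in the parser grammar.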

What does the "%" mean in tcl?

In a situation like this for example:
[% $create_port %]
or [list [% $RTL_LIST %]]
I realized it has to do with the brackets, but what confuses me is that sometimes it is used with the brackets and a variable following, and sometimes you have brackets with variables inside without the %.
So I'm not sure what it is used for.
Any help is appreciated.
% is not a metacharacter in the Tcl language core, but it still has a few meanings in Tcl. In particular, it's the modulus operator in expr and a substitution field specifier in format, scan, clock format and clock scan. (It's also the default prompt character, and I have a trivial pass-through % command in my ~/.tclshrc to make cut-n-pasting code easier, but nobody else in the world needs to follow my lead there!)
But the code you have written does not appear to be any of those (because it would be a syntax error in all of the commands I've mentioned). It looks like it is some sort of directive processing scheme (with the special sequences being [% and %], with the brackets) though not one I recognise such as doctools or rivet. Because a program that embeds a Tcl interpreter could do an arbitrary transformation to scripts before executing them, it's extremely difficult to guess what it might really be.
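To make the documented uses concrete, here is a quick sketch you can paste into tclsh (the values are arbitrary):
puts [expr {17 % 5}]       ;# modulus operator in expr: prints 2
puts [format "%05d" 42]    ;# substitution field specifier in format: prints 00042
puts [clock format [clock seconds] -format "%Y-%m-%d"]  ;# field specifier in clock format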

Extract function names from function calls in C files

Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a fool-proof pattern for finding method calls, but it should serve for the patterns you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also reuse the regular expressions, already derived from the grammar, that Emacs uses -- imenu, etags, ecb, c-mode, etc.
In the purest sense you can't, because the possibility of nesting function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., must start with a letter or underscore, followed by letters, underscores, numbers, etc.) followed by a left parenthesis, or something along those lines.
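In Ruby, that incremental approach might look like the following sketch (the file name and keyword list are placeholders, and the regex is deliberately naive):
src = File.read("example.c")                      # hypothetical input file
# an identifier followed by '(' -- also matches definitions, macros, keywords
names = src.scan(/\b([A-Za-z_]\w*)\s*\(/).flatten
keywords = %w[if while for switch return sizeof]  # crude keyword filter
puts (names - keywords).uniq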
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.

Using phrase_from_file to read a file's lines

I've been trying to parse a file containing lines of integers using phrase_from_file with the grammar rules
line --> I,line,{integer(I)}.
line --> ['\n'].
thusly: phrase_from_file(line,'input.txt').
It fails, and I got lost very quickly trying to trace it.
I've even tried to print I, but it doesn't even get there.
EDIT:
As none of the solutions below really fit my needs (using read/1 assumes you're reading terms, and sometimes writing that DCG might just take too long), I cannibalized this code I found by googling, the main changes being the addition of:
read_rest(-1, []) :- !.
read_word(C, [], C) :- ( C = 32 ; C = (-1) ), !.
If you are using phrase_from_file/2 there is a very simple way to test your programs prior to reading actual files. Simply call the very same non-terminal with phrase/2. Thus, a goal
phrase(line,"1\n2").
is the same as calling
phrase_from_file(line,fichier)
when fichier is a file containing the above 3 characters. So you can test and experiment in a very compact manner with phrase/2.
There are further issues @Jan Burse already mentioned. SWI reads in character codes. So you have to write
newline --> "\n".
for a newline. And then you still have to parse the integers yourself. But all of that is much easier to test with phrase/2. The nice thing is that you can then switch to reading files without changing the actual DCG code.
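To make this concrete, here is a minimal sketch of a DCG for a file of newline-terminated integers (assuming SWI-Prolog and character codes; the nonterminal names are mine, not from the question):
% one integer per line, each line ended by a newline
lines([])     --> [].
lines([N|Ns]) --> integer(N), "\n", lines(Ns).

integer(N) --> digits(Ds), { Ds \= [], number_codes(N, Ds) }.

digits([D|Ds]) --> [D], { 0'0 =< D, D =< 0'9 }, digits(Ds).
digits([])     --> [].

% test first with phrase/2, e.g. ?- phrase(lines(Ns), `1\n2\n`).
% then switch to phrase_from_file(lines(Ns), 'input.txt').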
I guess there is a conceptual problem here. Although I don't know the details of phrase_from_file/2, i.e. which Prolog system you are using, I nevertheless assume that it will produce character codes. So for an integer 123 in the file you will get the character codes 0'1, 0'2 and 0'3. This is probably not what you want.
If you would like to process the characters, you would need to use a non-terminal instead of a bare variable I to fetch them. And instead of the integer test, you would need a character test, which you can do earlier:
line --> [I], {0'0=<I, I=<0'9}, line.
Best Regards
P.S.: Instead of going the DCG way, you could also use term read operations. See also:
read numbers from file in prolog and sorting

History of trailing comma in programming language grammars

Many programming languages allow trailing commas in their grammar following the last item in a list. Supposedly this was done to simplify automatic code generation, which is understandable.
As an example, the following is a perfectly legal array initialization in Java (JLS 10.6 Array Initializers):
int[] a = { 1, 2, 3, };
I'm curious if anyone knows which language was first to allow trailing commas such as these. Apparently C had it as far back as 1985.
Also, if anybody knows other grammar "peculiarities" of modern programming languages, I'd be very interested in hearing about those also. I read that Perl and Python for example are even more liberal in allowing trailing commas in other parts of their grammar.
I'm not an expert on the commas, but I know that standard Pascal was very persnickety about semicolons being statement separators, not terminators. That meant you had to be very, very careful about where you put one if you didn't want to get yelled at by the compiler.
Later Pascal-esque languages (C, Modula-2, Ada, etc.) had their standards written to accept the odd extra semicolon without behaving like you'd just peed in the cake mix.
I just found out that the g77 Fortran compiler has the -fugly-comma Ugly Null Arguments flag, though it's a bit different (and, as the name implies, rather ugly).
The -fugly-comma option enables use of a single trailing comma to mean “pass an extra trailing null argument” in a list of actual arguments to an external procedure, and use of an empty list of arguments to such a procedure to mean “pass a single null argument”.
For example, CALL FOO(,) means “pass two null arguments”, rather than “pass one null argument”. Also, CALL BAR() means “pass one null argument”.
I'm not sure which version of the language this first appeared in, though.
[Does anybody know] other grammar "peculiarities" of modern programming languages?
One of my favorites, Modula-3, was designed in 1990 with Niklaus Wirth's blessing as the then-latest language in the "Pascal family". Does anyone else remember those awful fights about whether semicolon should be a separator or a terminator? In Modula-3, the choice is yours! The EBNF for a sequence of statements is
stmt ::= BEGIN [stmt {; stmt} [;]] END
Similarly, when writing alternatives in a CASE statement, Modula-3 let you use the vertical bar | as either a separator or a prefix. So you could write
CASE c OF
| 'a', 'e', 'i', 'o', 'u' => RETURN Char.Vowel
| 'y' => RETURN Char.Semivowel
ELSE RETURN Char.Consonant
END
or you could leave off the initial bar, perhaps because you prefer to write OF in that position.
I think what I liked as much as the design itself was the designers' awareness that there was a religious war going on and their persistence in finding a way to support both sides.
Let the programmer choose!
P.S. Objective Caml allows permissive use of | in case expressions whereas the earlier and closely related dialect Standard ML does not. As a result, case expressions are often uglier in Standard ML code.
EDIT: After seeing T.E.D.'s answer I checked the Modula-2 grammar, and he's correct: Modula-2 also supported semicolon as a terminator, but through the device of the empty statement, which makes stuff like
x := x + 1;;;;;; RETURN x
legal. I suppose that's not a bad thing. Modula-2 didn't allow flexible use of the case separator |, however; that seems to have originated with Modula-3.
Something which has always galled me about C is that although it allows an extra trailing comma in an initializer list, it does not allow one in an enumerator list (for defining the literals of an enumeration type). This little inconsistency has bitten me in the ass more times than I care to admit. And for no reason!
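For the record, here is the inconsistency in two lines (to my knowledge, C99 finally fixed it by allowing the enumerator-list comma as well):
int a[] = { 1, 2, 3, };            /* trailing comma legal since C89 */
enum color { RED, GREEN, BLUE, };  /* error in C89, legal since C99 */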