ANTLR4 parser (flat parse vs. structured parse) for a language translator - performance

Over the last couple of months, with the help of members from this site, I have been able to write (Phase 1) a lexer and parser to translate Lang X to Java. Because I was new to this topic, I opted for a simple line-by-line parser, and it can now parse around 1000 language files (circa 1M lines of code) in 15 minutes with a small number of errors/exceptions, the problems being isolated to the source files rather than the parser. I will refer to this as flat parsing, for want of a better expression.
Now for Phase 2, the translation to Java. Like any language, mine has data structures, procedures, sub-routines, etc., and I thought it best to alter the parser from the grammar below (for simplicity I have focused on the data structure, called TABLE):
// Main entry point of the program
program
: executableUnit+ EOF
;
// Execution units (line by line)
executableUnit:
itemBlockStart
| itemBlockEnd
| itemStatement
| tableHeader
;
itemBlockStart: BEGIN;
itemBlockEnd: END;
tableHeader: // A TABLE declaration statement
TABLE atom LETTER (atom)*
;
// Item statement
itemStatement:
// Tables with Item statements
ITEM atom+
;
// Base atom lowest of the low
atom:
MINUS? INT #IntegerAtom
| REAL_FORMAT #RealAtom
| FIX_POINT #FixPointAtom
| (MINUS | EQUALS)? NAME DOT? #NameAtom
| LETTER #LetterAtom
| keywords DOT? #KeywordAtom
| DOLLAR atom DOLLAR #DollarAtom
| hex_assign #HexItem
;
to this:
// Execution units (by structure)
executableUnit:
tableStatement
| itemStatement
;
// Table statement, header and body
tableStatement:
tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*;
Before we go any further, TABLE and individual ITEM statements can occur anywhere in the code, on their own (the Java output would be public) or inside a Procedure (the Java output would be private).
Imagine my dismay (if you will) when the parser produced the same number of errors, but took 10 times longer to parse the input. I kind of understand the increased time, in terms of selecting the right path. My questions for the group are:
Is there a way to force the parser down the TABLE structure early to reduce the time period?
Whether having this logical tree structure grouping is worth the increased time?
My desire to move in this direction was to have a Listener callback with a mini tree with all the relevant items accessible to walk, i.e. if the mini tree wasn't inside a Procedure statement it was public in Java.

It's not entirely clear to me what performance difference you are referring to (presumably the difference between the "line by line" parser and this full-file parser?).
A few things that "jump out" about your grammar, and could have some performance impact:
1 - itemBlockStart: BEGIN; and itemBlockEnd: END;. There's no point in having a rule that consists of a single token. Just use the token in the rule definition.
2 - You are, probably unintentionally, being VERY relaxed in the acceptance of itemBlockStart and itemBlockEnd in this rule (tableStatement: tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*;). This could also have performance implications. I'm assuming in the rest of this response that BEGIN should appear at the beginning of an itemStatement and END should appear at the end (not that the three can appear in any order willy-nilly).
Try this refactoring:
// Main entry point of the program
program
: executableUnit+ EOF
;
// Execution units (line by line)
executableUnit:
itemStatement # ItemStmt
| tableHeader # TableHeader
;
tableHeader: // A TABLE declaration statement
TABLE atom LETTER atom*
;
// Item statement
itemStatement: // Tables with Item statements
BEGIN ITEM atom+ END
;
// Base atom lowest of the low
atom: MINUS? INT #IntegerAtom
| REAL_FORMAT #RealAtom
| FIX_POINT #FixPointAtom
| (MINUS | EQUALS)? NAME DOT? #NameAtom
| LETTER #LetterAtom
| keywords DOT? #KeywordAtom
| DOLLAR atom DOLLAR #DollarAtom
| hex_assign #HexItem
;
Admittedly, I can't quite make out what your intention is, but this should be a step in the right direction.
As Kaby76 points out, the greedy operator at the end of tableHeader is quite likely to "gobble up" a lot of input. This is partly because of the lack of a terminator token (which would, no doubt, stop the token consumption earlier). Moreover, your atom rule seems to be something of a "kitchen sink" rule that can match all manner of input. Couple that with the use of atom+ and atom* and there's quite the likelihood of consuming a long stream of tokens. Is it really your intention that any of the atoms can appear one after the other with no structure? They appear to be pieces/parts of expressions. If that's the case, you will want to define your grammar for expressions. This added structure will both help performance and give you a MUCH more useful parse tree to act upon.
Much like the structure for tableStatement in your question's grammar, it doesn't really represent any structure (see my recommendation to change it to BEGIN ITEM atom+ END rather than accepting any combination in any order). The same thought process needs to be applied to atom. Both of these approaches let ANTLR march through your code consuming a LOT of tokens without any clue as to whether the order is actually correct (which is then very expensive to attempt to "back out of" when a problem is encountered).


Need an algorithm that detects diffs between two files for additions and reorders

I am trying to figure out if there are existing algorithms that can detect changes between two files in terms of additions but also reorders. I have an example below:
1 - User1 commit
processes = 1
a = 0
allactive = []
2 - User2 commit
processes = 2
a = 0
allrecords = range(10)
allactive = []
3 - User3 commit
a = 0
allrecords = range(10)
allactive = []
processes = 2
I need to be able to say that, for example, user1's code is the three initial lines, user2 added the "allrecords = range(10)" part (as well as a number change), and user3 did not change anything, since he/she just reordered the code.
Ideally, at commit 3, I want to be able to look at the code and say that from character 0 to 20 (this is user1's code), 21-25 user2's code, 26-30 user1's code etc.
I know there are two popular algorithms, Longest common subsequence and longest common substring but I am not sure which one can correctly count additions of new code but be able also to identify reorders.
Of course this still leaves out the question of having the same substring existing twice in a text. Are there any other algorithms that are better suited to this problem?
Each "diff" algorithm defines a set of possible code-change edit types, and then (typically) tries to find the smallest set of such changes that explains how the new file resulted from the old. Usually such algorithms are defined purely syntactically; semantics are not taken into account.
So what you want, based on your example, is an algorithm that allows "change line", "insert line", "move line" (and presumably "delete line" [not in your example, but necessary for a practical set of edits]). Given this, you ought to be able to define a dynamic programming algorithm to find a smallest set of edits to explain how one file differs from another. Note that this set is defined in terms of edits to whole lines, rather like classical "diff"; of course classical diff does not have "change line" or "move line", which is why you are looking for something else.
You could pick different types of deltas. Your example explicitly noted "number change"; if narrowly interpreted, this is NOT an edit on lines, but rather within lines. Once you start to allow partial line edits, you need to define how much of a partial line edit is allowed ("unit of change"). (Will your edit set allow "change of digit"?)
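To make the whole-line edit model concrete, here is a minimal sketch (in Python, since the question doesn't name a language) of the classification step: lines on the longest common subsequence count as kept, lines present in both files but off the LCS count as moved, and the remainder are inserts and deletes. Note that under this simple model a changed line (processes = 1 to processes = 2) shows up as a delete plus an insert; folding those into a single "change line" edit needs the fuller dynamic program described above.

```python
# Classify whole-line edits between two files as kept / moved / inserted /
# deleted. Lines on the longest common subsequence are "kept"; lines present
# in both files but outside the LCS are "moved"; the rest are inserts/deletes.
from collections import Counter

def lcs_table(a, b):
    """dp[i][j] = length of the LCS of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp

def classify(old, new):
    dp = lcs_table(old, new)
    kept, i, j = [], 0, 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            kept.append(old[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    kept_counts = Counter(kept)
    old_rest = Counter(old) - kept_counts   # off-LCS lines from the old file
    new_rest = Counter(new) - kept_counts   # off-LCS lines from the new file
    moved = old_rest & new_rest             # present in both -> reordered
    return kept, moved, new_rest - moved, old_rest - moved
```

Running classify over the User2 and User3 commits from the question reports processes = 2 as moved and nothing inserted or deleted, which is the "user3 only reordered" signal being asked for, at the purely syntactic level.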
Our Smart Differencer family of tools defines the set of edits over well-defined sub-phrases of the targeted language; we use formal language grammar (non)terminals as the unit of change. [This makes each member of the family specific to the grammar of some language] Deltas include programmer-centric concepts such as "replace phrase by phrase", "delete listmember", "move listmember", "copy listmember", "rename identifier"; the algorithm operates by computing a minimal tree difference in terms of these operations. To do this, the SmartDifferencer needs (and has) a full parser (producing ASTs) for the language.
You didn't identify the language for your example. But in general, for a language looking like that, the SmartDifferencer would typically report that User2 commit changes were:
Replaced (numeric literal) "1" in line 1 column 13 by "2"
Inserted (statement) "allrecords = range(10)" after line 2
and that User3 commit changes were:
Move (statement) at line 1 after line 4
If you know who contributed the original code, with the edits you can straightforwardly determine who contributed which part of the final answer. You have to decide the unit-of-reporting; e.g., if you want report such contributions on a line by line basis for easy readability, or if you really want to track that Mary wrote the code, but Joe modified the number.
To detect that User3's change is semantically null can't be done with a purely syntax-driven diff tool of any kind. To do this, the tool has to compute the syntactic deltas somehow, and then compute the side effects of all statements (well, "phrases"), requiring a full static analyzer of the language to interpret the deltas to see if they have such null effects. Such a static analyzer requires a parser anyway, so it makes sense to do a tree-based differencer, but it also requires a lot more than just a parser. [We have such language front ends and have considered building such tools, but haven't gotten there yet.]
Bottom line: there is no simple algorithm for determining "that user3 did not change anything". There is reasonable hope that such tools can be built.

Keeping track of path while recursively going through branches (more info in description )- Using Tcl

Background:
Writing a script in Tcl
Running the script using a tool called IDSBatch from a Linux (CentOS) terminal
I have a system (.rdl file) that contains blocks, groups and registers.
Blocks can contain other blocks, groups, or registers, whereas groups can only contain registers, and registers stand alone.
The problem I am having is that I want to print out the "address" of each register, i.e. the name of the block(s), group and register associated with that specific register. For example:
         ______Block (a)______
        |                     |
    Block (b)              reg(bob)
    |           |
group(tall)  group(short)
 |        |        |
reg(bill) reg(bobby)   reg(burt)
In the end the output should be something along the lines of:
reg one: a.bob
reg two: a.b.tall.bill
reg three: a.b.tall.bobby
reg four: a.b.short.burt
The true problem comes from the fact that blocks can contain blocks, so the system will not always have one to three levels (one level would be Block--reg, two levels would be Block--Block--reg or Block--group--reg, and so on...).
I was leaning towards some sort of recursive solution, where I would access an element, say a block, and get all of its children (groups, blocks and regs), then use the same function to access its children (unless it's a register). This way it can take care of any combination of blocks, groups and registers, but I'm stuck on how to keep track of the address of a specific register.
Thank you for taking the time in reading this and would appreciate any input or suggestions.
You could use a list for doing that.
Starting with an empty list, you append all address parts to it. If you come across a register, you can then construct the path from front to back. After every level of recursion, you remove the last element to get rid of the part you handled.
Example: you just came across the register bill. Then, your list is a -> b -> tall. To get the address, you iterate over the list and concatenate the nodes together, then append bill to the resulting string.
So, your recursion function would be somewhat like
If the currently handled element is a register: Reconstruct the path.
If the currently handled element is not a register: Append the path element to the list, call the function with that list and remove the last element of that list.
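The walk described above can be sketched as follows (in Python rather than Tcl, purely to keep the example compact; the nested-dict node layout is a hypothetical stand-in for the .rdl hierarchy). Passing an extended copy of the path into each recursive call makes the "remove the last element afterwards" bookkeeping implicit; in Tcl you would instead lappend before recursing and truncate the list after, exactly as described.

```python
# Each node is a dict with "kind" ("block", "group" or "reg"), a "name",
# and, for containers, a list of "children". (Hypothetical layout.)
def collect_registers(node, path=()):
    """Return the dotted address of every register below `node`."""
    full = path + (node["name"],)
    if node["kind"] == "reg":
        return [".".join(full)]           # leaf: reconstruct the path
    addresses = []
    for child in node.get("children", []):
        # Each call gets its own extended path, so nothing to undo afterwards.
        addresses.extend(collect_registers(child, full))
    return addresses
```

For the example tree in the question this yields a.bob, a.b.tall.bill, a.b.tall.bobby and a.b.short.burt, regardless of how deeply blocks end up nesting.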

Getting First and Follow metadata from an ANTLR4 parser

Is it possible to extract the first and follow sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.
I would like to parse user input up to the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered; I want to display all possible following tokens at some point mid-parse.
For example:
sentence:
subjects verb (adverb)? '.' ;
subjects:
firstSubject (otherSubjects)* ;
firstSubject:
'The' (adjective)? noun ;
otherSubjects:
'and the' (adjective)? noun;
adjective:
'small' | 'orange' ;
noun:
CAT | DOG ;
verb:
'slept' | 'ate' | 'walked' ;
adverb:
'quietly' | 'noisily' ;
CAT : 'cat';
DOG : 'dog';
Given the grammar above...
If the user had not typed anything yet the auto-complete list would be ['The'] (Note that I would have to retrieve the FIRST and not the FOLLOW of rule sentence, since the follow of the base rule is always EOF).
If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].
If the input was "The cat slept", the auto-complete list would be ['quietly', 'noisily', '.'].
So ANTLR3 provides a way to get the follow set by doing this:
BitSet followSet = state.following[state._fsp];
This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follows of that rule and then provides them to the user. However, this does not work as well for nested rules (for instance, the base rule), because the follow set ignores any sub-rule follows, as it should.
I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine) as well as the FOLLOW set, to cover all valid options. I also think I will need to structure my grammar such that two tokens are never subsequent at the rule level.
I would have to break the above firstSubject rule into some sub-rules...
from
firstSubject:
'The'(adjective)? CAT | DOG;
to
firstSubject:
the (adjective)? CAT | DOG;
the:
'the';
I have yet to find any information on retrieving the FIRST set from a rule.
ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.
Any suggestions would be greatly appreciated.
ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.
1. Create a special token which represents the location of interest for code completion.
In some ways, this token behaves like the EOF. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
In other ways, this token is very unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a location where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
2. Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.
3. For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.
The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.
The following describes AW2's implementation of the above steps.
1. In AW2, this is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
2. In AW2, this specialized operation is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser> and specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
3. In AW2, this analysis is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
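As for the FIRST half of the question: neither ANTLR 3 nor ANTLR 4 exposes FIRST sets through a public API, but the textbook fixed-point computation is short enough to run against your own description of the grammar. The sketch below is Python; the grammar/terminal encoding is my own, with the question's rules hand-expanded (the optional (adverb)? and the (otherSubjects)* loop become alternative productions), so treat it as an illustration rather than ANTLR's actual machinery.

```python
# Fixed-point FIRST-set computation for a context-free grammar.
# `grammar` maps each nonterminal to a list of productions, where a
# production is a list of symbols (terminals or nonterminals).
EPSILON = "<eps>"

def first_sets(grammar, terminals):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                          # iterate until nothing grows
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                before = len(first[nt])
                for sym in prod:
                    syms = {sym} if sym in terminals else first[sym]
                    first[nt] |= syms - {EPSILON}
                    # Stop unless this symbol can derive the empty string.
                    if sym in terminals or EPSILON not in first[sym]:
                        break
                else:                       # every symbol was nullable
                    first[nt].add(EPSILON)
                if len(first[nt]) != before:
                    changed = True
    return first
```

With the question's toy grammar encoded this way, first_sets reports FIRST(sentence) = {'The'}, matching the expected initial completion list, and FIRST(adjective) = {'small', 'orange'}.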

Prolog query not returning the expected result

For university exam revision, I came across a past paper question with a Prolog database with the following structures:
% The structure of a media production team takes the form
% team(Producer, Core_team, Production_assistant).
% Core_team is an arbitrarily long list of staff structures,
% but excludes the staff structures for Producer and
% and Production_assistant.
% staff structures represent employees and take the form
% staff(Surname,Initial,file(Speciality,Grade,CV)).
% CV is an arbitrarily long list of titles of media productions.
team(staff(lyttleton,h,file(music,3,[my_music,best_tunes,showtime])),
[staff(garden,g,file(musical_comedy,2,[on_the_town,my_music])),
staff(crier,b,file(musical_comedy,2,[on_the_town,best_tunes]))],
staff(brooke-taylor,t,file(music,2,[my_music,best_tunes]))).
team(staff(wise,e,file(science,3,[horizon,frontiers,insight])),
[staff(morcambe,e,file(science,3,[horizon,leading_edge]))],
staff(o_connor,d,file(documentary,2,[horizon,insight]))).
team(staff(merton,p,file(variety,2,[showtime,dance,circus])),
[staff(smith,p,file(variety,1,[showtime,dance,circus,my_music])),
staff(hamilton,a,file(variety,1,[dance,best_tunes]))],
staff(steaffel,s,file(comedy,2,[comedians,my_music]))).
team(staff(chaplin,c,file(economics,3,[business_review,stock_show])),
[staff(keaton,b,file(documentary,3,[business_review,insight])),
staff(hardy,o,file(news,3,[news_report,stock_show,target,now])),
staff(laurel,s,file(economics,3,[news_report,stock_show,now]))],
staff(senate,m,file(news,3,[business_review]))).
One of the rules I have to write is the following:
Return the initial and surname of any producer whose team includes 2
employees whose CVs include a production entitled ‘Now’.
This is my solution:
recurseTeam([],0).
recurseTeam([staff(_,_,file(_,_,CV))|List],Sum):-
member(now,CV),
recurseTeam(List,Rest),
Sum is Rest + 1.
query(Initial,Surname):-
team(staff(Surname,Initial,file(Speciality,Grade,CV)),Core_team,Production_assistant),
recurseTeam([staff(Surname,Initial,file(Speciality,Grade,CV)),Production_assistant|Core_team],Sum),
Sum >= 2.
The logic I have here is a recursive predicate which takes each staff member in turn; a match is found only if the CV list contains the production 'now', and as you can see it will return the Initial and Surname of a Producer if at least 2 employees' CVs contain the 'now' production.
So, at least as far as I can see, it should return the c,chaplin producer, right? Because this team has staff members whose CVs contain the 'now' production.
But when I query it, e.g.
query(Initial,Surname).
It returns 'false'.
When I remove the member(now,CV) goal, it successfully returns all four producers, so it would seem the issue lies with this rule. member is the built-in predicate for querying the contents of lists, and CV is the list structure contained within the file structure of a staff structure.
Any ideas why this isn't working as I had expected?
Any suggestions on what else I could try here?
You need one more clause for the recurseTeam predicate, namely for the case that the first argument is a non-empty list, but its first element is a file structure that does not contain now.
In the current version, recurseTeam simply fails as soon as it encounters such an element in the list.
One possible solution is to add the following third clause for recurseTeam:
recurseTeam([staff(_,_,file(_,_,CV))|List],Sum):-
\+ member(now,CV),
recurseTeam(List,Sum).
Alternatively, one can use a cut ! in the second recurseTeam clause after member(now,CV) and drop \+ member(now,CV) in the third clause. This is more efficient, since it avoids calling member(now,CV) twice. (Note, however, that this is a red cut – the declarative and the operational semantics of the program are no longer the same. Language purists may find this disturbing – "real programmers" don't care.)

Format statement with unknown columns

I am attempting to use Fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues: one, I would prefer the elements to be tightly grouped (variable column width); two, I do not know how to define a variable number of array elements in the format statement; and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throw an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
The format closest to what you want is f0.3; this will give no spaces and a fixed number of decimal places. I think if you want to also lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your format statement can be larger than the number of data values, so one (old school) approach is to put a big number there, e.g. 100000. Modern Fortran does have some syntax to specify indefinite repeat; I'm sure someone will offer that up.
----edit
The unlimited repeat is, as you might guess, an asterisk... and is evidently "brand new" in F2008.
In order to make sure that no space occurs between the entries in your line, you can write them separately into character variables and then print them out using the adjustl() function in Fortran:
program csv
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nn = 3
  real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
  integer :: ii
  character(30) :: buffer(nn+2), myformat

  ! Create format string with appropriate number of fields.
  write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"

  ! You should execute the following lines in a loop for every line you want to output
  write(buffer(1), "(F20.2)") 1.0_dp ! a_float
  write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
  do ii = 1, nn
    write(buffer(2+ii), "(F20.3)") floatarray(ii)
  end do
  write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)

end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical formats for the different entries, if you wish.