How to create an AST parser that allows syntax errors? - algorithm

First, what to read about parsing and building an AST?
How to create a parser for a language (like SQL) that will build an AST and allow syntax errors?
For example, for "3+4*5":
  +
 / \
3   *
   / \
  4   5
And for "3+4*+" with syntax error, parser would guess that the user meant:
  +
 / \
3   *
   / \
  4   +
     / \
    ?   ?
Where to start?
SQL:
        SELECT__________________________
       /            \                  \
      .             FROM              JOIN
     / \             |               /    \
    a   city_name  people      address     ON
                                            |
                                  =___________________
                                 /                    \
                               .____                   .
                              /     \                 / \
                             p   address_id          a   id

The standard answer to the question of how to build parsers (that build ASTs) is to read the standard texts on compiling. Aho and Ullman's "Dragon" compiler book is pretty classic. If you haven't got the patience to get the best reference materials, you're going to have more trouble, because they provide theory and investigate subtleties. But here is my answer for people in a hurry, building recursive descent parsers.
One can build parsers with built-in error recovery. There are many papers on this sort of thing, a hot topic in the 1980s. Check out Google Scholar, hunt for "syntax error repair". The basic idea is that the parser, on encountering a parsing error, skips to some well-known beacon (";", a statement delimiter, is pretty popular for C-like languages, which is why you got asked in a comment if your language has statement terminators), or proposes various input stream deletions or insertions to climb over the point of the syntax error. The sheer variety of such schemes is surprising. The key idea is generally to take into account as much information around the point of error as possible. One of the most intriguing ideas I ever saw had two parsers, one running N tokens ahead of the other looking for syntax-error land mines, with the second parser being fed error repairs based on the N tokens available before it encounters the syntax error. This lets the second parser choose to act differently before arriving at the syntax error. If you don't have this, most parsers throw away left context and thus lose the ability to repair. (I never implemented such a scheme.)
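Not from the answer itself, but here is a minimal sketch of the simplest form of that idea, panic-mode recovery to a ";" beacon, written in Python for brevity. The toy grammar (stmt := NAME '=' NUMBER ';'), the token tuples, and the placeholder node are all illustrative assumptions.

def parse_statement(toks, i):
    def expect(kind):
        nonlocal i
        if toks[i][0] != kind:
            raise SyntaxError(f"expected {kind} at token {i}, got {toks[i][0]}")
        i += 1
        return toks[i - 1][1]
    name = expect('NAME')
    expect('=')
    value = expect('NUMBER')
    expect(';')
    return ('assign', name, value), i

def parse_program(toks):
    i, stmts, errors = 0, [], []
    while toks[i][0] != 'EOF':
        try:
            stmt, i = parse_statement(toks, i)
            stmts.append(stmt)
        except SyntaxError as err:
            errors.append(str(err))
            while toks[i][0] not in (';', 'EOF'):   # panic mode: skip ahead to the beacon
                i += 1
            if toks[i][0] == ';':                   # step over it and resume parsing
                i += 1
            stmts.append(('error-stmt',))           # placeholder node keeps the tree whole
    return stmts, errors

toks = [('NAME', 'a'), ('=', '='), ('NUMBER', '1'), (';', ';'),
        ('NAME', 'b'), ('=', '='), ('=', '='), (';', ';'),      # syntax error in here
        ('NAME', 'c'), ('=', '='), ('NUMBER', '3'), (';', ';'),
        ('EOF', '')]
print(parse_program(toks))

On the bad middle statement this reports one error, drops in a placeholder, resumes after the ";" and still parses the final statement.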
The choice of things to insert can often be derived from information used to build the parser (often First and Follow sets) in the first place. This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (oops, there's that compiler book again) of how the parsers are constructed. (I have implemented this scheme successfully several times).
Regardless of what you do, syntax error repair doesn't help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests fancy schemes won't really be helpful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.
A real problem with rolling your own parser for a real language is that real languages are nasty, messy things; people building real implementations get it wrong, and it gets frozen in stone because of existing code bases, or they insist on bending/improving the language (standards are for wimps, goodies are for marketing) because it's cool. Expect to spend a lot of time re-calibrating what you think the grammar is against the ground truth of real code. As a general rule, if you want a working parser, you're better off getting one with a track record than rolling it yourself.
A lesson most people who are hell-bent on building a parser don't get is that if they want to do anything useful with the parse result or tree, they'll need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".

There are two things the parser could do:
Report the error and have the user try again.
Repair the error and proceed.
Generally speaking, the first one is easier (and safer). There may not always be enough information for the parser to infer the intent when the syntax is wrong. Depending on the circumstances, it may be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.
I've written a few hand-rolled recursive descent parsers for little languages. When writing code to interpret the grammar rules explicitly (as opposed to using a parser generator), it's easy to detect errors, because the next token doesn't fit the production rule. Generated parsers tend to spit out a simplistic "expected $(TOKEN_TYPE) here" message, which isn't always useful to the user. With a hand-written parser, it's often easy to give a more specific diagnostic message, but it can be time-consuming to cover every case.
If your goal is to report the problem but keep parsing (so that you can see if there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
You then have to resync to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you might look for a token, like ';', that separates statements. The correct token(s) to look for depends on the language you're parsing. Another possibility is to guess what the user meant (e.g., infer an extra token or a different token at the point the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you could backtrack, make a different guess, and try again.
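To make that concrete with the expression example from the question, here is a toy recursive-descent parser (Python, illustrative only, not the answerer's code) that drops a '?' placeholder node into the tree wherever an operand is missing, so "3+4*+" still produces a complete tree plus a list of error messages instead of an aborted parse.

def parse_expr(tokens):
    """tokens: a list of strings, e.g. ['3', '+', '4', '*', '5']."""
    pos = 0
    errors = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() is not None and peek().isdigit():
            return ('num', advance())
        # Missing operand: record the error and return a '?' placeholder node
        # so the rest of the tree can still be built.
        errors.append(f"expected a number at token {pos}")
        return ('?',)

    def term():
        node = factor()
        while peek() in ('*', '/'):
            node = (advance(), node, factor())
        return node

    def expr():
        node = term()
        while peek() in ('+', '-'):
            node = (advance(), node, term())
        return node

    return expr(), errors

print(parse_expr(['3', '+', '4', '*', '5']))   # clean parse, no errors
print(parse_expr(['3', '+', '4', '*', '+']))   # two '?' placeholders, two error messages

For "3+4*+" this yields ('+', ('+', ('num','3'), ('*', ('num','4'), ('?',))), ('?',)): the missing operand of '*' becomes '?', and the stray trailing '+' is parsed as another addition whose right operand is also '?'.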

Related

Boost Spirit Qi : Is it suitable language/tool to analyse/cut a "multiline" data file?

I want to apply various operations to data files: set algebra, statistics, reporting, changes. But the format of the files is far from the usual code examples and a bit weird. There are different sorts of items and item types, and some of them are grouped together as a collection. There is a simplified example below.
I'm new to boost::spirit and I have tried writing code to split the items and get the basic information (name, version, date) required for most of the processing. So far it seems tricky to me. Is the problem my lack of skill, or is boost::spirit not suitable for this format?
Studying boost::spirit is not a waste of time; I am sure I will use it later. But I haven't found code examples like mine, so I may not be going about it the right way.
>>>process_type_A
//name(typeA_1)
//version(A.1.99)
//date(2016.01.01)
//property1 "pA11"
//property2 "pA12"
//etc_A_1 (thousands of lines - a lot are "multiline" and/or multiline sub-records)
<<<process_type_A
>>>process_type_A
//name(typeA_2)
//version(A.2.99)
//date(2016.01.02)
//property1 "pA21"
//property2 "pA22"
//etc_A_2 (hundred or thousand of lines)
<<<process_type_A
>>>process_type_B
//name(typeB_1)
//version(B.1.99)
//date(2016.02.01)
//property1 "pB11"
//property2 "pB12"
//etc_B_1 (hundred or thousand of lines)
<<<process_type_B
>>>paramset_type_C
//>>paramlist
////name(typeC_1)
////version(C.1.99)
////date(2016.03.01)
////property1 "pC11"
////property2 "pC12"
////etc_C_1 (hundred or thousand of lines)
//<<paramlist
//>>paramlist
////name(typeC_2)
////version(C.2.99)
////date(2016.04.01)
////property1 "pC21"
////property2 "pC22"
////etc_C_2 (hundred or thousand of lines)
//<<paramlist
<<<paramset_type_C
Code::Blocks
Boost 1.60.0
GCC Compiler on Windows and Linux
I think #Orient is right: regex w/captures is enough here.
However, Spirit has the upside of coming without a linker dependency. Here are some approaches (using seek[] and raw[]) for inspiration:
Boost spirit revert parsing
rule to extract key+phrases from a text document
Parsing text file with binary envelope using boost Spirit (binary content)
much more involved logic: How to implement #ifdef in a boost::spirit::qi grammar?
Note that Spirit X3 (still experimental) also has a seek[] directive, and it will compile much faster.
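For comparison, here is a minimal Python sketch of the regex-with-captures route mentioned at the top of this answer, outside Spirit entirely. It assumes the layout shown in the question (top-level >>>tag ... <<<tag blocks with //name(...), //version(...) and //date(...) lines somewhere inside), and 'data.txt' is just a placeholder path.

import re

BLOCK = re.compile(r'^>>>(\S+)\n(.*?)^<<<\1\s*$', re.S | re.M)
FIELD = re.compile(r'^/+(name|version|date)\(([^)]*)\)', re.M)

def split_items(text):
    """Yield (block_type, {field: value}) for every top-level >>> ... <<< block."""
    for match in BLOCK.finditer(text):
        block_type, body = match.group(1), match.group(2)
        yield block_type, dict(FIELD.findall(body))

if __name__ == '__main__':
    text = open('data.txt', encoding='utf-8').read()
    for block_type, fields in split_items(text):
        print(block_type, fields.get('name'), fields.get('version'), fields.get('date'))

Note that for the paramset_type_C block this keeps only the last name/version/date triple; handling the nested //>>paramlist ... //<<paramlist sub-blocks would need a second, similar pass over each block body.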
The main advice I would give about Qi is that it is a very powerful and flexible tool for parsing. You can define quite complicated, possibly recursive structures using boost::variant, boost::optional, etc., associate these types with qi rules, and it seemingly magically does the right thing, giving you a nice AST for your data.
The biggest source of difficulty in my (limited) experience is when you try to make it do more than that and also process the data. It's sometimes tempting to "eagerly" do some processing at the same time that you are parsing the data, often in a semantic action or something. Don't do it! It usually makes things harder to read in the end, a bit harder to debug, and sometimes you can be surprised by what happens if the grammar has to backtrack across a semantic action it has already executed.
qi should work great if you can write a nice grammar for your data. If you can't write an unambiguous grammar, you might be able to use qi::eps to make it parseable but you don't want to have to do that too often IMO. I don't think "hundreds or thousands" of items will pose any particular problem.
Right now the question is rather opinion-oriented -- if you can post a more complete description of the data format you have, or better, a complete code example which is failing, it might make it easier to give precise answers.

Error handling monads in Scala? Try vs Validation

scalaz.Validation is said to be more powerful than the Try monad, because it can accumulate errors.
Are there any occasions where you might choose Try over scalaz.Validation or scalaz.\/ ?
The most significant argument in favor of Try is that it's in the standard library. It's also used in the standard library—for example the callbacks you register with Future's onComplete must be functions from Try. It may be used more extensively in the standard library in the future.
The fact that it's in the standard library also means it'll look familiar to more people. You'll probably tend to find it in more of the third-party libraries you use. And of course sometimes you may not be allowed to use Scalaz (or any other dependencies) or may want to avoid Scalaz for other perfectly good reasons.
Other stuff: I can't remember the last time I wrote a \/ that didn't have Throwable on its left side (I have—it's just not something I do often). Try bakes this in, so you don't have to worry about writing an alias or whatever.
As senia notes in the comments above, there's arguably something a little unintuitive about biasing an either-like type but still using the language of "left" and "right" (as \/ is, and does). Why does \/ bind through the right side? Because it does, that's why. I personally don't find the naming that objectionable, but I can understand why some people might. Try avoids the issue by having constructor names that clearly indicate their semantics: Success and Failure, not Left and Right or -\/ and \/-.
Now that we're getting to the completely superficial and subjective reasons to use Try: some people may just think \/ and -\/ and \/- are ugly. I generally don't mind operator-heavy code, but I find the jumble of slashes and dashes really unpleasant to type and read.
So those are some arguments in favor of Try, as requested, but I'll conclude by saying that I never use it, myself. I don't specifically care all that much about the fact that it violates the monad laws (although I can understand why people do), but I do find \/ and Validation much less ad-hoc and easier to reason about, and I like having access to both (Validation when I want to accumulate errors, \/ when I need monadic sequencing) in a single framework.

Grammar for the Chef Language

I'm just starting to use antlr, with antlr for ruby. The version is 3.2.1
I'm trying to create a parser for the chef language, and the grammar is giving me a real headache :P I'm sure I'm missing some fundamental concept, but I just couldn't figure it out.
I created 3 grammars. The main one is the recipe parser, which (of course) parses the recipes. Once a recipe is parsed, I use the other 2 grammars, which parse ingredients and instructions (the method section).
My problem is with the last one, the one that parses the instructions, such as "put ... into the mixing bowl", "liquefy ...", etc. Everything works great except for a few rules. I've posted the Instructions.g source here, at paste.bin because of its length.
Here's what's happening:
When I uncomment the rules combine_ingredient_into_mixing_bowl or divide_ingredient_into_mixing_bowl, the parser stops recognizing almost all of the other rules (such as put_ingredient_into_mixing_bowl). This seems strange to me, because they don't seem to override each other (of course they are, somehow). I get the error: "line 0:-1 mismatched input "" expecting WS"
stir_mixing_bowl does not match anything, but it's really no different from the other rules that do work ok. I get the error: "line 0:-1 mismatched input "" expecting set nil"
Is it possible to include the rules verb_the_ingredient and liquefy_ingredient without making them conflict with the other rules? The former will actually conflict with everything else I guess, and the latter will conflict with liquefy_mixing_bowl. What would be the best way to deal with such a nasty grammar?
By the way, I haven't set the WS token (whitespace such as space and tab) to the ignore channel because, since an ingredient can consist of one or more words (such as dijon mustard or just zucchinis), I found it easier to specify the grammar using the WS token as a separator.
Also, running the antlr4ruby command to generate the parsers/lexers code shows no warnings at all.
Any tips, hints, or enlightenment would be really appreciated here :)
Thanks in advance.

Code duplication refactoring tool for VB

I need a very specific tool for VB (or multi-language). I thought I would ask if one already exists before I start making one myself (probably in Python).
What I need:
The tool must crawl a path, recursively or not, searching for a list of extensions, such as .bas, .frm, .xxx
Then it has to parse those files, searching for functions, routines, etc.
And finally, it must output what it found.
I based this on the idea of "reducing code redundancy", in a scenario where bad programmers make a lot of functions that do the same thing, sometimes with the same name, sometimes not. There are 4 cases:
Case 1: Same name, Same content.
Case 2: Same name, Diff content.
Case 3: Diff name, Same content.
Case 4: Diff name, Diff Content.
So the output should be something like this:
===========================================================================
RESULT
===========================================================================
Errors:
---------------------------------------------------------------------------
==Name, ==Content --> 3: (Func(), Foo(), Bar()) In files (f,f2,f3)
!=Name, ==Content --> 2: (Func() + Func1(), Bar() + Bar1()) In Files (f4)
---------------------------------------------------------------------------
Warnings:
==Name, !=Content --> 1 (Foobar()) In Files (f19)
---------------------------------------------------------------------------
This is to give you an idea of what I need.
So, the question is: is there any tool that accomplishes something similar to this?
P.S.: Yes, we should write good code in the first instance, but, you know, stuff happens.
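A minimal sketch of the crawl/parse/report pipeline described above, in Python (as the poster suggests). The VB6-style Function/Sub regex, the extension list and the "normalize whitespace and case" test for "same content" are simplifying assumptions; the answer below explains why a real tool needs to do more than this.

import os
import re
from collections import defaultdict

# Matches a VB6-style Function/Sub header through its matching End line;
# group(1) is the name, group(2) the rest (signature + body).
PROC = re.compile(
    r'^[ \t]*(?:Public\s+|Private\s+)?(?:Function|Sub)\s+(\w+)(.*?)^[ \t]*End\s+(?:Function|Sub)',
    re.S | re.M | re.I)

def normalize(body):
    # Collapse whitespace and case so trivially reformatted copies compare equal.
    return re.sub(r'\s+', ' ', body).strip().lower()

def scan(root, extensions=('.bas', '.frm', '.cls')):
    by_name = defaultdict(set)     # proc name       -> set of files
    by_body = defaultdict(set)     # normalized body -> set of (name, file)
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            if not fname.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, fname)
            text = open(path, errors='replace').read()
            for match in PROC.finditer(text):
                name = match.group(1).lower()
                by_name[name].add(path)
                by_body[normalize(match.group(2))].add((name, path))
    return by_name, by_body

if __name__ == '__main__':
    by_name, by_body = scan('.')
    for body, procs in by_body.items():            # Case 1 and Case 3: identical content
        if len(procs) > 1:
            print('Same content:', sorted(procs))
    for name, files in by_name.items():            # Case 2: same name reused (content may differ)
        if len(files) > 1:
            print('Same name:', name, sorted(files))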
What you want is a "clone detector". These tools find copy-and-pasted code across a large set of designated files. Clones are not just of functions; they can be code blocks, data declarations, etc.
There are a variety of detectors out there, and you should know how they work before you attempt to build one of your own.
Some simply match lines for exact equivalence. While these demonstrate the basic idea, their detection is not good because they don't take into account the fact that cloned code often has variations; what people really do is clone-and-edit when making copies.
Some match sequences of language tokens, e.g., identifiers, keywords, literals, punctuation. These are at least relatively tolerant of whitespace changes, and they can find clones in which single tokens have been substituted for single tokens. However, because they don't understand language structure (blocks, statements, function bodies), they often match sequences that cross such structure boundaries (e.g., "} {" is often considered a clone by these tools), so they produce a rather high rate of false-positive clone indications. Some of these attempt to limit the matches to key program structures, such as complete functions, as you have kind of suggested.
More sophisticated detectors match program structures.
Our CloneDR (I'm the original author) is a detector that uses compiler-quality parsing to abstract syntax trees, which extracts the precise structure of the code. It does this for many languages (including VB6 and VBScript), locating clones as arbitrary functions, blocks, statements or declarations, with parameters showing how the clones vary. CloneDR can find clones in spite of formatting changes, changes in comment locations or content, and even variations where complex constructs (multiple statements or expressions) have been used as alternatives to simple ones (e.g., a single statement or a literal). While it tends to have a high detection rate (it usually finds 10-20% removable redundancy!), its false-positive rate tends to be considerably lower than the token-based detectors. You can see sample reports for a variety of different languages at the link above.
See Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach which explicitly discusses different approaches and benefits, and compares a large number of detectors including CloneDR.
EDIT October 2010: ... When I first wrote this response, I assumed the OP was interested in VB.net, which CloneDR didn't do. We've since added VB.net, VB6 and VBScript capability to CloneDR. (Parsing VB.net in its modern form is a lot messier than one might imagine for a "simple"(!) language like Visual Basic.)

Oracle Coding Standards Feature Implementation

Okay, I have reached a sort of an impasse.
In my open source project, a .NET-based Oracle database browser, I've implemented a bunch of refactoring tools. So far, so good. The one feature I was really hoping to implement was a big "Global Reformat" that would make the code (scripts, functions, procedures, packages, views, etc.) standards compliant. (I've always been saddened by the lack of decent SQL refactoring tools, and wanted to do something about it.)
Unfortunately, I am discovering, much to my chagrin, that there doesn't seem to be any one widely used or even "generally accepted" standard for PL/SQL. That kind of puts a crimp on my implementation plans.
My search has been fairly exhaustive. I've found lots of conflicting documents, threads and articles and the opinions are fairly diverse. (Comma placement, of all things, seems to generate quite a bit of debate.)
So I'm faced with a couple of options:
Add a feature that lets the user customize the standard and then reformat the code according to that standard.
—OR—
Add a feature that lets the user customize the standard and simply generate a violations list like StyleCop does, leaving the SQL untouched.
In my mind, the first option saves the end-users a lot of work, but runs the risk of modifying SQL in potentially unwanted ways. The second option runs the risk of generating lots of warnings and doing no work whatsoever. (It'd just be generally annoying.)
In either scenario, I still have no standard to go by. What I'd need to know from you guys is kind of poll-ish, but kind of not. If you were going to use a tool of this nature, what parts of your SQL code would you want it to warn you about or fix?
Again, I'm just at a loss due to a lack of a cohesive standard. And given that there isn't anything out there that's officially published by Oracle, I think this is something the community could weigh in on. Also, given the way that voting works on SO, the votes would help to establish the popularity of a given "refactoring."
P.S. The engine parses SQL into an expression tree so it can robustly analyze the SQL and reformat it. There should be quite a bit that we can do to correct the format of the SQL. But I am thinking that for the first release of the thing, layout is the primary concern. Though it is worth noting that the thing already has refactorings for converting keywords to upper case, and identifiers to lower case.
PL/SQL is an Ada derivative; however, Ada's style guide is almost as gut-twistingly disgusting as the one most "old-school" DB people prefer. (The one where you have to think their caps lock key got stuck pretty badly.)
Stick with what you already know from .Net, which means sensible identifiers, without encrypting/compressing half the database into 30 chars.
You could use a dictionary and split camel-cased or underscored identifier parts and check if they are real words. Kinda like what FxCop does.
Could be a bit annoying, though, since the average Oracle database has the most atrocious and inconsistent naming guidelines, ones that were obsolete even 30 years ago.
So I don't think you'll reach the goal of getting clean identifiers everywhere in your projects (or your users').
Since PL/SQL is case insensitive and columns are preferred over equally named local vars, you'll have to make even more tradeoffs. You can take parts of the style guides of other Pascal derivatives (Ada is based on Modula, which is based on Pascal), like Delphi, which feel a bit closer to home for PL/SQL (I use a mixture of .Net & Delphi).
Especially the "aPrefix" for parameters can be a life saver, because you won't collide with column names that way:
subtype TName is SomeTable.Name%type;
subtype TId is SomeTable.Id%type;

function Test(aName in TName) return TId is
  result TId;
begin
  SELECT t.Id
    INTO result
    FROM SomeTable t
   WHERE t.Name = aName;
  return result;
exception
  when No_Data_Found then
    return null;
end;
Without the prefix, Oracle would always pick the column "Name" and not the parameter "Name". (Which is pretty annoying, since columns can be qualified with an alias...)
I configured my PL/SQL Developer to make all keywords lowercase; however, I made the ones that are used in plain SQL uppercase (SELECT, WHERE, etc.).
As a result, the SQL statements stick out of the code, but not all my code has to be brutalized by all-uppercase keywords. (They are highlighted anyway, so what's with the all-uppercase fetish? ;-) )
If your tool is capable of identifying plain SQL and giving some visual clue, then even the SQL keywords wouldn't need to have a different casing.
By the way, I'd love to take a look at it. Can you post a URL, or is it still "under cover"?
Cheers,
Robert
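A toy Python sketch of the casing convention just described (uppercase the plain-SQL keywords, lowercase the PL/SQL-only ones). The keyword lists are deliberately incomplete and only single-quoted literals are protected; a real formatter, like the expression-tree-based engine the question describes, would work from a proper token stream.

import re

SQL_KEYWORDS   = {'select', 'from', 'where', 'into', 'join', 'on', 'group', 'by',
                  'order', 'insert', 'update', 'delete', 'values', 'and', 'or'}
PLSQL_KEYWORDS = {'begin', 'end', 'declare', 'if', 'then', 'else', 'elsif', 'loop',
                  'exception', 'when', 'function', 'procedure', 'return', 'is', 'as'}

TOKEN = re.compile(r"('[^']*')|(\w+)")   # quoted literal or word

def recase(code):
    def fix(match):
        literal, word = match.group(1), match.group(2)
        if literal:                      # leave string literals untouched
            return literal
        lower = word.lower()
        if lower in SQL_KEYWORDS:
            return word.upper()
        if lower in PLSQL_KEYWORDS:
            return word.lower()
        return word                      # identifiers stay as written
    return TOKEN.sub(fix, code)

print(recase("begin SELECT t.Id into result from SomeTable t where t.Name = 'END'; end;"))
# -> begin SELECT t.Id INTO result FROM SomeTable t WHERE t.Name = 'END'; end;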
TOAD has a "pretty printer" and uses a ton of options to give the user some say in what is done. (But it has gotten so complicated that I still can't manage to get the results I would like.)
For me, some options look downright horrible, but it seems that some people like them. A sensible default should be okay for 80% of the time, but as this is an issue of religious wars, I'm sure you can spend a totally unreasonable amount of time for pretty small results. I'd suggest coding some things to handle the 10-year-old sp you mentioned, and including something like a <pre> tag that the pretty printer leaves alone.
I like the "standard" Of Tom Kyte (in his books). That means everything in lowercase. Most easy for the eyes.
If all you're doing is rearranging whitespace to make the code look consistently clean, then there's no risk of changing SQL results.
However, as an Oracle/PLSQL developer for the past 8 years, I can almost guarantee I wouldn't use your tool no matter how many options you give it. Bulk reformatting of code sounds great in principle, but then you've totally destroyed its diffability in version control between revisions prior to and after the reformat.
