Grammar for the Chef Language - antlr3

I'm just starting to use ANTLR, with ANTLR for Ruby. The version is 3.2.1.
I'm trying to create a parser for the Chef language, and the grammar is giving me a real headache :P I'm sure I'm missing some fundamental concept, but I just can't figure out what.
I created 3 grammars. The main one is the recipe parser, which (of course) parses the recipes. Once a recipe is parsed, I use the other 2 grammars, which parse ingredients and instructions (the method section).
My problem is with the last one, the one that parses the instructions, such as "put ... into the mixing bowl", "liquefy ...", etc. Everything works great except for a few rules. I've posted the Instructions.g source on Pastebin because of its length.
Here's what's happening:
When I uncomment the rules combine_ingredient_into_mixing_bowl or divide_ingredient_into_mixing_bowl, the parser stops recognizing almost all of the other rules (such as put_ingredient_into_mixing_bowl). This seems strange to me, because they don't seem to override each other (though of course they must, somehow). I get the error: "line 0:-1 mismatched input "" expecting WS"
stir_mixing_bowl does not match anything, but it's really no different from the other rules that do work ok. I get the error: "line 0:-1 mismatched input "" expecting set nil"
Is it possible to include the rules verb_the_ingredient and liquefy_ingredient without making them conflict with the other rules? The former will actually conflict with everything else I guess, and the latter will conflict with liquefy_mixing_bowl. What would be the best way to deal with such a nasty grammar?
By the way, I haven't sent the WS tokens (whitespace such as space and tab) to the ignore channel, because an ingredient can consist of one or more words (such as Dijon mustard or just zucchinis), and I found it easier to specify the grammar using the WS token as a separator.
Also, running the antlr4ruby command to generate the parser/lexer code shows no warnings at all.
Any tips, hints, or enlightenment are really appreciated here :)
Thanks in advance.


Did Ruby deprecate the wrong File exists method?

The docs of File.exist? say:
Return true if the named file exists.
Note the last word used: "exists". This is correct. "File exist" without the ending s is not correct.
The method File.exists? exists, but it has been deprecated. I think it should have been the other way around. What am I missing?
Also, it's noteworthy that other languages/libraries use exists, for example Java and .NET.
Similarly, one says "this equals that", but Ruby uses equal?, again dropping the ending s. I get the feeling that Ruby is actively walking in a different direction from the mainstream. But then there has to be a reason?
This is largely a subjective call. Do you read the call as "Does this file exist?" or "File exists"? Both readings have their merits.
Historically Ruby has had a lot of aliased methods like size vs. length, but lately it seems like the core team is trying to focus on singular, consistent conventions that apply more broadly.
You'd have to look closely at the conversations on the internal mailing list surrounding the decisions here. I can't find them easily, only people dealing with the changes as deprecation warnings pop up.
The Ruby core team is a mix of people who speak different languages, but the primary language is Japanese, so perhaps that's guiding some of these decisions. It could be a preference for avoiding odd inflections on verbs.
I agree that if File.exists?('x.txt') reads much more naturally than the plural form, which was probably Matz's intention for the alias. As far as I'm concerned, this particular deprecation was misguided.
However, the general preference for the plural form may well be rooted in handling enumerables/collections, where the plural makes sense when used with idioms like this:
pathnames.select(&:exist?)
exist? matches the convention used elsewhere throughout the stdlib, and goes back to the early days of Ruby. For example, array.include? (not includes?), string.match? (not matches?), object.respond_to? (not responds_to?). So in this light, File.exists? was always a blemish.
Some recommend that you read the dot as "does". So "if file does exist," "if array does include," "if string does match," etc.
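To make the convention concrete, here is a small Ruby sketch (assuming Ruby 2.4 or later for String#match?; the file name is just a placeholder):
# Core predicate methods read as questions built on the bare, s-less verb.
File.exist?("notes.txt")      # not File.exists? (the alias is deprecated)
[1, 2, 3].include?(2)         # not includes?
"chef".match?(/ch/)           # not matches?
"chef".respond_to?(:upcase)   # not responds_to?

# With a plural receiver, the s-less form also reads naturally:
require "pathname"
Pathname.glob("*.txt").select(&:exist?)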

Adding keyword commands and functions to Textmate 2 bundle syntax

I want to add some additional syntax highlighting definitions to an existing bundle, but I need some general advice on how to do this. I'm not building a syntax from scratch, and my request is pretty simple, but it involves some subtleties on which I find the manual somewhat impenetrable.
Basically, I'm trying to fill out the syntax definitions for the Stata bundle. It's great, but there is no built-in support for automatically highlighting the base commands and the installed functions, only a handful of basic control statements. Stata is a language which is primarily used by calling lots of different high-level pre-defined commands, like command foo bar, options(). The convention is that these command calls are highlighted.
There are a ton of these commands, and stubs which are used for convenience. Just the base install has almost 3500. Even optimizing them using the bundle helper, which obviously gets rid of the stub issue, still yields a massive regex list. I can easily cut this down to fewer than 1000 important ones, but it's still a lot. There are also 350 "functions" which I would like to match with the syntax function().
I essentially have 3 questions:
Am I creating a serious problem by including a very comprehensive list of matching definitions?
How do I restrict a command to only highlight when it either begins a line or there is only whitespace between the beginning of the line and the command?
What is the preferred way of restricting the list of functions() to only highlight when they have attached parentheses? (A sketch of both ideas follows below.)
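For what it's worth, here is a rough sketch of those two anchoring ideas, tried out with plain Ruby regexes (Ruby's Onigmo engine is close to TextMate's Oniguruma); the command and function names are only placeholders, not a real Stata list:
# Hypothetical patterns; a real bundle would carry these in its grammar's match rules.
COMMAND_RE  = /^\s*(?:regress|summarize|generate)\b/   # command only at the start of a line (indentation allowed)
FUNCTION_RE = /\b(?:min|max|strpos)(?=\s*\()/          # function only when followed by an opening parenthesis

puts COMMAND_RE.match?("  summarize price")    # true  -- command at line start
puts COMMAND_RE.match?("display summarize")    # false -- not at line start
puts FUNCTION_RE.match?("gen m = min(a, b)")   # true  -- parentheses attached
puts FUNCTION_RE.match?("local min = 2")       # false -- bare word, no call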

How to create AST parser which allows syntax errors?

First, what should I read about parsing and building ASTs?
How do I create a parser for a language (like SQL) that will build an AST and allow for syntax errors?
For example, for "3+4*5":
  +
 / \
3   *
   / \
  4   5
And for "3+4*+", which has a syntax error, the parser would guess that the user meant:
  +
 / \
3   *
   / \
  4   +
     / \
    ?   ?
Where to start?
SQL:
     SELECT_________________
    /      \                \
   .       FROM             JOIN
  / \        |              /   \
 a   city_name  people  address  ON
                                  |
                                  =______________
                                 /               \
                               .____              .
                              /     \            / \
                             p   address_id     a   id
The standard answer to the question of how to build parsers (that build ASTs), is to read the standard texts on compiling. Aho and Ullman's "Dragon" Compiler book is pretty classic. If you haven't got the patience to get the best reference materials, you're going to have more trouble, because they provide theory and investigate subtleties. But here is my answer for people in a hurry, building recursive descent parsers.
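To make the recursive descent route concrete, here is a minimal Ruby sketch for the "3+4*5" example from the question, with nested arrays standing in for AST nodes (a toy grammar, not tied to SQL):
# Grammar:  expr : term ('+' term)* ;  term : factor ('*' factor)* ;  factor : NUMBER ;
class ExprParser
  def initialize(input)
    @tokens = input.scan(/\d+|[+*]/)   # trivial lexer: integers and the two operators
    @pos = 0
  end

  def parse
    expr
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  def expr
    node = term
    while peek == "+"
      advance
      node = [:+, node, term]
    end
    node
  end

  def term
    node = factor
    while peek == "*"
      advance
      node = [:*, node, factor]
    end
    node
  end

  def factor
    token = advance
    raise "syntax error: expected a number, got #{token.inspect}" unless token =~ /\A\d+\z/
    token.to_i
  end
end

p ExprParser.new("3+4*5").parse   # => [:+, 3, [:*, 4, 5]]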
One can build parsers with built-in error recovery. There are many papers on this sort of thing, a hot topic in the 1980s. Check out Google Scholar, hunt for "syntax error repair". The basic idea is that the parser, on encountering a parsing error, skips to some well-known beacon (";", a statement delimiter, is pretty popular for C-like languages, which is why you were asked in a comment if your language has statement terminators), or proposes various input stream deletions or insertions to climb over the point of the syntax error. The sheer variety of such schemes is surprising. The key idea is generally to take into account as much information around the point of error as possible. One of the most intriguing ideas I ever saw had two parsers, one running N tokens ahead of the other, looking for syntax-error land-mines, and the second parser being fed error repairs based on the N tokens available before it encounters the syntax error. This lets the second parser choose to act differently before arriving at the syntax error. If you don't have this, most parsers throw away left context and thus lose the ability to repair. (I never implemented such a scheme.)
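Here is a rough Ruby sketch of the beacon idea (panic-mode recovery), with ";" as the beacon; the token list and error position are made up for illustration:
# On a syntax error, skip tokens until the ";" beacon and resume just past it.
def skip_to_beacon(tokens, pos, beacon = ";")
  pos += 1 until pos >= tokens.length || tokens[pos] == beacon
  pos + 1
end

tokens = ["x", "=", "3", "+", "*", ";", "y", "=", "4", ";"]
error_at = 4                                   # suppose the parser choked on the stray "*"
resume_at = skip_to_beacon(tokens, error_at)
puts "resuming at token #{resume_at} (#{tokens[resume_at]})"   # => resuming at token 6 (y)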
The choice of things to insert can often be derived from information used to build the parser (often First and Follow sets) in the first place. This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (oops, there's that compiler book again) of how the parsers are constructed. (I have implemented this scheme successfully several times).
Regardless of what you do, syntax error repair doesn't help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests fancy schemes won't be really helpful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.
A real problem with rolling your own parser for a real language is that real languages are nasty, messy things; people building real implementations get it wrong, and it gets frozen in stone because of existing code bases, or they insist on bending/improving the language (standards are for wimps, goodies are for marketing) because it's cool. Expect to spend a lot of time re-calibrating what you think the grammar is against the ground truth of real code. As a general rule, if you want a working parser, it is better to get one with a track record than to roll your own.
A lesson most people who are hell-bent on building a parser don't get is that if they want to do anything useful with the parse result or tree, they'll need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".
There are two things the parser could do:
Report the error and have the user try again.
Repair the error and proceed.
Generally speaking the first one is easier (and safer). There may not always be enough information for the parser to infer the intent when the syntax is wrong. Depending on the circumstances, it may be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.
I've written a few hand-rolled recursive descent parsers for little languages. When writing code to interpret the grammar rules explicitly (as opposed to using a parser-generator), it's easy to detect errors, because the next token doesn't fit the production rule. Generated parsers tend to spit out a simplistic "expected $(TOKEN_TYPE) here" message, which isn't always useful to the user. With a hand-written parser, it's often easy to give a more specific diagnostic message, but it can be time consuming to cover every case.
If your goal is to report the problem but keep parsing (so that you can see if there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
You then have to resync to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you might look for a token, like ';', that separates statements. The correct token(s) to look for depends on the language you're parsing. Another possibility is to guess what the user meant (e.g., infer an extra token or a different token at the point the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you could backtrack, make a different guess, and try again.
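A small Ruby sketch of the error-node idea, reusing the nested-array node style from the expression-parser sketch above: instead of aborting, the parser records the problem and returns a placeholder so the tree stays well-formed:
def parse_factor(tokens, pos, errors)
  token = tokens[pos]
  if token =~ /\A\d+\z/
    [token.to_i, pos + 1]
  else
    errors << "expected a number, got #{token.inspect}"
    [[:error, token], pos + 1]   # placeholder node keeps the tree from falling apart
  end
end

errors = []
node, _next_pos = parse_factor(["+"], 0, errors)
p node     # => [:error, "+"]
p errors   # => ["expected a number, got \"+\""]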

Building a syntax checker

I am building an app, like a compiler, with my own scripting language. The user will enter the code and the output will be another app.
So I need to tell the user if some line is wrong and why.
But I don't know how to start.
I thought this:
All lines will start with a keyword, except for those that start with a variable. So anything different from that is wrong.
So, I can calculate the next valid entries and check them.
Also, I thought that I could check each line, but it's complex, because I can have this:
var varName { /* ... */ };
Or
var varName {
/* ... */
};
Or Even
var varName
{
/* ... */
};
So why not remove the line breaks and check? Because I would lose the line numbers, which in this case are the most important thing.
Maybe I'm going to create a map between the code with and without line breaks.
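For what it's worth, here is a rough Ruby sketch of the simpler alternative: tokenize line by line and record the line number with every token, so layout can be ignored later without losing positions (the token kinds are just placeholders):
Token = Struct.new(:type, :text, :line)

TOKEN_RE = /(?<keyword>\bvar\b)|(?<ident>[A-Za-z_]\w*)|(?<punct>[{};])/

def tokenize(source)
  tokens = []
  source.each_line.with_index(1) do |line, lineno|
    line.scan(TOKEN_RE) do
      match = Regexp.last_match
      type = match.names.find { |name| match[name] }   # which named group matched
      tokens << Token.new(type.to_sym, match[0], lineno)
    end
  end
  tokens
end

src = "var varName\n{\n};\n"
tokenize(src).each { |t| puts "line #{t.line}: #{t.type} #{t.text}" }
# line 1: keyword var
# line 1: ident varName
# line 2: punct {
# line 3: punct }
# line 3: punct ;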
But first I want to hear from you, in case you already have experience with this or have any ideas.
Thanks
There are formal languages for describing the syntax and semantics of a language, and there are tools that will generate parsers from those descriptions. I suggest reading up on flex and bison for starters.
It'll be fairly complicated to write your own language. But totally doable.
To be able to recognize whether a line is wrong in the syntactic sense, you'd need to build a parser.
The parser checks, against the context-free grammar, that there is a correct derivation of a structure from its tokens.
First you need to tokenize the file, then reconstruct it into a parse tree (to check syntax).
I took a class on this, CS 241. There's a very nice set of course notes in which this is all explained in detail:
https://github.com/christhomson/lecture-notes/blob/master/cs241.pdf
You should check out tools like lex, bison, and yacc.
lex is a lexical analyser generator. It generates code which can be used for breaking the script into tokens (like numbers, keywords and so on...).
bison and yacc are both parser generators. Both can be used to generate code for parsing your language (combining tokens into statements).
Just google tutorials for those tools.

What is an example of a lexical error and is it possible that a language has no lexical errors?

For our compiler theory class, we are tasked with creating a simple interpreter for a programming language of our own design. I am using JFlex and CUP as my generators, but I'm a bit stuck on what a lexical error is. Also, is it recommended that I use the state feature of JFlex? It feels wrong, as it seems like the parser is better suited to handling that aspect. And do you recommend any other tools for creating the language? I'm sorry if I'm impatient, but it's due on Tuesday.
A lexical error is any input that can be rejected by the lexer. This generally results from token recognition falling off the end of the rules you've defined. For example (in no particular syntax):
[0-9]+ ===> NUMBER token
[a-zA-Z] ===> LETTERS token
anything else ===> error!
If you think about a lexer as a finite state machine that accepts valid input strings, then errors are going to be any input strings that do not result in that finite state machine reaching an accepting state.
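Here is a tiny Ruby sketch of exactly that rule set (numbers, letter runs, skipped whitespace; anything else falls off the end of the rules and becomes a lexical error):
require "strscan"

def lex(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\s+/)
      # skip whitespace
    elsif (text = scanner.scan(/[0-9]+/))
      tokens << [:NUMBER, text]
    elsif (text = scanner.scan(/[a-zA-Z]+/))
      tokens << [:LETTERS, text]
    else
      raise "lexical error: unexpected character #{scanner.getch.inspect} at offset #{scanner.pos - 1}"
    end
  end
  tokens
end

p lex("abc 123")    # => [[:LETTERS, "abc"], [:NUMBER, "123"]]
lex("abc # 123")    # raises: lexical error: unexpected character "#" at offset 4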
The rest of your question was rather unclear to me. If you already have some tools you are using, then perhaps you're best off learning how to achieve what you want to achieve using those tools (I have no experience with either of the tools you mentioned).
EDIT: Having re-read your question, there's a second part I can answer. It is possible that a language could have no lexical errors: it's the language in which any input string at all is valid input.
A lexical error could be a character that is invalid or unacceptable in the language, like '#', which is rejected as a lexical error for identifiers in Java (it's reserved).
Lexical errors are the errors thrown by your lexer when it is unable to continue, which means that there's no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, will be thrown by your parser when a given set of already recognised valid tokens doesn't match any of the right-hand sides of your grammar rules.
it feels wrong as it seems like the parser is better suited to handling that aspect
No. It seems that way because context-free languages include regular languages (meaning that a parser can do the work of a lexer). But consider that a parser is a stack automaton, and you would be employing extra computing resources (the stack) to recognise something that doesn't require a stack to be recognised (a regular expression). That would be a suboptimal solution.
NOTE: by regular expression, I mean... regular expression in the Chomsky Hierarchy sense, not a java.util.regex.* class.
A lexical error is when the input doesn't belong to any of these lists:
key words: "if", "else", "main"...
symbols: '=','+',';'...
double symbols: ">=", "<=", "!=", "++"
variables: [a-zA-Z]+[0-9]*
numbers: [0-9]*
Examples: 9var is an error (a number before characters is neither a variable nor a keyword).
$: error.
What I don't know is whether more than one symbol in a row is accepted, like "+-".
A compiler can catch an error when it has the grammar in it!
Whether it has the capacity (scope) to catch lexical errors or not depends on the compiler itself.
What types of lexical error there are, and how they are going to be handled (according to the grammar), is decided during the development of the compiler.
Usually all well-known and widely used compilers have this capability.
