Hi, I just started working with the lex and yacc tools.
I realized that yyerror receives only the string "syntax error" from yacc.
I was wondering if I can customize this string.
Also, can I differentiate between different types of errors? (I'm trying to report a missing token and an additional token as different errors.)
If so, how should I do it?
Many thanks.
You're free to print any message you want to in yyerror (or even no message at all), so you can customise messages as you see fit. A common customisation is to add the line number (and possibly column number) of the token which triggered the error. You can certainly change the text if you want to, but if you just want to change it to a different language, you should probably use the gettext mechanism. You'll find .po files in the runtime-po subdirectory of the source distribution. If this facility is enabled, bison will arrange for the string to be translated before it is passed to yyerror, but of course you could do the translation yourself in yyerror if that is more convenient for you.
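As a minimal sketch of the line-number customisation, assuming the scanner maintains a global `yylineno` (flex can do this with `%option yylineno`; it is defined here only so the example compiles), and with `format_error` being a hypothetical helper name:

```c
#include <stdio.h>

/* Assumption: the scanner keeps this up to date; it is defined here
 * only so that the sketch compiles on its own. */
int yylineno = 1;

/* Hypothetical helper: build "line N: message" into buf.
 * Returns whatever snprintf returns. */
int format_error(char *buf, size_t size, int line, const char *msg) {
    return snprintf(buf, size, "line %d: %s", line, msg);
}

/* Called by the generated parser with the text of the error. */
void yyerror(const char *msg) {
    char buf[256];
    format_error(buf, sizeof buf, yylineno, msg);
    fprintf(stderr, "%s\n", buf);
}
```

Keeping the formatting in a separate function makes it easy to check the message text without running a parser.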
I suspect that what you actually want is for bison to produce a more informative error message. Bison only has one alternative error message format, which includes a list of "expected" tokens. You can ask Bison to produce such an error message by including
%define parse.error verbose
in the declarations section of your grammar file. As the manual indicates, the bison parsing algorithm can sometimes produce an incorrect list of expected tokens (since it was not designed for this particular purpose); you can get a more precise list by enabling lookahead correction by also including
%define parse.lac full
This does have a minor performance penalty. See the linked manual section for details.
The list of tokens produced by this feature uses the name of the token as supplied in the bison file. These names are usually not very user-friendly, so you might find yourself generating error messages such as the infamous PHP error
syntax error, unexpected T_CONSTANT_ENCAPSED_STRING
(Note: more recent PHP versions produce a different but equally mysterious message.)
To avoid this, define double-quoted aliases for your tokens. This can also make your grammar a lot more readable:
%token <string> TOK_ID "identifier"
%token TOK_IF "if" TOK_ELSE "else" TOK_WHILE "while"
%token TOK_LSH "<<"
/* Etc. */
%%
stmt: expr ';'
| while
| if
| /* ... */
while: "while" '(' expr ')' stmt
expr: "identifier"
| expr "<<" expr
/* ... */
The quoted names will not be passed through gettext. That's appropriate for names which are keywords, but it might be desirable to translate descriptive token aliases. A procedure to do so is outlined in this answer.
I've written a custom printf implementation for an embedded environment. In this effort I have also added some additional specifiers for printing unique types and timestamps, among other things:
printf("[%T] time\n");
Perhaps this one is unusual in that it does not take any arguments, but rather has special handling in the parser. This could easily be remedied by using a macro that always passes a dummy argument. Or, if it causes too much trouble, I could make it more distinctive so that it does not look like other specifiers that do take arguments.
All other custom types I have implemented take arguments as usual. The only trouble I'm having is with the compiler's warnings:
test.c:130:32: warning: unknown conversion type character ‘v’ in format [-Wformat=]
printf("%v\n", type);
^
test.c:130:16: error: too many arguments for format [-Werror=format-extra-args]
printf("%v\n", type);
^~~~~~~~~~~~~~~~~~~~
These warnings can be suppressed by adding -Wno-format-extra-args -Wno-format compiler arguments, which I do (for now). But this can mask genuine errors like passing integers where pointers are expected, or legitimately not providing enough arguments for a given specifier list.
Is it possible to add new semantic checks to printf-style functions?
I'm trying to write a lexical analyzer for Scheme, and when I run JFlex on the lexer.flex file I get an error like this one:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
The macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
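The division of labour can be sketched as follows. This is not JFlex; it is a small hand-written C illustration with hypothetical names (`next_token`, `parse_datum`, `datum_depth`), assuming a toy language of atoms and parentheses. The lexer only produces flat tokens, while the recursion needed for nested forms such as `(begin (begin ...))` lives in the parser.

```c
#include <ctype.h>

/* Hypothetical token kinds for a tiny S-expression language. */
enum token { TOK_LPAREN, TOK_RPAREN, TOK_ATOM, TOK_END };

/* Lexer: return the next token and advance *src past it.
 * It recognises flat tokens only and knows nothing about nesting. */
enum token next_token(const char **src) {
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;
    if (*p == '\0') { *src = p; return TOK_END; }
    if (*p == '(') { *src = p + 1; return TOK_LPAREN; }
    if (*p == ')') { *src = p + 1; return TOK_RPAREN; }
    while (*p && !isspace((unsigned char)*p) && *p != '(' && *p != ')') p++;
    *src = p;
    return TOK_ATOM;
}

/* Parser: datum = ATOM | '(' datum* ')'.
 * tok is the already-read first token of the datum.
 * Returns the nesting depth of the datum, or -1 on a syntax error. */
int parse_datum(enum token tok, const char **src) {
    if (tok == TOK_ATOM) return 0;
    if (tok != TOK_LPAREN) return -1;
    int deepest = 0;
    for (;;) {
        enum token t = next_token(src);
        if (t == TOK_RPAREN) return deepest + 1;
        if (t == TOK_END) return -1;      /* missing ')' */
        int d = parse_datum(t, src);      /* recursion happens here */
        if (d < 0) return -1;
        if (d > deepest) deepest = d;
    }
}

/* Convenience wrapper: depth of the first datum in text, or -1. */
int datum_depth(const char *text) {
    enum token first = next_token(&text);
    return parse_datum(first, &text);
}
```

For example, `datum_depth("(begin (begin (define y 2)))")` walks three levels of parentheses, something no single regular expression could describe for arbitrary depth.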
R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, vertical bars delimit a symbol, as in |..ident..|, which allows any character that you cannot insert in an old-style symbol.
However, in R6RS the "official" grammar was incorrect, as it did not allow symbols such as 1+ to be defined, which led every implementation to define its own rules to work around this flaw in the official grammar.
Unless you need to read the source code of a given implementation to see how it defines symbols, you should not care too much about these rules; just use classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt that implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom and after that it will classify an atom by backtracking with read-number and if number? fails it will be a symbol.
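A rough sketch of that classification step in C, assuming the atom has already been read up to a delimiter. The names `classify_atom`, `ATOM_NUMBER` and `ATOM_SYMBOL` are hypothetical, and `strtod` is only a stand-in for a real `read-number`: it accepts C number syntax (hexadecimal forms, for instance), not Scheme's.

```c
#include <ctype.h>
#include <stdlib.h>

enum atom_kind { ATOM_NUMBER, ATOM_SYMBOL };

/* Try to read the whole atom as a number; if that fails, call it a symbol. */
enum atom_kind classify_atom(const char *atom) {
    int has_digit = 0;
    for (const char *p = atom; *p; p++) {
        if (isdigit((unsigned char)*p)) has_digit = 1;
    }
    if (has_digit) {
        char *end;
        (void)strtod(atom, &end);
        /* A number only if strtod consumed every character of the atom. */
        if (end != atom && *end == '\0') return ATOM_NUMBER;
    }
    return ATOM_SYMBOL;
}
```

With this approach `42` and `-1.5e3` come out as numbers, while `1+`, `+` and `a#a` fall through to symbols because the number reader does not consume the whole atom.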
I'm compiling my game engine code on VS2015 and Xcode using gcc.
I used many compilers for my code and only gcc shows the warning:
warning : missing terminating " character
The code looks like the following:
#if 1
...
#else
asm __volatile__("
some assembly code
...
"::: );
#endif
I know that recent versions of gcc do not accept newlines inside a string, but I don't know why the gcc preprocessor emits a warning for a false block.
I know my style of writing inline assembly is old, but these blocks are inside a false conditional block. I don't want to touch them because there are so many.
How can I avoid this warning in a false conditional block without suppressing all warnings?
Edit:
I compiled my code with Armcc, Vc++(2005,2008,2012,2013,2015) and clang. They don't show this kind of warning, only GCC does.
If the warnings were for code in the TRUE conditional block, I would fix them. But these warnings are for FALSE conditional blocks, which should not be evaluated.
The C preprocessor is string-literal aware. Per the language grammar, a string literal is one of the preprocessing tokens:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
This is, of course, perfectly expected, since otherwise the preprocessor wouldn't be able to tell MACRO_NAME from "MACRO_NAME" (the latter being just a string).
For this reason, your string literals are supposed to be formatted properly for preprocessing purposes. I presume that GCC sees the first " and begins interpreting the sequence as a string literal, but then abandons the attempt (due to the missing terminating ") and files it under the generic "cannot be one of the above" category, accompanying it with a warning.
You seem to be assuming that conditional inclusion works at a higher preprocessing level than the recognition of tokens such as string literals. This assumption is incorrect. The preprocessor has no other choice but to parse the "disabled" stretch of the code as well.
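One way to keep such text well-formed at the token level, sketched here with placeholder assembly mnemonics and a hypothetical `asm_text` helper, is to close every literal on its own line and rely on the compiler concatenating adjacent string literals. The skipped block below is still split into preprocessing tokens, but each of them is a complete string literal.

```c
/* Each line is its own properly terminated literal; adjacent literals
 * are concatenated into one string by the compiler. The instructions
 * are placeholders, not taken from the original code. */
static const char *asm_text(void) {
    return "mov r0, r1\n\t"
           "add r0, r0, #1\n\t";
}

#if 0
/* Skipped by the preprocessor, yet still tokenized. Every literal is
 * terminated on the line where it starts, so there is no unterminated
 * quote for the tokenizer to complain about. */
asm __volatile__(
    "mov r0, r1\n\t"
    "add r0, r0, #1\n\t"
    ::: );
#endif
```

This does mean editing the old blocks, so it is only an option where a mechanical rewrite of them is acceptable.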
I have a bison parser that works sufficiently well for my purpose. It even prints localized error messages. But the token names are not translated. Looking at the source code, I found that I can define YY_ as my own function, which passes its argument to gettext, in order to provide my own translation of the error messages. But this does not work for token names.
Is there some switch or hidden feature that I could use to extract the token names from the parser and to translate them?
So far I found yytnamerr which could be overridden to format the token names. As it does more than just reformat names I don't like to touch this function, as I would have to sync it with the progress of Bison. On the other hand, I need also a way to extract the token names from the parser in order to add them to the language definition file.
How do you implement user friendly error reporting with Bison?
If you specify %token-table, then bison will generate the yytname table. This table includes all bison symbols, including internal symbols ($end, error and $undefined), terminals (named tokens, single-quoted characters and double-quoted strings) and non-terminals, which also include the generated names for mid-rule actions.
With yytname visible, it's easy to extract the tokens in a format recognizable by the gettext package. For example, you could add to your .y file something like this:
#ifdef MAKE_TOKEN
int main(void) {
puts("#include <libintl.h>");
puts("#include <stdio.h>");
puts("int main() {");
for (const char* const* p = yytname; *p; ++p) {
// See Note 1 below
printf(" printf(\"%%s: %%s\\n\", \"%s\", gettext (\"%s\"));\n", *p, *p);
}
puts("}");
}
#endif
and then add a stanza to your Makefile (making appropriate substitutions for file names):
messages.pot: my_parser.c
	$(CC) $(CFLAGS) -DMAKE_TOKEN -o token_lister $<
	./token_lister > my_parser.tokens.c
	# See Note 2 below
	$(CC) -o my_parser.tokens my_parser.tokens.c
	xgettext -o $@ my_parser.tokens.c
Once you have the translations, you still need to figure out how to use them, since bison does not offer an interface for inserting translated token names into its generated error messages. Probably the simplest way is to insert the translations directly into yytname by iterating through that array and substituting each token name with its translation (that would have to be done at parser startup). That presents the annoyance that yytname is declared const by the bison skeleton; however, a very simple sed or awk invocation can be used to remove the offending const. [Note 3]
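The startup substitution might look roughly like this. The array `token_names` is a stand-in for yytname with its const qualifier already removed, and `translate` is a hypothetical hook where a real program would call gettext; the German string is invented purely for illustration.

```c
#include <string.h>

/* Stand-in for bison's yytname once the offending const is gone.
 * Like the generated table, it ends with a null pointer. */
const char *token_names[] = {
    "$end", "error", "\"identifier\"", "\"while\"", 0
};

/* Hypothetical translation hook; a real program would call gettext(). */
static const char *translate(const char *name) {
    if (strcmp(name, "\"identifier\"") == 0) return "\"Bezeichner\"";
    return name;  /* untranslated names are kept as they are */
}

/* Replace every entry with its translation; run once at parser startup,
 * before any error message can be generated. */
void translate_token_names(const char **names) {
    for (const char **p = names; *p; ++p) {
        *p = translate(*p);
    }
}
```

Because only the pointers are replaced, the original strings stay untouched, and any name without a translation is simply left in place.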
Having said that, it's not at all clear to me that these automatically generated error messages are "user friendly", unless the user is surprisingly familiar with the language's formal grammar. And a user who is familiar with the grammar might well prefer the original token name, in order to find it in the grammar, rather than a non-expert translation which only coincidentally resembles the original concept. Not that I'm pointing fingers at anyone in particular.
You might enjoy this fascinating essay by Russ Cox, about how he implemented actually friendly error messages for Go.
NOTES:
1. The direct use of the token name in a C string won't work for tokens whose representation includes a " or a \. In particular, any double-quoted token (such as "and" or "<=") will fail, because the quotes are part of the name stored in yytname, as will the single-character tokens '"' and '\\'. These don't show up very often in grammars; if you're substituting internationalized keywords in your scanner, you're very unlikely to use bison's quoted string feature at all.
If you do want to use such tokens, you'll have to output code for the gettext generator which escapes " and \ characters in the token name.
2. Actually, it would be better to use several stanzas, but that one is enough to get you going, I think. You probably want to mark some or all of the intermediate results as .INTERMEDIATE. The generated executable my_parser.tokens can be used to verify the translations, but that's totally optional, so you might want to remove that line. On the other hand, it does verify that the strings are compilable.
3. See Russ Cox's gc (link provided above) for an example. His Makefile modifies the bison output to remove the const from yytname, so that the generated parser can substitute his preferred token names for error messages, so you can see the general idea at work.
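The escaping mentioned in the first note could be sketched like this; `escape_for_c_string` is a hypothetical helper, and it only handles the two characters discussed there (" and \), not other characters that might need escaping in a C literal.

```c
#include <stddef.h>

/* Copy src to dst, putting a backslash before each '"' and '\\'.
 * Returns 0 on success, or -1 if dst (dst_size bytes) is too small. */
int escape_for_c_string(const char *src, char *dst, size_t dst_size) {
    size_t n = 0;
    for (; *src; ++src) {
        int needs_escape = (*src == '"' || *src == '\\');
        /* Room is needed for this character, its escape, and the NUL. */
        if (n + (needs_escape ? 2 : 1) >= dst_size) return -1;
        if (needs_escape) dst[n++] = '\\';
        dst[n++] = *src;
    }
    if (dst_size == 0) return -1;
    dst[n] = '\0';
    return 0;
}
```

The token lister would then print the escaped copy of each name inside the generated string literals instead of the raw name.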