How to translate token names in bison - internationalization

I have a bison parser that works sufficiently well for my purpose. It even prints localized error messages. But the token names are not translated. Looking at the source code, I found that I can define YY_ to call my own gettext function in order to provide my own translation of the error messages. But this does not work for token names.
Is there some switch or hidden feature that I could use to extract the token names from the parser and to translate them?
So far I found yytnamerr which could be overridden to format the token names. As it does more than just reformat names I don't like to touch this function, as I would have to sync it with the progress of Bison. On the other hand, I need also a way to extract the token names from the parser in order to add them to the language definition file.
How do you implement user friendly error reporting with Bison?

If you specify %token-table, then bison will generate the yytname table. This table includes all bison symbols: internal symbols ($end, error and $undefined), terminals (named tokens, single-quoted characters and double-quoted strings) and non-terminals, which also include the generated names for mid-rule actions.
With yytname visible, it's easy to extract the tokens in a format recognizable by the gettext package. For example, you could add to your .y file something like this:
#ifdef MAKE_TOKEN
int main(void) {
  puts("#include <libintl.h>");
  puts("#include <stdio.h>");
  puts("int main() {");
  for (const char* const* p = yytname; *p; ++p) {
    // See Note 1 below
    printf(" printf(\"%%s: %%s\\n\", \"%s\", gettext (\"%s\"));\n", *p, *p);
  }
  puts("}");
}
#endif
and then add a stanza to your Makefile (making appropriate substitutions for file names):
messages.pot: my_parser.c
	$(CC) $(CFLAGS) -DMAKE_TOKEN -o token_lister $<
	./token_lister > my_parser.tokens.c
	# See Note 2 below
	$(CC) -o my_parser.tokens my_parser.tokens.c
	xgettext -o $@ my_parser.tokens.c
Once you have the translations, you still need to figure out how to use them, since bison does not offer an interface for inserting translated token names into its generated error messages. Probably the simplest way is to insert the translations directly into yytname by iterating through that array and substituting each token name with its translation (that would have to be done at parser startup). That presents the annoyance that yytname is declared const by the bison skeleton; however, a very simple sed or awk invocation can be used to remove the offending const. [Note 3]
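The startup substitution described above could look roughly like the following sketch. It assumes the const has already been stripped from the table; demo_yytname, demo_translate and translate_token_names are hypothetical names used only for illustration, and a real program would pass a wrapper around gettext() and the real yytname instead.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for bison's yytname after `const` has been removed from its
   declaration (assumption). Like the real table, it is NULL-terminated. */
static const char *demo_yytname[] = {
    "$end", "error", "identifier", "number", NULL
};

/* Hypothetical translator. Like gettext(), it returns the msgid pointer
   itself when no translation is available. */
static const char *demo_translate(const char *msgid) {
    if (strcmp(msgid, "identifier") == 0) return "Bezeichner";
    if (strcmp(msgid, "number") == 0) return "Zahl";
    return msgid;
}

/* Replace every entry of a NULL-terminated name table with its
   translation. Returns the number of entries that actually changed. */
static size_t translate_token_names(const char **names,
                                    const char *(*translate)(const char *)) {
    size_t replaced = 0;
    for (const char **p = names; *p != NULL; ++p) {
        const char *translated = translate(*p);
        if (translated != *p) {
            *p = translated;
            ++replaced;
        }
    }
    return replaced;
}
```

Calling translate_token_names(demo_yytname, demo_translate) once, before the first call to yyparse, is enough; the translated strings must stay alive for as long as the parser runs.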
Having said that, it's not at all clear to me that these automatically generated error messages are "user friendly", unless the user is surprisingly familiar with the language's formal grammar. And a user who is familiar with the grammar might well prefer the original token name, in order to find it in the grammar, rather than a non-expert translation which only coincidentally resembles the original concept. Not that I'm pointing fingers at anyone in particular.
You might enjoy this fascinating essay by Russ Cox, about how he implemented actually friendly error messages for Go.
NOTES:
1. The direct use of the token name in a C string won't work in the case of the tokens whose representation includes a " or a \. In particular, any keyword token ("and" or "<=") will fail, as will the single character tokens '"' and '\\'. These don't show up very often in grammars; if you're substituting internationalized keywords in your scanner, you're very unlikely to use bison's quoted string feature at all.
   If you do want to use such tokens, you'll have to output code for the gettext generator which escapes " and \ characters in the token name.
2. Actually, it would be better to use several stanzas, but that one is enough to get you going, I think. You probably want to mark some or all of the intermediate results as .INTERMEDIATE. The generated executable my_parser.tokens can be used to verify the translations, but that's totally optional, so you might want to remove that line. On the other hand, it does verify that the strings are compilable.
3. See Russ Cox's gc (link provided above) for an example. His Makefile modifies the bison output to remove the const from yytname, so that the generated parser can substitute his preferred token names for error messages, so you can see the general idea at work.
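For the escaping mentioned in Note 1, a helper along these lines might be enough. This is only a sketch: escape_c_string is a hypothetical name, and it handles just the two characters the note mentions (" and \), not control characters or other escapes.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst as the body of a C string literal, putting a
   backslash in front of every " and \ character.
   Returns the length the complete result needs (without the terminator),
   in the style of snprintf. If dst_size is too small the output is
   truncated, but it is always NUL-terminated when dst_size > 0. */
static size_t escape_c_string(char *dst, size_t dst_size, const char *src) {
    size_t needed = 0;
    for (const char *p = src; *p != '\0'; ++p) {
        if (*p == '"' || *p == '\\') {
            if (needed + 1 < dst_size) dst[needed] = '\\';
            ++needed;
        }
        if (needed + 1 < dst_size) dst[needed] = *p;
        ++needed;
    }
    if (dst_size > 0) {
        dst[needed < dst_size ? needed : dst_size - 1] = '\0';
    }
    return needed;
}
```

The token lister would then print the escaped copy of each name instead of *p. A return value of dst_size or more means the buffer was too small and the output should not be used.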

Related

What do I do in ANTLR if I want to parse something which is extremely configurable?

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
superClass = AbstractLexer;
}
DIGIT: . {isDigit(getText())}?;
// ... and so on for other tokens ...
abstract class AbstractLexer(input: CharStream, symbols: Symbols) extends Lexer(input) {
fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
// ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)
I recommend doing that in two steps:
1. Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
2. Do a semantic check against the user's locale to verify that the digits given actually match what's allowed. Same for the separator.
This way you don't need to do all kind of hacks, add platform code and so on.
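The second step, the semantic check, is independent of ANTLR and could be sketched as follows in C. Everything here is an assumption made for illustration: the locale_symbols struct, the two sample symbol tables and the rule that at most one radix separator is allowed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical description of what one locale accepts in a number. */
struct locale_symbols {
    const wchar_t *digits;      /* every character that counts as a digit */
    wchar_t radix_separator;    /* the decimal separator */
};

/* Sample tables (assumptions, not taken from any real locale database). */
static const struct locale_symbols english_symbols = { L"0123456789", L'.' };
static const struct locale_symbols german_symbols  = { L"0123456789", L',' };

/* Semantic check run after lexing: every character of the number token
   must be a digit of this locale, apart from at most one separator. */
static bool number_matches_locale(const wchar_t *text,
                                  const struct locale_symbols *symbols) {
    bool seen_digit = false;
    bool seen_separator = false;
    for (const wchar_t *p = text; *p != L'\0'; ++p) {
        if (wcschr(symbols->digits, *p) != NULL) {
            seen_digit = true;
        } else if (*p == symbols->radix_separator && !seen_separator) {
            seen_separator = true;
        } else {
            return false;   /* foreign digit, wrong or repeated separator */
        }
    }
    return seen_digit;
}
```

The grammar itself stays locale-neutral; only the table passed to the check changes per user.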
For your side question: programming usually uses the English language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.
Note that from ANTLR v4.7 onward, lexer grammars support more Unicode features: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
: [\p{Digit}]
;
which will match both ٣ and 3.

Macro contains a cycle

So I'm trying to make a lexical analyzer for Scheme, and when I run JFlex to convert the lexer.flex file I get an error similar to this one, for example:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
the macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
All of the macros used here have been defined, but for some reason I can't get rid of this error and I don't know why it's happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).
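To make the contrast concrete, here is a minimal hand-written recursive recognizer in C for nested parenthesized forms. It is only a sketch of the idea that recursion belongs in the parser: the simplified grammar (datum := atom | '(' datum* ')') and the function names are assumptions for illustration, not Scheme's real syntax and not something JFlex generates.

```c
#include <ctype.h>
#include <stdbool.h>

/* Advance *s past any whitespace. */
static void skip_spaces(const char **s) {
    while (isspace((unsigned char)**s)) ++*s;
}

/* datum := atom | '(' datum* ')'
   The function calls itself for nested lists, which is exactly what a
   macro expansion into a single regular expression cannot express. */
static bool parse_datum(const char **s) {
    skip_spaces(s);
    if (**s == '(') {
        ++*s;
        for (;;) {
            skip_spaces(s);
            if (**s == ')') { ++*s; return true; }
            if (**s == '\0') return false;      /* unclosed list */
            if (!parse_datum(s)) return false;
        }
    }
    /* atom: a non-empty run of characters other than spaces and parens */
    const char *start = *s;
    while (**s != '\0' && **s != '(' && **s != ')' &&
           !isspace((unsigned char)**s)) {
        ++*s;
    }
    return *s != start;
}

/* True if text consists of exactly one well-formed datum. */
static bool is_single_datum(const char *text) {
    const char *s = text;
    if (!parse_datum(&s)) return false;
    skip_spaces(&s);
    return *s == '\0';
}
```

In a real project the lexer would hand tokens such as '(' , ')' and identifiers to a generated parser, and a recursive rule like definition would live in that parser's grammar.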

Making a customized error message in yacc tool

Hi I just started working on lex and yacc tools.
I realized that yyerror receives only the string "syntax error" from yacc.
I was wondering if I can customize this string.
Oh, and can I also differentiate between different types of errors? (I'm trying to report a missing token and an additional token as different errors.)
If so, how should I..?
Many thanks.
You're free to print any message you want to in yyerror (or even no message at all), so you can customise messages as you see fit. A common customisation is to add the line number (and possibly column number) of the token which triggered the error. You can certainly change the text if you want to, but if you just want to change it to a different language, you should probably use the gettext mechanism. You'll find .po files in the runtime-po subdirectory of the source distribution. If this facility is enabled, bison will arrange for the string to be translated before it is passed to yyerror, but of course you could do the translation yourself in yyerror if that is more convenient for you.
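As a minimal sketch of the line-number customisation, assuming the scanner keeps a counter up to date (current_line and format_syntax_error are hypothetical names; with flex you might use yylineno instead):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical line counter; the scanner is assumed to increment it
   every time it consumes a newline. */
static int current_line = 1;

/* Build the text that yyerror will print. Returns a pointer to a static
   buffer, so the result is only valid until the next call. */
static const char *format_syntax_error(int line, const char *msg) {
    static char buf[256];
    snprintf(buf, sizeof buf, "line %d: %s", line, msg);
    return buf;
}

/* The error callback invoked by the generated parser. */
void yyerror(const char *msg) {
    fprintf(stderr, "%s\n", format_syntax_error(current_line, msg));
}
```

Keeping the formatting in its own function makes it easy to add a column number or a file name later without touching the callback.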
I suspect that what you actually want is for bison to produce a more informative error message. Bison only has one alternative error message format, which includes a list of "expected" tokens. You can ask Bison to produce such an error message by including
%define parse.error verbose
in your prologue. As the manual indicates, the bison parsing algorithm can sometimes produce an incorrect list of expected tokens (since it was not designed for this particular purpose); you can get a more precise list by enabling lookahead correction by also including
%define parse.lac full
This does have a minor performance penalty. See the linked manual section for details.
The list of tokens produced by this feature uses the name of the token as supplied in the bison file. These names are usually not very user-friendly, so you might find yourself generating error messages such as the infamous PHP error
syntax error, unexpected T_CONSTANT_ENCAPSED_STRING
(Note: more recent PHP versions produce a different but equally mysterious message.)
To avoid this, define double-quoted aliases for your tokens. This can also make your grammar a lot more readable:
%token <string> TOK_ID "identifier"
%token TOK_IF "if" TOK_ELSE "else" TOK_WHILE "while"
%token TOK_LSH "<<"
/* Etc. */
%%
stmt: expr ';'
| while
| if
| /* ... */
while: "while" '(' expr ')' stmt
expr: "identifier"
| expr "<<" expr
/* ... */
The quoted names will not be passed through gettext. That's appropriate for names which are keywords, but it might be desirable to translate descriptive token aliases. A procedure to do so is outlined in this answer.

Extract function names from function calls in C files

Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a foolproof pattern for finding method calls, but it should handle the patterns that you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also use the regular expressions already written from grammar that are used by emacs -- imenu , etags, ecb, c-mode etc.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., must start with a letter or underscore, followed by letters, underscores, digits, etc.) followed by a left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
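The incremental search can be sketched in a few lines of C (the same loop translates directly to Ruby). This is a deliberately naive illustration with the weaknesses just described: it does not skip comments or string constants, and it reports keywords such as if or while when they are followed by a parenthesis. find_call_names and MAX_NAME are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 64

/* Collect every identifier that is followed (after optional whitespace)
   by '('. Stores at most max_names names and returns how many were
   stored. Names of MAX_NAME characters or more are skipped. */
static size_t find_call_names(const char *src, char names[][MAX_NAME],
                              size_t max_names) {
    size_t count = 0;
    const char *p = src;
    while (*p != '\0') {
        if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_') ++p;
            size_t len = (size_t)(p - start);
            const char *q = p;
            while (isspace((unsigned char)*q)) ++q;
            if (*q == '(' && count < max_names && len < MAX_NAME) {
                memcpy(names[count], start, len);
                names[count][len] = '\0';
                ++count;
            }
        } else if (isdigit((unsigned char)*p)) {
            /* Skip a numeric literal, including suffixes such as 10u or
               0x1F, so its letters are not mistaken for an identifier. */
            while (isalnum((unsigned char)*p) || *p == '_' || *p == '.') ++p;
        } else {
            ++p;
        }
    }
    return count;
}
```

For the input myfunc(anotherfunc(1, 2)); this stores myfunc and anotherfunc, in that order.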
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.

User-defined Literals suffix, with *_digit..."?

A user-defined literal suffix in C++0x should be an identifier that
starts with _ (underscore) (17.6.4.3.5)
should not begin with _ followed by uppercase letter (17.6.4.3.2)
Each name that [...] begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
Is there any reason why such a suffix may not start with _ followed by a digit? I.e. _4 or _3musketeers?
Musketeer dartagnan = "d'Artagnan"_3musketeers;
int num = 123123_4; // to be interpreted in base4 system?
string s = "gdDadndJdOhsl2"_64; // base64decoder
The precedent for identifiers of the form _<number> is the function argument placeholder object mechanism in std::placeholders (§20.8.9.1.3), which defines an implementation-defined number of such symbols.
This is a good thing, because it means the user cannot #define any identifier of that form. §17.6.4.3.1/1:
A translation unit that includes a standard library header shall not #define or #undef names declared in any standard library header.
The name of the user-defined literal function is operator "" _123, not simply _123, so there is no direct conflict between your name and the library name, even in the presence of using namespace std::placeholders;.
My 2¢, though, is that you would be better off with an operator "" _baseconv and encoding the base within the literal, "123123_4"_baseconv.
Edit: Looking at Johannes' (deleted) answer, there may be concern that _123 could be used as a macro by the implementation. This is certainly the realm of theory, as the implementation would have little to gain by such preprocessor use. Furthermore, if I'm not mistaken, the reason for hiding these symbols in std::placeholders, not std itself, is that such names are more likely to be used by the user, such as by inclusion of Boost Bind (which does not hide them inside a named namespace).
The tokens are not reserved for use by the implementation globally (17.6.4.3.2), and there is precedent for their use, so they are at least as safe as, say, forward.
"can" vs "may".
can denotes ability where may denotes permission.
Is there a reason why you would not have permission to start a user-defined literal suffix with _ followed by a digit?
Permission implies coding standards or best practices. The examples you provide seem to show that _\d would be fine suffixes if used correctly (to denote numeric base). Unfortunately your question can't have a well-thought-out answer, as no one has experience with this new language feature yet.
Just to be clear user-defined literal suffixes can start with _\d.
An underscore followed by a digit is a legal user-defined literal suffix.
The function signature would be:
operator"" _4();
so it couldn't get eaten by a placeholder.
The literal would be a single preprocessor token:
123123_4;
so the _4 would not get clobbered by a placeholder or a preprocessor symbol.
My reading of 17.6.4.3.5 is that suffixes not containing a leading underscore risk collision with the implementation or future library additions. They also collide with existing suffixes: F, L, ULL, etc. One of the rationales for user-defined literals is that a new type (decimals, for example) could be defined as a pure library extension, including literals with suffixes d, df, dl.
Then there's the question of style and readability. Personally, I think I would lose sight of the suffix in 1234_3. Maybe, maybe not.
Finally, there was some idea that didn't make it into the standard (but I kind of like) to have _ be a literal separator for numbers like in Ada and Ruby. So you could have 123_456_789 to visually separate thousands for example. Your suffix would break if that ever went through.
I knew I had some papers on this subject:
Digit Separators describes a proposal to use _ as a digit separator in numeric literals
Ambiguity and Insecurity with User-Defined literals describes the evolution of ideas about literal suffix naming and namespace reservation, and efforts to deconflict user-defined literals against a future digit separator.
It just doesn't look that good for the _ digit separator.
I had an idea though: how about either a backslash or a backtick for digit separator? It isn't as nice as _ but I don't think there would be any collision as long as the backslash was inside the stream of digits. The backtick has no lexical use currently that I know of.
i = 123\456\789;
j = 0xface\beef;
or
i = 123`456`789;
j = 0xface`beef;
This would leave _123 as a literal suffix.

Resources