What do I do in ANTLR if I want to parse something which is extremely configurable? - internationalization

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
    superClass = AbstractLexer;
}
// Match any single character, then let the locale decide whether it is a digit.
// _input.LA(-1) is the code point that was just consumed (Java target).
DIGIT: . {isDigit(_input.LA(-1))}?;
// ... and so on for other tokens ...

abstract class AbstractLexer(input: CharStream, private val symbols: Symbols) : Lexer(input) {
    fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
    // ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)

I recommend doing that in two steps:
Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
Then do a semantic check against the user's locale to verify that the digits that were given actually match what's allowed. Do the same for the separator.
This way you don't need all kinds of hacks, extra platform code and so on.
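The two steps above can be sketched roughly as follows. This is only an illustration in Ruby rather than ANTLR, and everything in it is an assumption made for the example: the LOCALE_SYMBOLS table, the parse_number helper and the small set of separator characters are all hypothetical.

```ruby
# Hypothetical per-locale symbol table (illustrative values only).
ARABIC_INDIC_DIGITS = (0x0660..0x0669).map { |cp| cp.chr(Encoding::UTF_8) }.join

LOCALE_SYMBOLS = {
  "en" => { digits: "0123456789", separator: "." },
  "de" => { digits: "0123456789", separator: "," },
  "ar" => { digits: ARABIC_INDIC_DIGITS, separator: "\u066B" }
}.freeze

# Step 1: structural parse. Accept any Unicode decimal digit and any of a
# few separator characters, without asking which locale they belong to.
NUMBER_SHAPE = /\A(\p{Nd}+)(?:([.,\u066B])(\p{Nd}+))?\z/

def parse_number(text, locale)
  match = NUMBER_SHAPE.match(text)
  return nil unless match # not shaped like a number at all

  symbols = LOCALE_SYMBOLS.fetch(locale)
  integer_part, separator, fraction_part = match.captures

  # Step 2: semantic check of digits and separator against the locale.
  all_digits = integer_part + fraction_part.to_s
  return nil unless all_digits.each_char.all? { |c| symbols[:digits].include?(c) }
  return nil if separator && separator != symbols[:separator]

  { integer: integer_part, fraction: fraction_part }
end
```

With this split, "3,14" is accepted for "de" but rejected for "en", and the structural rule never has to change when a locale is added.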
For your side question: programming languages usually follow English conventions, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.

Note that as of ANTLR 4.7, lexer grammars have better Unicode support, including Unicode property escapes: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
: [\p{Digit}]
;
which will match both ٣ and 3.

Related

What are valid identifiers in R7RS-small?

R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, vertical bars (|..ident..|) delimit a symbol, which allows any character that you cannot put in an old-style symbol.
However, in R6RS the "official" grammar was incorrect, as it did not allow symbols such as 1+, which led implementations to define their own rules to work around this flaw in the official grammar.
Unless you need to read the source code of a given implementation and see how it defines symbols, you should not care too much about these rules and should stick to classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt that implementations follow it exactly.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation typically reads an atom with a function like read-atom and then classifies it by trying read-number; if number? fails, the atom is a symbol.
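A minimal sketch of that read-then-classify strategy, written in Ruby purely for illustration. The names read_atom and classify_atom are hypothetical, the delimiter set is the one listed in R7RS 7.1.1 (whitespace, vertical line, parentheses, double quote, semicolon), and the number pattern is a deliberately simplified stand-in for Scheme's real number syntax.

```ruby
# Delimiters as listed in R7RS 7.1.1.
DELIMITERS = /[\s|()";]/

# Simplified number syntax (an assumption): optional sign, digits,
# optional decimal fraction. Real Scheme numbers are far richer.
SIMPLE_NUMBER = /\A[+-]?\d+(?:\.\d+)?\z/

# Read characters from `start` up to the next delimiter or end of input.
def read_atom(source, start = 0)
  stop = source.index(DELIMITERS, start) || source.length
  source[start...stop]
end

# An atom that does not parse as a number is treated as a symbol.
def classify_atom(atom)
  SIMPLE_NUMBER.match?(atom) ? :number : :symbol
end

%w[42 1+ a#a b,b].each do |text|
  atom = read_atom(text)
  puts "#{atom} => #{classify_atom(atom)}"
end
```

A reader built this way accepts a#a and b,b as single symbols, because neither # nor , is a delimiter. That is a property of this lenient strategy, not a statement about what the R7RS grammar permits.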

What caseless comparison algorithm is CompareStringW using?

For compatibility reasons I need to replicate the behaviour of another application. It is using Unicode strings as identifiers but ignoring case and performing some sort of normalisation. By intercepting API calls I have determined it is using CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, SORT_STRINGSORT, ...) to do the comparison.
I can just call this function directly for every pair of strings in the set of strings I am considering but I would prefer a canonical form I can use in a hash table.
Does anyone know what algorithm CompareStringW uses with those flags set? Is it a standard Unicode algorithm?
Can I use NormalizeString and FoldString to generate this canonical form? If I can, what arguments do I need to pass?
edit: As David Heffernan pointed out I'll need to use FoldString as well as NormalizeString to do a proper caseless comparison.
The great Michael Kaplan (RIP) provided a lot of useful information about the NLS functions on his blog over the years, and some posts were archived before Microsoft made the blog internal-only.
His post "A few of the gotchas of CompareString" describes these flags:
NORM_IGNORECASE - Ignore case. A better name for this flag might have been IGNORE_TERTIARYWEIGHT since that is what it accomplishes (it masks the tertiary weight), although it is obviously too late to consider such a change. It can cause undesirable results when used in the comparison of strings containing characters that depend on the weight for vital information, which thankfully is a very small number of cases. But if you are not expecting "ʏ", "Y", and "y" (U+028f, U+0059, and U+0079, a.k.a. LATIN LETTER SMALL CAPITAL Y, LATIN LETTER CAPITAL Y, and LATIN LETTER SMALL Y) to all be equal, then you may want to think twice about throwing this flag into the mix. You will also lose the distinctions of the final forms for Hebrew (e.g. "מ" and "ם", U+05de U+05dd a.k.a. HEBREW LETTER MEM and HEBREW LETTER FINAL MEM), Arabic (e.g. "ش" U+0634 a.k.a. ARABIC LETTER SHEEN and its isolated, final, initial, and medial forms (ﺵ, ﺶ, ﺷ, and ﺸ) at U+feb5, U+feb6, U+feb7, and U+feb8, and other languages.
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:
//
// Sorting Flags.
//
// WORD Sort: culturally correct sort
// hyphen and apostrophe are special cased
// example: "coop" and "co-op" will sort together in a list
//
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// co-op <------- hyphen (punctuation)
// cork
// went
// were
// we're <------- apostrophe (punctuation)
//
//
// STRING Sort: hyphen and apostrophe will sort with all other symbols
//
// co-op <------- hyphen (punctuation)
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// cork
// we're <------- apostrophe (punctuation)
// went
// were
//
The results might also vary depending on the Windows version, because newer versions support later Unicode versions.

What is VBS UCASE function doing to Japanese?

In order to avoid case conflicts comparing strings on an ASP classic site, some inherited code converts all strings with UCASE() first. This seems to work well across languages ... except Japanese. Here's a simple example on a Japanese string. I've provided the UrlEncoded values to make it clear how little is changing behind the scenes:
Server.UrlEncode("戦艦帝国") = %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
UCASE("戦艦帝国") = ƈ�ȉ�Ÿ�ś�
Server.UrlEncode(UCASE("戦艦帝国")) = %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
So is UCASE doing anything sensible with this Japanese string? Or is its behavior buggy, undefined, or known to be incompatible with Japanese?
(LCASE leaves the sample string alone. But now I'm wary of switching all comparisons to LCASE because I don't know if it bungles other non-western languages that do work with UCASE....)
https://msdn.microsoft.com/en-us/library/1systdcy(v=vs.84).aspx
Only lowercase letters are converted to uppercase; all uppercase letters and non-letter characters remain unchanged.
https://en.wikipedia.org/wiki/Letter_case
Most Western languages (particularly those with writing systems based on the Latin, Cyrillic, Greek, Coptic, and Armenian alphabets) use letter cases in their written form as an aid to clarity. Scripts using two separate cases are also called bicameral scripts. Many other writing systems make no distinction between majuscules and minuscules – a system called unicameral script or unicase.
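To make the byte-level change in the question concrete, here is a small Ruby sketch (not VBScript). It is built on an assumption, not on documented UCase behaviour: that the UTF-8 bytes were treated as single-byte Latin-1-style characters and uppercased one byte at a time. The helper names are hypothetical.

```ruby
# Uppercase each byte as if it were a Latin-1 character: a-z and the
# accented lowercase range 0xE0-0xFE (except 0xF7, the division sign)
# sit exactly 0x20 above their uppercase forms.
def latin1_upcase_bytes(bytes)
  bytes.map do |b|
    lowercase = (0x61..0x7A).cover?(b) || ((0xE0..0xFE).cover?(b) && b != 0xF7)
    lowercase ? b - 0x20 : b
  end
end

# Percent-encode every byte, mirroring the URL-encoded values in the question.
def percent_encode(bytes)
  bytes.map { |b| format("%%%02X", b) }.join
end

original = "戦艦帝国"
puts percent_encode(original.bytes)
puts percent_encode(latin1_upcase_bytes(original.bytes))

# A Unicode-aware uppercase leaves the string untouched.
puts original.upcase == original
```

Under that assumption the second line reproduces the %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD value from the question: only the lead bytes E6, E8 and E5 change, becoming C6, C8 and C5. That pattern is consistent with an encoding mismatch between the page and the string handling, though the source does not confirm the cause.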
"lowercase or uppercase letters" does not apply in Chinese-Japanese-Korean languages, hence, the output of UCase() should remain unchanged.

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。' where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How can I check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page that displays in the relevant language depending on where it is being viewed from. I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, keyed by a variable with a name like 'language', which lets the tests change the language on the fly. Validations would then look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but falls within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle. The final period in the English pattern is escaped so that it matches a literal period rather than any character. (Note that the match fails if no digits appear; if you want zero or more digits, use * in place of the +.)
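Rather than writing each regex by hand, the pattern can be derived from the property-file string itself. The following is a sketch under stated assumptions: TEMPLATES stands in for values loaded from the property files, and message_pattern and extract_number are hypothetical helper names.

```ruby
# Stand-ins for the strings loaded from the property files.
TEMPLATES = {
  en: "Waiting time is {0} minutes.",
  ja: "待ち時間は{0}分です。"
}.freeze

# Escape the literal text around the placeholder and put a digit capture
# where {0} was. Note that \d matches ASCII digits only in Ruby.
def message_pattern(template)
  literal_parts = template.split("{0}", -1).map { |part| Regexp.escape(part) }
  Regexp.new("\\A" + literal_parts.join("(\\d+)") + "\\z")
end

# Return the number found in the page text, or nil if the text does not
# match the template.
def extract_number(template, page_text)
  match = message_pattern(template).match(page_text)
  match && Integer(match[1], 10)
end

puts extract_number(TEMPLATES[:en], "Waiting time is 12 minutes.")
puts extract_number(TEMPLATES[:ja], "待ち時間は5分です。")
```

In a Watir test the page_text argument would be the text of the element under test, so one assertion can confirm both that the localized message is present and that the number is the expected one.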
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
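As a rough illustration of that intermediate form, the sketch below renders a [key, value] pair into localized text. The MESSAGES table and the render helper are hypothetical, and the %d format strings are an assumption standing in for the real property-file entries.

```ruby
# Hypothetical message catalogue keyed by locale and message id.
MESSAGES = {
  en: { waiting_time_in_minutes: "Waiting time is %d minutes." },
  ja: { waiting_time_in_minutes: "待ち時間は%d分です。" }
}.freeze

# Render an intermediate message such as [:waiting_time_in_minutes, 10]
# into the final localized string.
def render(message, locale)
  key, *arguments = message
  format(MESSAGES.fetch(locale).fetch(key), *arguments)
end

message = [:waiting_time_in_minutes, 10]
puts render(message, :en)
puts render(message, :ja)
```

A test can then assert on the intermediate pair itself and leave the per-language rendering to be verified once, separately.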
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

Extract function names from function calls in C files

Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a foolproof pattern for finding method calls, but it should cover the patterns you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also reuse the regular expressions already written for the C support in Emacs tools such as imenu, etags, ecb and c-mode.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., starting with a letter or underscore, followed by letters, underscores, digits, etc.) followed by a left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
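A minimal Ruby sketch of that incremental search, with the caveats above fully intact. CALL_PATTERN, C_KEYWORDS and called_function_names are hypothetical names, and the keyword list is intentionally incomplete.

```ruby
# Keywords that can be followed by "(" without being function calls
# (a partial list, for illustration only).
C_KEYWORDS = %w[if for while switch return sizeof].freeze

# An identifier (letter or underscore, then letters, digits, underscores)
# followed by optional whitespace and a left parenthesis.
CALL_PATTERN = /\b([A-Za-z_][A-Za-z0-9_]*)\s*\(/

# Collect every distinct name that looks like it is being called.
def called_function_names(source)
  source.scan(CALL_PATTERN).flatten.reject { |name| C_KEYWORDS.include?(name) }.uniq
end

puts called_function_names("myfunc(anotherfunc(1, 2));").inspect
```

Because the scan looks only for "name(" and does not try to match a whole call, nesting is not a problem here. It will still report names that appear in comments, string constants, declarations and macro definitions, which is where a real parser earns its keep.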
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.
