What caseless comparison algorithm is CompareStringW using? - winapi

For compatibility reasons I need to replicate the behaviour of another application. It is using Unicode strings as identifiers but ignoring case and performing some sort of normalisation. By intercepting API calls I have determined it is using CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, SORT_STRINGSORT, ...) to do the comparison.
I can just call this function directly for every pair of strings in the set of strings I am considering but I would prefer a canonical form I can use in a hash table.
Does anyone know what algorithm CompareStringW uses with those flags set? Is it a standard Unicode algorithm?
Can I use NormalizeString and FoldString to generate this canonical form? If I can, what arguments do I need to pass?
edit: As David Heffernan pointed out I'll need to use FoldString as well as NormalizeString to do a proper caseless comparison.

The great Michael Kaplan (RIP) provided a lot of useful information about the NLS functions on his blog over the years, and some posts were archived before Microsoft made the blog internal-only.
His A few of the gotchas of CompareString post provides descriptions of these flags:
NORM_IGNORECASE - Ignore case. A better name for this flag might have been IGNORE_TERTIARYWEIGHT since that is what it accomplishes (it masks the tertiary weight), although it is obviously too late to consider such a change. It can cause undesirable results when used in the comparison of strings containing characters that depend on the weight for vital information, which thankfully is a very small number of cases. But if you are not expecting "ʏ", "Y", and "y" (U+028f, U+0059, and U+0079, a.k.a. LATIN LETTER SMALL CAPITAL Y, LATIN CAPITAL LETTER Y, and LATIN SMALL LETTER Y) to all be equal, then you may want to think twice about throwing this flag into the mix. You will also lose the distinctions of the final forms for Hebrew (e.g. "מ" and "ם", U+05de and U+05dd, a.k.a. HEBREW LETTER MEM and HEBREW LETTER FINAL MEM), Arabic (e.g. "ش" U+0634, a.k.a. ARABIC LETTER SHEEN, and its isolated, final, initial, and medial forms (ﺵ, ﺶ, ﺷ, and ﺸ) at U+feb5, U+feb6, U+feb7, and U+feb8), and other languages.
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:
//
// Sorting Flags.
//
// WORD Sort: culturally correct sort
// hyphen and apostrophe are special cased
// example: "coop" and "co-op" will sort together in a list
//
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// co-op <------- hyphen (punctuation)
// cork
// went
// were
// we're <------- apostrophe (punctuation)
//
//
// STRING Sort: hyphen and apostrophe will sort with all other symbols
//
// co-op <------- hyphen (punctuation)
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// cork
// we're <------- apostrophe (punctuation)
// went
// were
//
The results might also vary depending on the Windows version, because newer versions of Windows support later Unicode versions.
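For the hash-table half of the question, one documented option (a suggestion, not something the target application is known to do) is to have LCMapStringW build a sort key: when LCMAP_SORTKEY is combined with the same locale and the same NORM_IGNORECASE and SORT_STRINGSORT flags, the resulting byte sequences compare with memcmp exactly the way CompareStringW compares the original strings, so the sort key can serve as the canonical form. A minimal C++ sketch:

#include <windows.h>
#include <string>
#include <vector>

// Builds a byte string that is equal for two inputs exactly when
// CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, SORT_STRINGSORT, ...)
// reports CSTR_EQUAL for them.
std::string SortKeyForHashing(const std::wstring& s)
{
    const DWORD flags = LCMAP_SORTKEY | NORM_IGNORECASE | SORT_STRINGSORT;

    // First call: ask for the required size of the sort key, in bytes.
    int bytes = LCMapStringW(LOCALE_USER_DEFAULT, flags,
                             s.c_str(), static_cast<int>(s.size()),
                             nullptr, 0);
    if (bytes == 0)
        return {};   // inspect GetLastError() in real code

    std::vector<BYTE> key(bytes);
    bytes = LCMapStringW(LOCALE_USER_DEFAULT, flags,
                         s.c_str(), static_cast<int>(s.size()),
                         reinterpret_cast<LPWSTR>(key.data()), bytes);
    return std::string(key.begin(), key.begin() + bytes);
}

Microsoft warns that sort keys are only stable on the machine and Windows version that produced them, so use them as in-memory hash keys rather than persisting them.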

Related

What do I do in ANTLR if I want to parse something which is extremely configurable?

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.
Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.
What is the best way to do this in ANTLR?
What I'm currently considering is something like this:
lexer grammar ExpressionLexer;
options {
superClass = AbstractLexer;
}
DIGIT: . {isDigit(getText())}?;
// ... and so on for other tokens ...
abstract class AbstractLexer(input: CharStream, private val symbols: Symbols) : Lexer(input) {
    fun isDigit(text: String): Boolean = symbols.isDigit(text.codePointAt(0))
    // ... and so on for other tokens ...
}
Alternative approaches I am considering:
(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.
(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)
(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)
I recommend doing that in two steps:
Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.
Do a semantic check against the user's locale to verify that the digits that were given actually match what's allowed. Same for the separator.
This way you don't need to do all kinds of hacks, add platform code, and so on; a sketch of such a check follows below.
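The check itself can live entirely outside the grammar. The sketch below is illustrative only (LocaleSymbols and validateNumberToken are invented names, not ANTLR or platform APIs, and it is written in C++ just for concreteness); the point is that the lexer stays locale-agnostic and the locale rules are applied in a later pass over the matched token text:

#include <string>
#include <unordered_set>

// A locale description supplied by the application, e.g. built from CLDR data.
struct LocaleSymbols {
    std::unordered_set<char32_t> digits;   // e.g. U'0'..U'9', or U+0660..U+0669
    char32_t radixSeparator;               // e.g. U'.' or U','
};

// Validates an already-lexed number token, given as a sequence of code points.
bool validateNumberToken(const std::u32string& token, const LocaleSymbols& loc)
{
    for (char32_t cp : token) {
        if (cp == loc.radixSeparator)
            continue;
        if (loc.digits.count(cp) == 0)
            return false;                  // report a semantic error here
    }
    return true;
}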
For your side question: programming usually uses the English language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.
Note that in ANTLR v4.7 and up, more is possible with respect to Unicode inside ANTLR's lexer grammars: https://github.com/antlr/antlr4/blob/master/doc/unicode.md
So you could define a lexer rule like this:
DIGIT
: [\p{Digit}]
;
which will match both ٣ and 3.

DT_WORDBREAK: list of word break symbols

I use the DT_WORDBREAK flag when I call DrawTextEx. About this flag, MSDN says:
Lines are automatically broken between words if a word extends past
the edge of the rectangle specified by the lprc parameter. A carriage
return-line feed sequence also breaks the line.
But I cannot find an "official" list of symbols that are used as word break characters. Does it exist?
If you get the TEXTMETRIC structure for the font you're using, the break character corresponds to the tmBreakChar field.
For any Latin font, this is almost certainly just the plain old space character (Unicode U+0020 SPACE or ASCII 32).
I don't think DrawTextEx does anything fancier. You'd have to use a more advanced API to get more sophisticated behavior such as breaking after hyphens, soft-hyphens, other kinds of spaces, etc.
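If you want to check the value yourself, a small sketch (assuming you already have a device context with the relevant font selected into it) might look like this:

#include <windows.h>
#include <cstdio>

// Prints the word-break character reported for the font currently
// selected into the given device context.
void PrintBreakChar(HDC hdc)
{
    TEXTMETRICW tm = {};
    if (GetTextMetricsW(hdc, &tm)) {
        // For most Latin fonts this prints U+0020 (the space character).
        std::printf("tmBreakChar = U+%04X\n", static_cast<unsigned>(tm.tmBreakChar));
    }
}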

What is VBS UCASE function doing to Japanese?

In order to avoid case conflicts when comparing strings on a classic ASP site, some inherited code converts all strings with UCASE() first. This seems to work well across languages ... except Japanese. Here's a simple example with a Japanese string. I've provided the UrlEncoded values to make it clear how little is changing behind the scenes:
Server.UrlEncode("戦艦帝国") = %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
UCASE("戦艦帝国") = ƈ�ȉ�Ÿ�ś�
Server.UrlEncode(UCASE("戦艦帝国")) = %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
So is UCASE doing anything sensible with this Japanese string? Or is its behavior buggy, undefined, or known to be incompatible with Japanese?
(LCASE leaves the sample string alone. But now I'm wary of switching all comparisons to LCASE because I don't know if it bungles other non-western languages that do work with UCASE....)
https://msdn.microsoft.com/en-us/library/1systdcy(v=vs.84).aspx
Only lowercase letters are converted to uppercase; all uppercase letters and non-letter characters remain unchanged.
https://en.wikipedia.org/wiki/Letter_case
Most Western languages (particularly those with writing systems based on the Latin, Cyrillic, Greek, Coptic, and Armenian alphabets) use letter cases in their written form as an aid to clarity. Scripts using two separate cases are also called bicameral scripts. Many other writing systems make no distinction between majuscules and minuscules – a system called unicameral script or unicase.
"lowercase or uppercase letters" does not apply in Chinese-Japanese-Korean languages, hence, the output of UCase() should remain unchanged.

Error with utf8 encoding

When I get data from some websites, sometimes the data is encoded in UTF-8 but looks like this:
Thỏ , Nạt
The accent mark is separated from the character, when in fact the string should be:
Thỏ, Nạt
I don't know what the problem is here or how to correct it. Can someone help me with this?
The first sample string contains two Vietnamese characters in decomposed form. The first one of them is “ỏ”, consisting of simple letter “o” followed by U+0309 COMBINING HOOK ABOVE.
The second sample string has those characters in precomposed form. The first one of them is “ỏ” U+1ECF LATIN SMALL LETTER O WITH HOOK ABOVE.
The decomposed and precomposed forms are defined to be "canonically equivalent" and are normally expected to result in the same rendering (though this does not always happen). They are not identical, however; in programmatic comparison of characters and strings, they are very much different.
Mostly Latin letters with diacritics, such as “é” and “ä”, are used in precomposed form only, since that’s what keyboard drivers, online keyboards, character picking utilities, etc., normally produce. However, Vietnamese keyboard drivers often work so that some diacritic marks are entered after entering a base character, and the diacritic is thus produced as a combining character, i.e. the letter (like “ỏ”) is then in decomposed form.
One way of dealing with this issue, recommended in many contexts, is to convert your strings to Normalization Form C (NFC). This would put these characters into precomposed form. Note, however, that conversion to NFC removes some other distinctions, too (but this is not relevant if the text is in Vietnamese only and does not contain special symbols).
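On Windows, for example, the NormalizeString API performs that conversion (link against Normaliz.lib); other stacks would use ICU or their runtime's own normalization routines. A minimal sketch, with error handling reduced to comments:

#include <windows.h>
#include <string>
#include <vector>

// Returns the NFC (precomposed) form of the input, or the input unchanged
// if normalization fails.
std::wstring ToNFC(const std::wstring& s)
{
    // First call estimates the required destination size.
    int needed = NormalizeString(NormalizationC, s.c_str(),
                                 static_cast<int>(s.size()), nullptr, 0);
    if (needed <= 0)
        return s;   // inspect GetLastError() in real code

    std::vector<wchar_t> buf(needed);
    int written = NormalizeString(NormalizationC, s.c_str(),
                                  static_cast<int>(s.size()),
                                  buf.data(), needed);
    if (written <= 0)
        return s;   // ERROR_INSUFFICIENT_BUFFER means retry with a larger buffer

    return std::wstring(buf.data(), written);
}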
It remains a mystery why the first sample string has a space character before the comma.

User-defined Literals suffix, with "*_digit..."?

A user-defined literal suffix in C++0x should be an identifier that
starts with _ (underscore) (17.6.4.3.5)
should not begin with _ followed by an uppercase letter (17.6.4.3.2):
Each name that [...] begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
Is there any reason why such a suffix may not start with _ followed by a digit, i.e. _4 or _3musketeers?
Musketeer dartagnan = "d'Artagnan"_3musketeers;
int num = 123123_4; // to be interpreted in base4 system?
string s = "gdDadndJdOhsl2"_64; // base64decoder
The precedent for identifiers of the form _<number> is the function argument placeholder object mechanism in std::placeholders (§20.8.9.1.3), which defines an implementation-defined number of such symbols.
This is a good thing, because it means the user cannot #define any identifier of that form. §17.6.4.3.1/1:
A translation unit that includes a standard library header shall not #define or #undef names declared in any standard library header.
The name of the user-defined literal function is operator "" _123, not simply _123, so there is no direct conflict between your name and the library name, even in the presence of using namespace std::placeholders;.
My 2¢, though, is that you would be better off with an operator "" _baseconv and encoding the base within the literal, "123123_4"_baseconv.
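As a rough illustration of that suggestion (a sketch, not a polished implementation: constexpr evaluation and stricter error handling are left out), such a literal operator might look like this:

#include <cstddef>
#include <stdexcept>
#include <string>

// Cooked string-literal operator: receives the characters of the literal,
// splits them on the last '_', and interprets the digits in the given base.
long long operator "" _baseconv(const char* str, std::size_t len)
{
    std::string text(str, len);
    std::size_t sep = text.rfind('_');
    if (sep == std::string::npos)
        throw std::invalid_argument("expected \"digits_base\"");

    int base = std::stoi(text.substr(sep + 1));
    return std::stoll(text.substr(0, sep), nullptr, base);
}

// Usage:
//   long long x = "123123_4"_baseconv;   // 123123 read in base 4 == 1755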
Edit: Looking at Johannes' (deleted) answer, there may be concern that _123 could be used as a macro by the implementation. This is certainly the realm of theory, as the implementation would have little to gain by such preprocessor use. Furthermore, if I'm not mistaken, the reason for hiding these symbols in std::placeholders, not std itself, is that such names are more likely to be used by the user, such as by inclusion of Boost Bind (which does not hide them inside a named namespace).
The tokens are not reserved for use by the implementation globally (17.6.4.3.2), and there is precedent for their use, so they are at least as safe as, say, forward.
"can" vs "may".
can denotes ability, whereas may denotes permission.
Is there a reason why you would not have permission to start a user-defined literal suffix with _ followed by a digit?
Permission implies coding standards or best practices. The examples you provide seem to show that _\d would be a fine suffix if used correctly (to denote the numeric base). Unfortunately your question can't have a well thought out answer, as no one has experience with this new language feature yet.
Just to be clear, user-defined literal suffixes can start with _\d.
An underscore followed by a digit is a legal user-defined literal suffix.
The function signature would be:
operator"" _4();
so it couldn;t get eaten by a placeholder.
The literal would be a single preprocessor token:
123123_4;
so the _4 would not get clobbered by a placeholder or a preprocessor symbol.
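For completeness, a raw literal operator with that name might look like the following sketch (the return type and base-4 interpretation are assumptions for the example); the compiler hands the whole digit sequence to the operator as a string, which is then free to reinterpret it:

#include <cstdlib>

// Raw literal operator: for 123123_4 it receives the characters "123123".
long long operator "" _4(const char* digits)
{
    return std::strtoll(digits, nullptr, 4);   // "123123" in base 4 == 1755
}

// Usage:
//   long long n = 123123_4;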
My reading of 17.6.4.3.5 is that suffixes not containing a leading underscore risk collision with the implementation or future library additions. They also collide with existing suffixes: F, L, ULL, etc. One of the rationales for user-defined literals is that a new type (such as decimals, for example) could be defined as a pure library extension, including literals with suffixes d, df, and dl.
Then there's the question of style and readability. Personally, I think I would lose sight of the suffix in 1234_3. Maybe, maybe not.
Finally, there was an idea that didn't make it into the standard (but which I kind of like) to have _ be a literal separator for numbers, as in Ada and Ruby. So you could have 123_456_789 to visually separate thousands, for example. Your suffix would break if that ever went through.
I knew I had some papers on this subject:
Digital Separators describes a proposal to use _ as a digit separator in numeric literals
Ambiguity and Insecurity with User-Defined Literals describes the evolution of ideas about literal suffix naming and namespace reservation and efforts to deconflict user-defined literals against a future digit separator.
It just doesn't look that good for the _ digit separator.
I had an idea though: how about either a backslash or a backtick for digit separator? It isn't as nice as _ but I don't think there would be any collision as long as the backslash was inside the stream of digits. The backtick has no lexical use currently that I know of.
i = 123\456\789;
j = 0xface\beef;
or
i = 123`456`789;
j = 0xface`beef;
This would leave _123 as a literal suffix.
