String sorting algorithm in FAST ESP - sorting

Is anyone familiar with the sorting logic in the FAST ESP engine, version 5.3? How are special characters handled, and how are Japanese and Chinese words sorted?
Here are the top 8 search results, sorted in ascending order:
門
¿ c
¿ c¡a «n »c ‹e ›r § ¶~#15
¿ c¡a «n »c ‹e ›r § ¶~#44
¿ c¡a «n »c ‹e ›r § ¶~#45
§ word document4
門 他の他の
門 他の他の 2
Does this mean that the 門 character is excluded from sorting?
And these are the top 10 search results, sorted in descending order:
他の門そ他の門
の他
他の
そ他の門そ他の
そ他の門門門
そ他他そ
そ
そ他
СЌРЅРІР»гЃќд»
марцпиорыв
It appears that the last two results, which contain Cyrillic characters, are handled correctly, but the ordering looks inconsistent where the result そ is placed between そ他他そ and そ他.

Sorting follows alphabetical order for Latin-script languages and Greek, but for CJK languages you need to set up the document configuration properly so that those languages are handled. You also need to install tokenization for those languages. Microsoft provides patches that add tokenization and a dictionary for each of them. It would be worth verifying that the search engine and the documents in the collection are configured correctly.

Related

What is the syntax for MariaDB 'IN NATURAL LANGUAGE MODE'?

According to the MariaDB documentation:
There are no special operators, and searches consist of one or more
comma-separated keywords.
The search clearly does not need to be comma-separated, as replacing commas with spaces gives the same result.
I assume that it breaks the string into separate keywords, but exactly how doesn't appear to be well documented.
With my test data, these two return the same results:
AGAINST('Quality Water Environment' IN NATURAL LANGUAGE MODE)
AGAINST('Quality Water åîøüé!##$%^&*()_+Environment' IN NATURAL LANGUAGE MODE)
The second search has some characters that I consider to be 'word characters' that seem to have no influence on the result.
So what exactly is accepted by this function, and what is filtered out?

What caseless comparison algorithm is CompareStringW using?

For compatibility reasons I need to replicate the behaviour of another application. It is using Unicode strings as identifiers but ignoring case and performing some sort of normalisation. By intercepting API calls I have determined it is using CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, SORT_STRINGSORT, ...) to do the comparison.
I can just call this function directly for every pair of strings in the set of strings I am considering but I would prefer a canonical form I can use in a hash table.
Does anyone know what algorithm CompareStringW uses with those flags set? Is it a standard Unicode algorithm?
Can I use NormalizeString and FoldString to generate this canonical form? If I can, what arguments do I need to pass?
edit: As David Heffernan pointed out I'll need to use FoldString as well as NormalizeString to do a proper caseless comparison.
The great Michael Kaplan (RIP) provided a lot of useful information about the NLS functions on his blog over the years and some posts have been archived before Microsoft made the blog internal only.
His post "A few of the gotchas of CompareString" describes these flags:
NORM_IGNORECASE - Ignore case. A better name for this flag might have been IGNORE_TERTIARYWEIGHT since that is what it accomplishes (it masks the tertiary weight), although it is obviously too late to consider such a change. It can cause undesirable results when used in the comparison of strings containing characters that depend on the weight for vital information, which thankfully is a very small number of cases. But if you are not expecting "ʏ", "Y", and "y" (U+028f, U+0059, and U+0079, a.k.a. LATIN LETTER SMALL CAPITAL Y, LATIN LETTER CAPITAL Y, and LATIN LETTER SMALL Y) to all be equal, then you may want to think twice about throwing this flag into the mix. You will also lose the distinctions of the final forms for Hebrew (e.g. "מ" and "ם", U+05de U+05dd a.k.a. HEBREW LETTER MEM and HEBREW LETTER FINAL MEM), Arabic (e.g. "ش" U+0634 a.k.a. ARABIC LETTER SHEEN and its isolated, final, initial, and medial forms (ﺵ, ﺶ, ﺷ, and ﺸ) at U+feb5, U+feb6, U+feb7, and U+feb8, and other languages.
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:
//
// Sorting Flags.
//
// WORD Sort: culturally correct sort
// hyphen and apostrophe are special cased
// example: "coop" and "co-op" will sort together in a list
//
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// co-op <------- hyphen (punctuation)
// cork
// went
// were
// we're <------- apostrophe (punctuation)
//
//
// STRING Sort: hyphen and apostrophe will sort with all other symbols
//
// co-op <------- hyphen (punctuation)
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// cork
// we're <------- apostrophe (punctuation)
// went
// were
//
The results might also vary depending on the Windows version, because newer versions support later Unicode versions.

What is VBS UCASE function doing to Japanese?

In order to avoid case conflicts comparing strings on an ASP classic site, some inherited code converts all strings with UCASE() first. This seems to work well across languages ... except Japanese. Here's a simple example on a Japanese string. I've provided the UrlEncoded values to make it clear how little is changing behind the scenes:
Server.UrlEncode("戦艦帝国") = %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
UCASE("戦艦帝国") = ƈ�ȉ�Ÿ�ś�
Server.UrlEncode(UCASE("戦艦帝国")) = %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
So is UCASE doing anything sensible with this Japanese string? Or is its behavior buggy, undefined, or known to be incompatible with Japanese?
(LCASE leaves the sample string alone. But now I'm wary of switching all comparisons to LCASE because I don't know if it bungles other non-western languages that do work with UCASE....)
https://msdn.microsoft.com/en-us/library/1systdcy(v=vs.84).aspx
Only lowercase letters are converted to uppercase; all uppercase letters and non-letter characters remain unchanged.
https://en.wikipedia.org/wiki/Letter_case
Most Western languages (particularly those with writing systems based on the Latin, Cyrillic, Greek, Coptic, and Armenian alphabets) use letter cases in their written form as an aid to clarity. Scripts using two separate cases are also called bicameral scripts. Many other writing systems make no distinction between majuscules and minuscules – a system called unicameral script or unicase.
The notion of lowercase and uppercase letters does not apply to the Chinese, Japanese, and Korean scripts, so UCase() should leave such a string unchanged.
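The encoded values in the question are consistent with one possible explanation, offered here as an assumption rather than documented behavior: the UTF-8 bytes of the string are uppercased as if each byte were a single Latin-1 character, which turns æ (0xE6), è (0xE8), and å (0xE5) into Æ (0xC6), È (0xC8), and Å (0xC5). A minimal Python sketch that reproduces the byte change (the helper name `upper_as_latin1` is hypothetical):

```python
from urllib.parse import quote


def upper_as_latin1(text: str) -> bytes:
    """Uppercase the UTF-8 bytes of `text` as if each byte were a Latin-1 character.

    Assumption: this imitates a single-byte-codepage UCASE applied to UTF-8 data.
    It is a guess that matches the bytes in the question, not documented behavior.
    Sketch only: a byte such as 0xB5 uppercases to a character outside Latin-1
    and would raise UnicodeEncodeError here.
    """
    raw = text.encode("utf-8")
    # Latin-1 maps every byte 0x00-0xFF to exactly one character.
    as_latin1 = raw.decode("latin-1")
    # Only bytes that happen to be lowercase Latin-1 letters change.
    return as_latin1.upper().encode("latin-1")


sample = "戦艦帝国"
print(quote(sample.encode("utf-8")))    # %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
print(quote(upper_as_latin1(sample)))   # %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
```

If this is what is happening, the damage depends on which bytes occur in the encoded text, which would explain why LCASE leaves this particular sample alone: none of its bytes is an uppercase Latin-1 letter.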

Can I alpha sort base32/64 encoded MD5 hashes?

I've got a massive file of hex encoded MD5 values that I'm using linux 'sort' utility to sort. The result is that the hashes come out in sequential order (which is what I need for the next stage of processing). E.g:
000001C35AE83CEFE245D255FFC4CE11
000003E4B110FE637E0B4172B386ACAC
000004AAD0EB3D896B654A960B0111FA
In the interest of speeding up the sort operation (and making the files smaller), I was considering encoding the data as base32 or base64.
The question is, would an alpha-sort of the base32/64 data get me the same result? My quick tests seem to indicate that it would work. For example, the above three hex strings correspond 1:1 to these base64 strings:
AAABw1roPO/iRdJV/8TOEQ==
AAAD5LEQ/mN+C0Fys4asrA==
AAAEqtDrPYlrZUqWCwER+g==
But I'm unsure as to the sort order when it comes to special characters used in Base64 like "/" and "+" and how those would be treated in the context of an alpha sort.
Note: I happen to be using the linux sort utility but the question still applies to other alpha-sorting tools. The tool used is not really part of the question.
I've since discovered that this isn't possible with the standard base32/64 implementations. There exists however a base32 variation called "base32hex" which preserves sort ordering, but there is no official "base64hex" equivalent.
Looks like that leaves creating a custom encoding like this.
EDIT:
This turned out to be trivial to solve: encode in Base64, then translate character by character using a custom alphabet whose characters are in ascending sort order.
Map from the standard MIME Base64 alphabet:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
To something like this:
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
Then sorting will work.
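A minimal Python sketch of the translation described above, using the two alphabets quoted in this answer. The function name `hex_to_sortable_b64` and the generated test digests are illustrative assumptions, not part of the original solution:

```python
import base64
import hashlib

# Alphabets quoted from the answer above.
STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SORTABLE = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
TO_SORTABLE = str.maketrans(STANDARD, SORTABLE)


def hex_to_sortable_b64(hex_digest: str) -> str:
    """Encode a hex digest as Base64, then remap it to the sortable alphabet."""
    raw = bytes.fromhex(hex_digest)
    # '=' padding is not in the table, so it passes through unchanged.
    return base64.b64encode(raw).decode("ascii").translate(TO_SORTABLE)


# Illustrative data: uppercase hex MD5 digests of the strings "0".."999".
hex_digests = [hashlib.md5(str(i).encode()).hexdigest().upper() for i in range(1000)]

by_hex = sorted(hex_digests)
by_encoded = sorted(hex_digests, key=hex_to_sortable_b64)
print(by_hex == by_encoded)  # True
```

This relies on every value having the same length, so the padding always lands in the same position, and on the sort comparing raw character codes. With GNU sort that typically means running under the C locale (for example `LC_ALL=C sort`), since locale-aware collation may order these characters differently.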

What characters can des(unix) have?

All lowercase and uppercase, all digits, dot and slash.
Have I missed anything?
This seems like a very easy question to answer with Google, but I haven't actually found any information about it :(
Edit, in case anybody misunderstood: I mean what characters the OUTPUT can have.
I'm not asking what kind of input I can hash; I'm asking what the hash looks like.
DES (like many other encryption algorithms) works at the bit level. It has no concept of what is a valid character and what isn't, so each output byte can be anything from 0x00 to 0xFF.
Any output to the contrary is likely just characters not supported by whatever you're trying to display the output with, which are typically replaced by some predefined character.
The output can also be converted to hex characters for cosmetic or storage purposes (I'm not sure whether the des command would do this - it's simple enough to see by just running it), e.g. a single 'a' (0x61) character will be converted to two characters: '61'. The resulting output characters would thus be in the range A-F or a-f and 0-9.
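As a small Python illustration of the hex conversion described above: the bytes below are arbitrary stand-ins for cipher output (no DES is performed), chosen only to show that any byte value maps to two characters from 0-9 and a-f.

```python
# Arbitrary example bytes standing in for raw cipher output (not real DES output).
cipher_like = bytes([0x00, 0x61, 0x7F, 0xFF])

hex_text = cipher_like.hex()   # two hex characters per byte
print(hex_text)                # 00617fff

# Every character of the hex form is a digit or a letter a-f.
print(set(hex_text) <= set("0123456789abcdef"))   # True

# The conversion is lossless: the original bytes can be recovered.
print(bytes.fromhex(hex_text) == cipher_like)     # True
```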
Note that keys require ASCII, but this is not a requirement of DES itself, as can be derived from "Bugs" on the same page, and it doesn't affect the range of output values.
The DES algorithm is considered obsolete and unsafe. The DES standard (FIPS 46-3) was withdrawn in 2005.
Use at your own risk.
See http://en.wikipedia.org/wiki/Data_Encryption_Standard

Resources