What is the syntax for MariaDB 'IN NATURAL LANGUAGE MODE'?

According to the MariaDB documentation:
There are no special operators, and searches consist of one or more
comma-separated keywords.
The search clearly does not need to be comma-separated, as replacing commas with spaces gives the same result.
I assume that it breaks the string into separate keywords, but exactly how doesn't appear to be well documented.
With my test data, these two return the same results:
AGAINST('Quality Water Environment' IN NATURAL LANGUAGE MODE)
AGAINST('Quality Water åîøüé!##$%^&*()_+Environment' IN NATURAL LANGUAGE MODE)
The second search has some characters that I consider to be 'word characters' that seem to have no influence on the result.
So what exactly is accepted by this function, and what is filtered out?

Related

What are valid date-time separators in RFC3339 strings?

I'm quite confused as to what's allowed as the time separator/designator in the RFC3339 standard. By time separator I mean the sequence of characters that draw the line between date and time.
The standard states in section 5.6 different things that are unclear or conflicting. First of all, it says that the production rule for a full datetime is this:
date-time = full-date "T" full-time
Meaning that the delimiter between the date and the time is an uppercase T. Right after comes this:
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, but OK, it still sounds to me within the realm of reasonable. Then the following is stated:
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character but anything, which is what the "(say)" implies? Or does "this syntax" refer to ISO8601, so that the note unnecessarily describes a detail of that other standard?
In other words, are the following valid RFC3339 strings?
2020-09-07 20:26:03.623359300+02:00
2020-09-07hey johnny20:26:03.623359300+02:00
2020-09-07💩20:26:03.623359300+02:00
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, [...]
It does not. See 2.3 Terminal Values of RFC 2234:
Literal text strings are interpreted as a concatenated set of
printable characters.
NOTE: ABNF strings are case-insensitive and
the character set for these strings is us-ascii.
So it is allowed to use t here.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character
but anything?
This "deviation" exists for readability when the value is displayed to the user. So when the value is shown to the user in some form, it can be displayed as:
2020-09-07 20:26:03.623359300+02:00
2020-09-07, 20:26:03.623359300+02:00
That way it might be easier for the user to see the clear space between the date and time, so they don't have to look for the T or t character to find the separation. It is indeed a vague sentence, as it basically means the application can do anything.
To answer your question: These listed date formats are not valid according to RFC 3339.
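A minimal sketch of a purely syntactic check, assuming the grammar in section 5.6 of RFC 3339 is transcribed into a Ruby regular expression. The constant and method names are hypothetical, and field ranges (month 13, hour 25 and so on) are deliberately not validated:

```ruby
# Transcription of the RFC 3339 date-time production into a regex.
# Only the shape is checked; numeric ranges are not.
RFC3339_DATE_TIME = /\A
  \d{4}-\d{2}-\d{2}          # full-date
  [Tt]                       # separator: ABNF literals are case-insensitive
  \d{2}:\d{2}:\d{2}          # partial-time
  (?:\.\d+)?                 # optional fractional seconds
  (?:[Zz]|[+-]\d{2}:\d{2})   # time-offset
\z/x

# Hypothetical helper: true when the string has the RFC 3339 date-time shape.
def rfc3339_syntax?(str)
  RFC3339_DATE_TIME.match?(str)
end

puts rfc3339_syntax?("2020-09-07T20:26:03.623359300+02:00")
puts rfc3339_syntax?("2020-09-07 20:26:03.623359300+02:00")
puts rfc3339_syntax?("2020-09-07hey johnny20:26:03.623359300+02:00")
```

Under this strict reading only the first string is accepted. A parser that wants to tolerate the space-separated display form would have to relax the separator explicitly.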
Short answer: T (or t as discouraged alternative).
After reading as much as I could on this, it turns out the time separator must be a T or t. What made me think so is, first of all, this thread on the GNU lists, where F. Alexander Njemz contacted the authors of RFC3339, Graham Klyne and Chris Newman, asking whether T is mandatory, and got this response from Mr. Klyne:
In short: "yes"
Per section 5.5, the intent in this draft was to specify a timestamp format using
elements from and compatible with 8601, but eliminating as far as
reasonable any variations that could make timestamp data harder to
process. This includes making the 'T' mandatory in date+time values.
#g
Just for clarity's sake, this is stated in the section 5.5:
Simplicity is achieved by making most fields and punctuation
mandatory.
This clearly clashes with a non-mandatory T and strongly suggests to me that "this syntax" in that problematic passage refers to ISO8601 and not RFC3339.
For those who want to read more, here are some links regarding the confusion created by this specific point:
https://lists.gnu.org/archive/html/bug-coreutils/2006-05/msg00014.html
http://validator.w3.org/feed/docs/error/InvalidRFC3339Date.html
https://www.rfc-editor.org/errata/eid5783
Plus of course divergent implementations. For instance, the developers of GNU Date chose to use a space character:
$ date --rfc-3339=seconds
2020-09-14 14:53:51+02:00

What is VBS UCASE function doing to Japanese?

In order to avoid case conflicts comparing strings on an ASP classic site, some inherited code converts all strings with UCASE() first. This seems to work well across languages ... except Japanese. Here's a simple example on a Japanese string. I've provided the UrlEncoded values to make it clear how little is changing behind the scenes:
Server.UrlEncode("戦艦帝国") = %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
UCASE("戦艦帝国") = ƈ�ȉ�Ÿ�ś�
Server.UrlEncode(UCASE("戦艦帝国")) = %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
So is UCASE doing anything sensible with this Japanese string? Or is its behavior buggy, undefined, or known to be incompatible with Japanese?
(LCASE leaves the sample string alone. But now I'm wary of switching all comparisons to LCASE because I don't know if it bungles other non-western languages that do work with UCASE....)
https://msdn.microsoft.com/en-us/library/1systdcy(v=vs.84).aspx
Only lowercase letters are converted to uppercase; all uppercase letters and non-letter characters remain unchanged.
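The values quoted in the question do not fit that description: only the first byte of each character changes (E6 becomes C6, E8 becomes C8, E5 becomes C5). One possible reading, which is an assumption and not something the question or the documentation confirms, is that the UTF-8 string was upper-cased byte by byte as if it were text in a Latin-1-style code page, where 0xE0 to 0xFE are lowercase letters. The Ruby sketch below only models that hypothesis and compares it with the quoted values; the method names are made up for illustration:

```ruby
# Hypothetical model: upper-case each byte as if it were a Latin-1 character.
def bytewise_latin1_upcase(str)
  str.bytes.map do |b|
    ascii_lower  = b.between?(0x61, 0x7A)               # a-z
    latin1_lower = b.between?(0xE0, 0xFE) && b != 0xF7  # accented lowercase, skipping the division sign
    ascii_lower || latin1_lower ? b - 0x20 : b
  end
end

# Render a byte array the way the question shows it (%XX per byte).
def percent_encode(bytes)
  bytes.map { |b| format("%%%02X", b) }.join
end

original = "戦艦帝国"
puts percent_encode(original.bytes)
puts percent_encode(bytewise_latin1_upcase(original))
puts original.upcase == original  # a Unicode-aware upcase leaves the string alone
```

The first two lines printed match the two percent-encoded values from the question, which is consistent with (but does not prove) the byte-wise explanation.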
https://en.wikipedia.org/wiki/Letter_case
Most Western languages (particularly those with writing systems based on the Latin, Cyrillic, Greek, Coptic, and Armenian alphabets) use letter cases in their written form as an aid to clarity. Scripts using two separate cases are also called bicameral scripts. Many other writing systems make no distinction between majuscules and minuscules – a system called unicameral script or unicase.
"lowercase or uppercase letters" does not apply in Chinese-Japanese-Korean languages, hence, the output of UCase() should remain unchanged.

What does Elasticsearch's auto_generate_phrase_queries do?

In the docs for query string query, auto_generate_phrase_queries is listed as a parameter but the only description is "defaults to false." So what does this parameter do exactly?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
This maps directly to Lucene's org.apache.lucene.queryparser.classic.QueryParserSettings#autoGeneratePhraseQueries. When the analyzer is applied to the query string, this setting lets Lucene generate quoted phrase queries rather than separate keyword queries.
Quoting:
SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to
generate phrase queries if multiple tokens are generated from a single
non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
delimited languages.
where the word delimiter behaves as described in WordDelimiterFilter.html
The important thing to note is "single non-quoted analysis string": the setting only matters if your query string is not quoted. If you are already searching for a quoted phrase, it has no effect.
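The pdp-11 example from the changelog entry can be mimicked with a toy model. This is not Lucene, Solr, or Elasticsearch code: the method name is hypothetical, and a plain split on non-alphanumeric characters stands in for the analyzer.

```ruby
# Toy model of the behaviour described above; the real query parser is far more involved.
def build_query(field, term, auto_generate_phrase_queries:)
  # Stand-in for an analyzer such as WordDelimiterFilter: split on non-alphanumerics.
  tokens = term.split(/[^[:alnum:]]+/).reject(&:empty?)
  return "#{field}:#{tokens.first}" if tokens.size == 1

  if auto_generate_phrase_queries
    %(#{field}:"#{tokens.join(' ')}")                           # one phrase query
  else
    "(" + tokens.map { |t| "#{field}:#{t}" }.join(" OR ") + ")"  # independent term queries
  end
end

puts build_query("text", "pdp-11", auto_generate_phrase_queries: true)
puts build_query("text", "pdp-11", auto_generate_phrase_queries: false)
```

With the flag on, the two tokens must appear next to each other; with it off, either token alone is enough to match.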

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。' where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How would I be able to check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page which will display in the relevant language depending on the location from which it is being viewed. And I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, indexed by a variable with a name like 'language', which lets the tests change the language on the fly. So validations would look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but fall within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions in Ruby, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle (note that it would fail if no digits appear; if you want zero or more digits, you would need a * in place of the +).
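Since the strings come from property files, the regex can also be derived from the template instead of being typed by hand. A minimal sketch, assuming a single {0} placeholder that always stands for a run of ASCII digits; the helper name is hypothetical:

```ruby
# Turn a property-file template with a {0} placeholder into an anchored regex.
def template_to_regex(template)
  # Escape the literal text on both sides of the placeholder.
  literal_parts = template.split("{0}", -1).map { |part| Regexp.escape(part) }
  # Put a capturing group of one or more digits where {0} was.
  Regexp.new("\\A" + literal_parts.join("(\\d+)") + "\\z")
end

english  = template_to_regex("Waiting time is {0} minutes.")
japanese = template_to_regex("待ち時間は{0}分です。")

puts english.match("Waiting time is 15 minutes.")[1]
puts japanese.match("待ち時間は7分です。")[1]
puts english.match?("Waiting time is minutes.")
```

The captured group gives the number for further assertions. The anchors assume the element text has no surrounding whitespace, so it may need a strip first.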
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

Ruby (on Rails) Regex: removing thousands comma from numbers

This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: the pattern inside it must match at the current position, but it does not become part of the text that is actually matched (and subsequently replaced).
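As a quick check, the lookahead version can be run against the sample inputs from the question. The wrapper method name is only for illustration:

```ruby
# Remove a comma only when exactly three digits follow it.
def strip_thousands_commas(str)
  str.gsub(/,(?=\d{3}\b)/, "")
end

inputs = ["123", "123.45", "123,45", "1,234", "1,234.56", "12,345.67", "12,345,67"]
inputs.each do |input|
  puts "#{input} -> #{strip_thousands_commas(input)}"
end
```

The decimal commas in 123,45 and 12,345,67 survive because only two digits follow them.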
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
In your example, the number of digits after the last separator (either , or .) shows that it is a decimal point, since there are 2 lone digits. In most cases, if the last group of digits does not have 3 digits, you can assume that the separator in front of it is a decimal point. Another sign: when a separator appears multiple times in a big number, it must be a thousands separator rather than a decimal point.
However, I can give you a string 123,456 or 123.456 without any sort of context. It is impossible to tell whether they are "123 thousand 456" or "123 point 456".
You need to scan the document for clues as to whether , is used as the thousands separator or the decimal point, and vice versa for the dot. With that context, you can safely apply the same method to remove the thousands separators.
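For the prices in the question, the context is already given: values are either whole numbers or go to the hundredths place. A minimal sketch of the "two trailing digits means decimal" heuristic under exactly that assumption; the method name is hypothetical, and the approach would misread inputs with one or three decimal places:

```ruby
# Normalize a price string to "digits.cc", assuming a separator followed by
# exactly two final digits is the decimal mark and every other separator
# is a thousands separator.
def normalize_price(str)
  if (m = str.match(/\A(.*)[.,](\d{2})\z/))
    whole = m[1]
    cents = m[2]
  else
    whole = str
    cents = "00"
  end
  "#{whole.delete(',.')}.#{cents}"
end

["123", "123,45", "1,234", "1,234.56", "12,345,67"].each do |input|
  puts "#{input} -> #{normalize_price(input)}"
end
```

Because the decimal mark is identified first, both , and . can then be stripped from the whole-number part without caring which convention the source used.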
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing about them and deciding not to support them is better than assuming things will work.
