Rules for field names in ElasticSearch 6? - elasticsearch

Currently all what I can find online is:
must not start with underscore "_"
must not contain comma ","
must not contain hash mark "#"
usage of point "." is discouraged but possible
field names must not be longer than 255
But it seems that these are the rules for ElasticSearch 5 and older versions.
I did some experiments and found:
using dots (.) may result in various kinds of errors, e.g. illegal_state_exception, array_index_out_of_bounds_exception, but sometimes it's legal
empty strings are not allowed (illegal_argument_exception)
leading underscores, commas, hash marks seem to be legal in ElasticSearch 6
field names can be longer than 255 (but perhaps there's a new limit?)
I wonder whether there's an official document for this? Am I just being blind?

We are currently planning an upgrade from 5.6.5 to 6.2.x.
I'm looking for evidence to support the worrying comment "...as underscores in field names will not be allowed" mentioned in Breaking Changes for Watcher in 6.0.0-alpha2.
I've been unable to find any additional evidence that underscores are now verboten. I'll open a support case referencing this question to get an official response on this.

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack, indexing the constellation as __Foo__ and then changing my search query accordingly by adding the __ prefix and suffix to the selected constellation.

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches string with the word "deprecated" without the # sign.
I tried to use escaping for # according to the Lucene Documentation. i.e. message:"\\#deprecated" - without change in results.
How can I query to exact match the #deprecated text exact match only
Why is this happening?
You problem isn't an issue with query syntax, which is what escaping is for, it's with analysis. You analyzer removes punctuation, because it's parsing it as full text. It removes #, in much the same way that it will remove periods and commas.
So, after analysis (assuming standard analysis) of something like: "Class is #deprecated" the token stream generated will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed form of "#deprecated" and "deprecated" are identical, so it is impossible to have a query that can differentiate between them as it is currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. If you use WhitespaceAnalyzer, you are going to have to contend with other punctuation as well, and a search for "sentence"
would not find "match at the end of this sentence.", because of the period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries

How to use a DN containing commas as the attribute value in an LDAP search filter?

Was attempting to search our directory based on an attribute whose value is a DN. However, our user RDNs are of the form CN=Surname, GivenName, which requires that the comma be quoted in the full DN. But given an attribute like manager whose value is the DN of another user, I was unable to search for all users having specific manager. I tried (manager=CN=Surname\, GivenName,CN=users,DC=mydomain,DC=com), but got a syntax error "Bad search filter". I tried various options for quoting the DN, but all either gave me a syntax error or failed to match any objects. What am I doing wrong?
(Note that if I were looking for user objects directly, I could search for simply (CN=Surname, GivenName), with no quoting required, but I was searching for users having a specific manager. The comma-containing attribute value only becomes a problem when part of a Distinguished Name.)
The problem is that quoting the comma in the Common Name is not for the benefit of the filter parser, but for the benefit of the DN parser; the attribute value passed to that by the filter has to literally contain the backslash character. Unfortunately, the backslash is also (differently) special in LDAP filters, thus the syntax errors.
The solution is simple, but it isn't as obvious as doubling the backslash; backslash in LDAP filters works like % in URIs, so you have to use a literal backslash followed by the 2-digit hexadecimal code point for a backslash:
(manager=CN=Surname\5c, Givenname,OU=org,DC=mydomain,DC=com)
It turns out there's an example of this specific use case at the very bottom of https://docs.oracle.com/cd/E19424-01/820-4811/gdxpo/index.html#6ng8i269q.

What constitutes a valid URI query parameter key?

I'm looking over Section 3.4 of RFC 3986 trying to understand what constitutes a valid URI query parameter key, but I'm not seeing a clear answer.
The reason I'm asking is because I'm writing a Ruby class that composes a URI with query parameters. When a new parameter is added I want to validate the key. Based on experience, it seems like the key will be invalid if it requires any escaping.
I should also say that I plan to validate the key. I'm not sure how to go about validating this data either, but I do know that in all cases I should escape this value.
Advice is appreciated. Advice in the context of how validation might already be possible through say a Ruby Gem would also be a plus.
I could well be wrong, but that spec seems to say that anything following '?' or '#' is valid as long. I wonder if you should be looking more at the spec for 'application/x-www-form-urlencoded' (ie. the key/value pairs we're all used to)?
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
This is the default content type. Forms submitted with this content
type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by +', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by =' and name/value pairs are separated from each other by &'.
I don't believe key=value is part of the RFC, it's a convention that has emerged. Wikipedia suggests this is an 'W3C recommendation'.
Seems like some good stuff to be found searching on the application/x-www-form-urlencoded content type.
http://www.w3.org/TR/REC-html40/interact/forms.html#form-data-set

Sphinx char_set resource?

I need to have Sphinx index some characters that are apparently part of the internal char_set, notably "," and "&".
As far as I understand there is no way (?!) to simply add these as indexable chars, I need to
Make a char_set table in my index
Not only include the "," and "&" as indexable characters but now manually add the char_set Sphinx uses
This is a little frustrating as it seems one should be able to add back in chars without manually recreating the char_set used internally. However if that is the situation that is the situation, yet in the documentation
http://sphinxsearch.com/docs/current/conf-charset-table.html
I don't see or understand how I would manually specify a char_set table and exclude / index the two or three characters I want.
You do need to duplicate the 'stock' charset_table before you can add to it.
At less than 100 characters, even if you typed it rather than copy/pasting, would still be les effort than typing this whole question!
(even less if use the the english shorthand)
Just copy what there and add your new chars to the end. If you don't know the unicode mapping, something like asciitable.com is useful at least for the chars in ASCII.

Resources