Elasticsearch cannot find standalone reserved characters

I use Kibana to run an Elasticsearch query_string query.
When I search for a word that contains an escapable character (reserved characters like '\', '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']', '^', '"', '~', '*', '?', ':', '/'), I get the expected result.
My example uses '!'.
But when I search for a single reserved character on its own, I get nothing back.
How can I search for a standalone reserved character?

TL;DR You'll need to specify an analyzer (+ a tokenizer) which ensures that special chars like ! won't be stripped away during the ingestion phase.
In the first screenshot you've correctly tried running _analyze. Let's use it to our advantage.
See, when you don't specify any analyzer, ES will default to the standard analyzer which is, by definition, constrained by the standard tokenizer which'll strip away any special chars (except the apostrophe ' and some other chars).
Running
GET dev_application/_analyze?filter_path=tokens.token
{
"tokenizer": "standard",
"text": "Se, det ble grønt ! a"
}
thus yields:
["Se", "det", "ble", "grønt", "a"]
This means you'll need to use some other tokenizer which'll preserve these chars instead. There are a few built-in ones available, the simplest of which would be the whitespace tokenizer.
Running
GET _analyze?filter_path=tokens.token
{
"tokenizer": "whitespace",
"text": "Se, det ble grønt ! a"
}
retains the !:
["Se,", "det", "ble", "grønt", "!", "a"]
So,
1. Drop your index:
DELETE dev_application
2. Then set the mappings anew:
(I chose the multi-field approach which'll preserve the original, standard analyzer and only apply the whitespace tokenizer on the name.splitByWhitespace subfield.)
PUT dev_application
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "splitByWhitespaceAnalyzer": {
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "splitByWhitespace": {
            "type": "text",
            "analyzer": "splitByWhitespaceAnalyzer"
          }
        }
      }
    }
  }
}
3. Reindex
POST dev_application/_doc
{
"name": "Se, det ble grønt ! a"
}
4. Search freely for special chars:
GET dev_application/_search
{
  "query": {
    "query_string": {
      "default_field": "name.splitByWhitespace",
      "query": "*\\!*",
      "default_operator": "AND"
    }
  }
}
Do note that if you leave default_field out, it won't work, because the top-level name field is still analyzed with the standard analyzer.
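To double-check which analyzer applies to a given (sub)field, a field-level _analyze call can help; the request below simply reuses the field names from the mapping above:
GET dev_application/_analyze?filter_path=tokens.token
{
  "field": "name.splitByWhitespace",
  "text": "Se, det ble grønt ! a"
}
Running the same request with "field": "name" should show the ! being dropped again, which is the standard-analyzer behaviour described earlier.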
Indeed, you could reverse this approach, apply whitespace by default, and create a multi-field mapping for the "original" indexing strategy (-> the only config being "type": "text").
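A minimal sketch of that reversed setup could look like the following; the original subfield name is just a placeholder:
PUT dev_application
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "splitByWhitespaceAnalyzer": {
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "splitByWhitespaceAnalyzer",
        "fields": {
          "original": {
            "type": "text"
          }
        }
      }
    }
  }
}
With this in place, query_string against name preserves special chars by default, while name.original keeps the standard-analyzed tokens.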
Shameless plug: I wrote a book on Elasticsearch and you may find it useful!

Standard analyzer
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
So no token is generated for a standalone hyphen. If you want to find text containing a hyphen, you need to look into keyword fields and use a wildcard for the full-text-style match:
{
  "query": {
    "query_string": {
      "query": "*\\-*"
    }
  }
}
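As a sketch of the keyword-field approach (index and field names are placeholders), one could index the text into a keyword subfield and run the wildcard against it:
PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "wildcard": {
      "name.raw": {
        "value": "*-*"
      }
    }
  }
}
Since the keyword subfield is not analyzed, the hyphen survives indexing and no escaping is needed inside the wildcard value.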

Related

ElasticSearch apostrophe handling

I have a problem with handling apostrophes in Elasticsearch.
I have a doc with the field value 'Pangk’ok' and I want to be able to find the document with any of the search terms 'Pangk’ok', 'Pangkok' or 'Pangk ok'.
I've tried to do this with the following analyzer:
"filter": {
"my_word_delimiter_graph": {
"type": "word_delimiter_graph",
"catenate_words": true
}
},
"analyzer": {
"source_analyzer": {
"tokenizer": "icu_tokenizer",
"filter": [ "my_word_delimiter_graph", "icu_folding" ],
"type": "custom"
}
}
It works with the match query for all of those searches, but fails when the term is part of a phrase and a match_phrase query is used.
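One way to see what trips up match_phrase is to inspect the tokens and their positions; the index name below is a placeholder for wherever source_analyzer is defined, and icu_tokenizer/icu_folding need the analysis-icu plugin:
GET my_index/_analyze
{
  "analyzer": "source_analyzer",
  "text": "Pangk’ok"
}
The output should show the catenated Pangkok token sharing a position with the split tokens.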
And this case is actually described in the Elasticsearch documentation:
catenate_words
(Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ super, superduperxl, duper, xl ]. Defaults to false.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
So, my question is: is there some other solution to this problem?

What keywords are used by the Swedish analyzer?

On this part of the elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, the keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch look at some other keywords list of pre-defined Swedish keywords? Because in the example above it looks like there is no such list provided in the settings.
In other words, is it solely up to me to define keywords or does Elasticsearch look at some other list to find keywords by default?
Yes, you need to specify this list yourself. Otherwise, this filter won't do anything.
As per the Elasticsearch documentation:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before
any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to config location, or absolute) to a list of
words.
keywords_pattern
A regular expression pattern to match against words in the text.
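For example, a sketch of the same filter using a keywords file instead of an inline list (the path is just a placeholder, relative to the Elasticsearch config directory):
"swedish_keywords": {
  "type": "keyword_marker",
  "keywords_path": "analysis/swedish_keywords.txt"
}
A keywords_pattern would work similarly, marking every token that matches the regular expression so the stemmer leaves it alone.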
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html

Is Simple Query Search compatible with shingles?

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": [
              "text_2"
            ],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document to NOT match, since the above query should produce a shingle of "porsches small red", which is a different order. However, when I look at the match explanation I am only seeing the single word tokens "red" "small" "porsche", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well, but the small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query only contains one single word (e.g. porsches), that search-time shingle analyzer can still generate a single unigram provided output_unigrams_if_no_shingles is set to true, and the query would still match your document. If that's not desired, simply leave output_unigrams_if_no_shingles at its default of false in your shingle search analyzer.
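A sketch of what that index-time/search-time split could look like in current Elasticsearch syntax; the analyzer and filter names are made up, and the shingle sizes mirror the question's settings:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "index_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true
        },
        "search_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "shingle_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "index_shingle"]
        },
        "shingle_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "search_shingle"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_2": {
        "type": "text",
        "analyzer": "shingle_index_analyzer",
        "search_analyzer": "shingle_search_analyzer"
      }
    }
  }
}
With a mapping along these lines, simple_query_string no longer needs the analyzer parameter in the query body; the field's search_analyzer produces the bigrams, so "porsches small red" with the AND operator should not match a document containing small red porsches.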

Elasticsearch strange filter behaviour

I'm trying to replace a particular string inside a field. So I used a custom analyzer and a character filter just as described in the docs, but it didn't work.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=> " ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
which should replace all double colons (::) in a field with spaces. I then update my mapping to use the analyzer:
{
  "posts": {
    "properties": {
      "id": {
        "type": "long"
      },
      "title": {
        "type": "string",
        "analyzer": "my_analyzer",
        "fields": {
          "simple": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Then I put a document in the index:
{
"id": 1,
"title": "Person::Bruce Wayne"
}
I then test whether the analyzer works, but it appears it doesn't: when I send https://localhost:/first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close, I get two tokens back, 'PersonSomeone' (joined together) and 'Close'. Am I doing this right? Maybe I should escape the space somehow? I use Elasticsearch 1.3.4.
I think the whitespace in your char_filter pattern is being ignored. Try using the unicode escape sequence for a single space instead:
"mappings": [ "::=>\\u0020"]
Update:
In response to your comment, the short answer is yes, the example is wrong. The docs do suggest that you can use a mapping character filter to replace a token with another one which is padded by whitespace, but the code disagrees.
The source code for the MappingCharFilterFactory uses this regex to parse the settings:
// source => target
private static Pattern rulePattern = Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");
This regex matches (and effectively discards) any whitespace (\\s*) surrounding the second replacement token ((.*)), so it seems that you cannot use leading or trailing whitespace as part of your replacement mapping (though it could include interstitial whitespace). Even if the regex were different, the matched token is trim()ed, which would have removed any leading and trailing whitespace.
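A quick way to check whether the \u0020 workaround takes effect is to recreate the index with the escaped mapping (delete it first if it already exists) and run _analyze again; this is shown in the body-based syntax of current Elasticsearch, while on 1.x the query-string form from the question works the same way:
PUT first_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=>\\u0020" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET first_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Person::Bruce Wayne"
}
If the replacement is applied, the tokens should come back as Person, Bruce and Wayne instead of PersonBruce and Wayne.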

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to uppercase/lowercase letter combinations and punctuation marks (like the ! you have in your example), this would be a solution, though not too simple and obvious:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
1. use a keyword tokenizer to keep the text as is and not have it split into tokens
2. use the lowercase filter to get rid of the mixed uppercase/lowercase characters
3. use the trim filter to get rid of the trailing and leading whitespace
4. use a pattern_replace filter to get rid of the punctuation. This is needed because a keyword tokenizer won't touch the characters inside the text; a standard analyzer would handle the punctuation, but it would also split the text, whereas you need it as is
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
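To sanity-check the analysis chain, one could run _analyze against an index created with the settings above (the index name is a placeholder):
GET my_index/_analyze
{
  "analyzer": "my_keyword_lowercase",
  "text": " i am a LITTLE TeaPot short and stout! "
}
This should return a single token, i am a little teapot short and stout, which is also what gets indexed for the original document, so the match query above finds it.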
