Elasticsearch "simple_query_string" vs. "query_string" field analysis bug? - elasticsearch

Recently we discovered that, since we aren't sanitizing search terms as they come into our system, we would get occasional parsing exceptions in Elasticsearch when special characters such as / (forward slash), etc. were used with "query_string". So we decided to switch to "simple_query_string". However, we discovered that the two do not appear to use the same analyzers. I reviewed When Analyzers Are Used to see if it indicated there would be a difference between simple and regular query string, but it did not, so I'm wondering if this is a bug. For example:
"query_string": { "query": "sales", "fields": [ "title" ] }
will use the analyzer for the "title" field which is our "en_analyzer" (see definition below) and properly stem "sales" to "sale" and find the matching documents. Simply changing "query_string" to "simple_query_string" will not. We have to search for "sale" or add an analyzer to the query, like so:
"simple_query_string": { "query": "sales", "fields": [ "title" ], "analyzer": "en_analyzer" }
Of course, not all our fields are analyzed the same way, so the default behavior described in the documentation I referenced above makes perfect sense and is what we desire. Is this a bug, or does "simple_query_string" just not behave the same way with respect to field analysis during a query? We are using ES 1.7.2.
The relevant parts of our definition for "en_analyzer" are:
"en_analyzer": { "type": "custom", "tokenizer": "icu_tokenizer", "filter": [ "icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding", "shingle_filter" ], "char_filter": [ "html_strip" ] }
with:
"en_stop_filter": { "type": "stop", "stopwords": [ "_english_" ] }, "en_stem_filter": { "type": "stemmer", "name": "minimal_english" }
Link to my same question on GitHub ... though I edited this one more carefully after asking on GitHub first. So far no response there.

In 1.7.2, simple_query_string uses the default standard analyzer when none is specified and won't use any search analyzer defined on the field being searched. When the documentation doesn't say, one must turn to the ultimate source of knowledge, i.e. the source code. In SimpleQueryStringParser.java, the class comment states:
analyzer: analyzer to be used for analyzing tokens to determine which kind of query they should be converted into, defaults to "standard"
And a bit further down in the same class, we can read:
Use standard analyzer by default
And that behavior hasn't changed in the ES 2.x releases. As can be seen in the source code for SimpleQueryStringBuilder.java, if no analyzer is specified in the query, then the standard analyzer is used.
Quoting a comment from the source linked above:
Use standard analyzer by default if none specified
So to answer your question, that's not a bug, but the intended behavior.
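If you want to double-check this against your own index, the _analyze API makes the difference visible (my_index is a placeholder; the request-body form below is the 2.x syntax, while on 1.7.2 you would pass analyzer and text as query-string parameters instead):

GET /my_index/_analyze
{
  "analyzer": "en_analyzer",
  "text": "sales"
}

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "sales"
}

The first call should return the stemmed token sale, while the second returns sales unchanged, which is why simple_query_string only finds the stemmed documents once you pass the analyzer explicitly (or search for the stemmed form).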

Related

ElasticSearch apostrophe handling

I have a problem with handling apostrophes in ElasticSearch.
I have a doc with field value = 'Pangk’ok' and I want to be able to find the document by the search requests: 'Pangk’ok', 'Pangkok' or 'Pangk ok'.
I've tried to do this with the following analyzer:
"filter": {
"my_word_delimiter_graph": {
"type": "word_delimiter_graph",
"catenate_words": true
}
},
"analyzer": {
"source_analyzer": {
"tokenizer": "icu_tokenizer",
"filter": [ "my_word_delimiter_graph", "icu_folding" ],
"type": "custom"
}
}
It works with the match query for all of those searches, but fails when the term is part of a phrase and a match_phrase query is used.
And this case is actually described in the Elasticsearch documentation:
catenate_words
(Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ super, superduperxl, duper, xl ]. Defaults to false.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
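Just to illustrate the quoted behaviour, the catenated tokens can be reproduced with a transient analyzer in the _analyze API (this request is only a sketch, not part of my actual setup):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "word_delimiter_graph", "catenate_words": true }
  ],
  "text": "super-duper-xl"
}

This returns the tokens super, superduperxl, duper and xl, which is exactly the token stream the warning above is talking about.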
So, my question is: is there some other solution to my problem?

Why does the Elasticsearch Arabic analyzer come alongside the Persian analyzer in the Persian examples?

I recently checked the language analyzers in the Elasticsearch docs and wonder why, in this example (analysis-lang-analyzer), the Persian analyzer comes after the Arabic analyzer. Is that needed for any reason? I mean, is the Persian analyzer alone not enough for the Persian language?
Persian analyzer
The Persian analyzer could be reimplemented as a custom analyzer as follows:
PUT /persian_example
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
          "type": "mapping",
          "mappings": [ "\\u200C=>\\u0020" ]
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords": "_persian_"
        }
      },
      "analyzer": {
        "rebuilt_persian": {
          "tokenizer": "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
I don't think it's required; you can index a few sentences after removing arabic_normalization and use the _analyze API to check the tokens generated by the Persian analyzer and see whether it produces the correct, expected tokens.
You can also open an issue on the Elastic repo https://github.com/elastic/elasticsearch/issues, which is the best place to ask this question, as someone from Elastic can comment there on whether it's a documentation issue or a real issue.
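For example, something along these lines (the text value is just a placeholder; replace it with a representative Persian sentence and compare the two outputs):

GET /persian_example/_analyze
{
  "analyzer": "rebuilt_persian",
  "text": "a representative Persian sentence goes here"
}

GET /persian_example/_analyze
{
  "tokenizer": "standard",
  "char_filter": [ { "type": "mapping", "mappings": [ "\\u200C=>\\u0020" ] } ],
  "filter": [
    "lowercase",
    "decimal_digit",
    "persian_normalization",
    { "type": "stop", "stopwords": "_persian_" }
  ],
  "text": "a representative Persian sentence goes here"
}

The second request rebuilds the same chain inline but without arabic_normalization, so any difference in the emitted tokens shows what that filter actually contributes.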
These analyzers were made by Lucene users and the Lucene community. You should check what arabic_normalization does in Lucene and whether it is necessary.
This is the arabic_normalization class description:
Normalizer for Arabic.
Normalization is defined as:
Normalization of hamza with alef seat to a bare alef.
Normalization of teh marbuta to heh.
Normalization of dotless yeh (alef maksura) to yeh.
Removal of Arabic diacritics (the harakat).
Removal of tatweel (stretching character).
(As someone who knows Persian) I think it is better for Persian indexing to first apply the Arabic normalizer.

Elasticsearch - can I define index time analyzer on document level?

I want to index pages in multiple languages into a single index. But for each language I need to define custom language analyzer. So for english page it would use english analyzer, for czech page it would use czech analyzer.
At search time I would set the correct analyzer based on current locale as I do not need to search across languages.
It appears that this was possible in early versions of Elasticsearch, but I cannot find a way to do it in 7.6.
Is there a way to achieve this, or do I really need to create an index for each type in each language? That would lead to many indices with only a small number of indexed documents.
Or is there a better way to handle this scenario? We are considering about 20 languages and several document types (as far as I understand, types are now deprecated so each needs its own index).
You can use the fields feature (multi-fields), which is available in Elasticsearch 7.6 and allows you to store the different languages in a single index; at query time you can simply target the subfield for the language you want to query.
In fact, there is a nice official blog post from Elastic talking about different approaches to multilingual search, and the approach given here is inspired by the one it calls per-field language search.
Example
A sample index mapping would look like the one below:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "fr": {
            "type": "text",
            "analyzer": "french"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "estonian": {
            "type": "text",
            "analyzer": "estonian"
          }
        }
      }
    }
  }
}
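At query time you then target only the subfield that matches the user's locale, something like this (the index name and query text are made up):

GET /my-multilingual-index/_search
{
  "query": {
    "match": {
      "title.fr": "exemple de recherche"
    }
  }
}

A French query goes against title.fr (french analyzer), a Spanish one against title.es, and so on, while the root title field keeps english as its default analyzer.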

How to handle auto completion on multi word text?

My input text is multi-word English text, and I have the requirement to implement an autocompletion feature for that text.
I initially looked at search completion suggesters, only to figure out that those can only match the first characters of the input. This is fine for autocompletion of product names or addresses, but not very useful when autocompletion is required on any word in the input text.
After that I set up an edge_ngram analyzer and query to locate those documents which contain the input string. That works just fine, but I don't know how to use this information to provide options for my autocompletion.
I could use a highlighter in order to show the words which match the query. That data could in turn be used to build a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved?
I'm unfortunately not able to maintain another field which could include the auto completion options for the documents.
I'm currently using highlight information of the query in order to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": [ "uuid", "language" ]
}
My auto field used the autocomplete analyzer.
"auto": {
"type": "string",
"analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
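As a side note, instead of passing "analyzer": "standard" in every query as I do above, the same split can, as far as I know, be declared once in the mapping with search_analyzer, so the edge_ngram filter only runs at index time:

"auto": {
  "type": "string",
  "analyzer": "autocomplete",
  "search_analyzer": "standard"
}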
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all found tokens. These tokens are next used to construct the potential autocomplete phrase by also comparing it to the phrase that the user has already entered. The neat thing is that a stop word filter can be applied and thus stopwords will never be highlighted and in turn never be used for autocomplete suggestions.
A proof-of-concept Java implementation of this processor can be found here.
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
I think your best option is to create a dedicated index for storing just the suggestions, using the edge_ngram analyzer. If you use the completion suggester you need to explicitly define your actual suggestions anyway. The completion suggester is also document-centric in ES 5.x, so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but that has only just been released.
If you have a dedicated suggestion index you can use a hash of the suggestion as a document ID to avoid duplicates. You can start indexing document titles and other useful meta data as suggestions. Later on you could include historical searches entered by users that are seen as successful due to the user ultimately clicking on or purchasing the returned results.
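A rough sketch of such a dedicated suggestion index might look like the following (the index name, field name and example suggestion are made up, the mapping uses the 7.x typeless style, and the document ID is meant to be a hash of the suggestion text):

PUT /suggestions
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "suggestion": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

PUT /suggestions/_doc/<hash-of-suggestion-text>
{
  "suggestion": "content compression"
}

Because the document ID is derived from the suggestion text, re-indexing the same suggestion simply overwrites the existing document instead of creating a duplicate.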

Is Simple Query Search compatible with shingles?

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": [ "text_2" ],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document NOT to match, since the above query should produce a shingle of "porsches small red", which is a different order. However, when I look at the match explanation I only see the single-word tokens "red", "small" and "porsche", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well, but the small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query only contains one single word (e.g. porsches), that search analyzer can still emit the lone unigram if you set output_unigrams_if_no_shingles to true (it defaults to false), so the query would still match your document. If that's not desired, just leave output_unigrams_if_no_shingles at its default of false in your shingle search analyzer.
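To make this concrete, here is a minimal sketch of that index-time/search-time split, sticking with the pre-5.x string type from the question (the analyzer and filter names are just examples):

"filter": {
  "filter_shingle": {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 5,
    "output_unigrams": true
  },
  "filter_shingle_search": {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 5,
    "output_unigrams": false,
    "output_unigrams_if_no_shingles": true
  }
},
"analyzer": {
  "shingle_index_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "lowercase", "filter_shingle" ]
  },
  "shingle_search_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "lowercase", "filter_shingle_search" ]
  }
}

and in the mapping:

"text_2": {
  "type": "string",
  "analyzer": "shingle_index_analyzer",
  "search_analyzer": "shingle_search_analyzer"
}

With this setup the search analyzer emits only shingles (plus the lone unigram for single-word queries, thanks to output_unigrams_if_no_shingles), so "porsches small red" with the and operator should no longer match a document that only contains "small red porsches".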
