How to handle auto completion on multi word text? - elasticsearch

My input is multi-word English text, and I need to implement an autocompletion feature for it.
I initially looked at the completion suggesters, only to find that they can only match the first characters of the input. This is fine for autocompleting product names or addresses, but not very useful when autocompletion is required on any word in the input text.
After that I set up an edge_ngram analyzer and a query to locate the documents that contain the input string. That works just fine, but I don't know how to use this information to provide options for my autocompletion.
I could use a highlighter to show the words that match the query. That data could in turn be used to build a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved.
Unfortunately, I'm not able to maintain another field that could hold the autocompletion options for the documents.

I'm currently using highlight information of the query in order to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": ["uuid", "language"]
}
My auto field uses the autocomplete analyzer.
"auto": {
  "type": "string",
  "analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all tokens that were found. These tokens are then used to construct the potential autocomplete phrase, taking into account the phrase that the user has already entered. The neat thing is that a stop word filter can be applied, so stop words will never be highlighted and in turn never be used for autocomplete suggestions.
Proof-of-concept Java code for this processor can be found here.
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
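The highlight post-processing described above could be sketched roughly like this in Python. This is not the linked Java PoC, just a minimal illustration: it assumes the response has the usual hits.hits[].highlight layout, that each highlighted span wraps a whole word, and that the %ha% / %he% tags from the query are used. The helper name extract_options is made up for this example.

```python
import re

# Tag strings taken from the "highlight" section of the query above.
PRE_TAG, POST_TAG = "%ha%", "%he%"
HIGHLIGHT_RE = re.compile(re.escape(PRE_TAG) + r"(.*?)" + re.escape(POST_TAG))


def extract_options(response, typed, field="fields.content.auto"):
    """Build autocomplete options from highlighted tokens (hypothetical helper)."""
    words = typed.split()
    if not words:
        return []
    partial = words[-1].lower()                  # the word still being typed
    completed = {w.lower() for w in words[:-1]}  # words already fully typed
    prefix = " ".join(words[:-1])
    options = []
    for hit in response.get("hits", {}).get("hits", []):
        for fragment in hit.get("highlight", {}).get(field, []):
            for token in HIGHLIGHT_RE.findall(fragment):
                token = token.lower()
                # Skip words the user already typed; keep only tokens that
                # extend the partial word.
                if token in completed or token == partial:
                    continue
                if token.startswith(partial):
                    option = f"{prefix} {token}".strip()
                    if option not in options:
                        options.append(option)
    return options
```

Skipping tokens that the user has already typed is a design choice of this sketch; without it, "content co" would also offer "content content".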

I think your best option is to create a dedicated index for storing just the suggestions using the edge_ngram analyzer. If you use the completion suggesters you need to explicitly define your actual suggestions anyway. The completion suggester is also document centric in ES 5.x so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but that has only just been released.
If you have a dedicated suggestion index, you can use a hash of the suggestion as the document ID to avoid duplicates. You can start by indexing document titles and other useful metadata as suggestions. Later on you could include historical searches entered by users that are considered successful because the user ultimately clicked on or purchased the returned results.
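A minimal sketch of the hash-as-ID idea, assuming SHA-1 over a lowercased, whitespace-normalized suggestion. The field name "suggestion" and the helper name are placeholders, not anything prescribed by Elasticsearch.

```python
import hashlib


def suggestion_doc(text):
    """Return (doc_id, body) for one suggestion (hypothetical helper).

    The ID is a hash of the normalized text, so indexing the same
    suggestion twice overwrites the document instead of duplicating it.
    """
    normalized = " ".join(text.lower().split())
    doc_id = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return doc_id, {"suggestion": normalized}
```

The returned ID would be passed as the document ID when indexing into the suggestion index, so repeated titles or repeated user searches collapse into one document.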

Related

Elasticsearch - can I define index time analyzer on document level?

I want to index pages in multiple languages into a single index, but for each language I need to define a custom language analyzer. So for an English page it would use the english analyzer, and for a Czech page it would use the czech analyzer.
At search time I would set the correct analyzer based on the current locale, as I do not need to search across languages.
It appears that this was possible in early versions of Elasticsearch, but I cannot find a way to do it in 7.6.
Is there a way to achieve this, or do I really need to create an index for each type in each language? That would lead to many indices with only a small number of indexed documents each.
Or is there a better way to handle this scenario? We are considering about 20 languages and several document types (as far as I understand, types are now deprecated, so each needs its own index).
You can use the fields feature (multi-fields), which is available in Elasticsearch 7.6. It lets you store the different languages in a single index, and at query time you can query just the sub-field for the language you want.
In fact, there is a nice official blog post from Elastic discussing different approaches to multilingual search; the approach given here is inspired by it and is called per-field based language search.
Example
A sample index mapping would look like this:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "fr": {
            "type": "text",
            "analyzer": "french"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "estonian": {
            "type": "text",
            "analyzer": "estonian"
          }
        }
      }
    }
  }
}
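At query time, picking the sub-field for the current locale could look roughly like this. The sub-field names come from the sample mapping above; the locale codes and the helper name are assumptions made for illustration.

```python
# Sub-field names follow the sample mapping above; the locale keys are hypothetical.
SUBFIELD_BY_LOCALE = {
    "en": "title",
    "fr": "title.fr",
    "es": "title.es",
    "et": "title.estonian",
}


def build_title_query(text, locale):
    """Build a match query against the sub-field for the given locale."""
    field = SUBFIELD_BY_LOCALE.get(locale, "title")  # fall back to the English field
    return {"query": {"match": {field: {"query": text}}}}
```

A match query analyzes the query text with the analyzer of the targeted field unless told otherwise, so no analyzer needs to be passed explicitly. Keep in mind that with multi-fields every title is indexed once per sub-field, regardless of the page's actual language.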

how to remove a stop word from default _english_ stop-words list in elasticsearch?

I am filtering the text using the default English stop words. I found that 'and' is a stop word in English, but I need to search for results containing 'and'. I just want to remove the word 'and' from this default English stop-word filter and keep using the other stop words as usual. My Elasticsearch schema looks similar to the one below.
"settings": {
  "analysis": {
    "analyzer": {
      "default": {
        "tokenizer": "whitespace",
        "filter": ["stop_english"]
      }
    }....,
    "filter": {
      "stop_english": {
        "type": "stop",
        "stopwords": "_english_"
      }
    }
  }
}
I expect documents containing the word AND to be returned by the _search API.
You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you manually set the same list of stop words minus "and" in a new index and reindex your data into it, you should be good to go!
Regarding reindexing your data, you should check out the reindex API. I believe it is required because your data is tokenized at ingestion time, so you need to redo the ingestion by reindexing. It is required most of the time when changing index settings or making some mapping changes (not 100% sure, but I think it makes sense).
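Building "the default list minus and" could be sketched like this. The word list below is Lucene's default English stop set as I remember it, so please verify it against the linked list before relying on it; the filter name stop_english is taken from the question, and the helper name is made up.

```python
# Lucene's default English stop words as recalled here (an assumption);
# check against the official list before use.
ENGLISH_STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]


def stop_filter_settings(keep=("and",), filter_name="stop_english"):
    """Return analysis settings with a stop filter that leaves `keep` words alone."""
    stopwords = [w for w in ENGLISH_STOP_WORDS if w not in keep]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    filter_name: {"type": "stop", "stopwords": stopwords}
                }
            }
        }
    }
```

The resulting dictionary only defines the filter; an analyzer still has to reference it, as the default analyzer in the question does.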

Query elasticsearch to make all analyzed ngram tokens to match

I indexed some data using a nGram analyzer (which emits only tri-grams), to solve the compound words problem exactly as described at the ES guide.
However, this doesn't work as expected: the corresponding match query returns all documents where at least one nGram token (per word) matched.
Example:
Let's take these two indexed documents with a single field, using that nGram analyzer:
POST /compound_test/doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "elasticsearch is awesome" }
{ "index": { "_id": 2 }}
{ "content": "some search queries don't perform good" }
Now if I run the following query, I get both results:
"match": {
"content": {
"query": "awesome search",
"minimum_should_match": "100%"
}
}
The query that is constructed from this could be expressed like this:
(awe OR wes OR eso OR som OR ome) AND (sea OR ear OR arc OR rch)
That's why the second document matches (it contains "some" and "search"). It would even match a document with words that contain the tokens "som" and "rch".
What I actually want is a query where each analyzed token must match (ideally subject to minimum_should_match), so something like this:
"match": {
  "content": {
    "query": "awe wes eso som ome sea ear arc rch",
    "analyzer": "whitespace",
    "minimum_should_match": "100%"
  }
}
...without actually creating that query by hand or pre-analyzing it on the client side.
All settings and data to reproduce that behavior can be found at https://pastebin.com/97QxfaSb
Is there such a possibility?
While writing the question, I accidentally found the answer:
If the ngram analyzer uses an ngram filter to generate trigrams (as described in the guide), it behaves as described above, i.e. one matching trigram per word is enough. (I guess this is because the actual tokens are not the single ngrams but the combination of all created ngrams.)
To achieve the wanted behavior, the analyzer must use the ngram tokenizer:
"tokenizer": {
  "trigram_tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  }
},
"analyzer": {
  "trigrams_with_tokenizer": {
    "type": "custom",
    "tokenizer": "trigram_tokenizer"
  }
}
Producing tokens this way gives the desired result when querying that field.
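The difference between the two behaviors can be simulated outside Elasticsearch. The sketch below is plain Python, not the actual query execution: it splits on whitespace, builds trigrams per word, and compares "at least one trigram per query word" (what the question observed with the ngram filter) against "every trigram must match" (the wanted behavior). All function names are made up for the illustration.

```python
def trigrams(word):
    """All 3-character substrings of a word, in order (empty for short words)."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def doc_trigrams(text):
    """Set of trigrams over all whitespace-separated words of a document."""
    return {gram for word in text.lower().split() for gram in trigrams(word)}


def matches_per_word_or(query, doc):
    """True if each query word shares at least one trigram with the document."""
    grams = doc_trigrams(doc)
    groups = [trigrams(word) for word in query.lower().split()]
    return all(any(g in grams for g in group) for group in groups if group)


def matches_all_trigrams(query, doc):
    """True only if every trigram of every query word occurs in the document."""
    grams = doc_trigrams(doc)
    return all(g in grams for word in query.lower().split() for g in trigrams(word))
```

With the two sample documents, "awesome search" matches both under the per-word OR rule, but only "elasticsearch is awesome" when every trigram is required.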

Auto Suggestions in Elastic Search after 3 letters

I've a search query which does basic search after a complete word is typed in. I'm looking for auto suggestions after 3 letters.
For Example,
Title- samsung galaxy s4
I want to see auto suggestions after "sam" instead of complete word "samsung".
While the ngram filter works, there is a dedicated suggester for this use case, the completion suggester. It uses a different internal data structure that allows suggestions to be served in the millisecond range, which makes it much faster than a regular query using edge n-grams. Check out the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-suggesters-completion.html
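For orientation, a completion suggester setup consists of a field of type completion plus a suggest request against it. The sketch below only builds the two JSON bodies as Python dictionaries; the field name "suggest", the suggester name "title-suggest" and the helper names are placeholders, and the mapping fragment still has to be placed inside your index mapping as your Elasticsearch version requires.

```python
def completion_mapping(field="suggest"):
    """Mapping fragment declaring a completion field (field name is a placeholder)."""
    return {"properties": {field: {"type": "completion"}}}


def completion_request(prefix, field="suggest", name="title-suggest", size=5):
    """Suggest request body asking for completions of `prefix`."""
    return {
        "suggest": {
            name: {
                "prefix": prefix,
                "completion": {"field": field, "size": size},
            }
        }
    }
```

For example, completion_request("sam") would ask for up to five suggestions whose stored input starts with "sam".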
You need to use an edgeNGram tokenizer for this.
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edgeNGram",
        "min_gram": 3,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete_edge_ngram": {
        "filter": ["lowercase"],
        "type": "custom",
        "tokenizer": "autocomplete_tokenizer"
      }
    }
  }
}
and the mapping will be
{
  "title_edge_ngram": {
    "type": "text",
    "analyzer": "autocomplete_edge_ngram",
    "search_analyzer": "standard"
  }
}
Alternatively, you can use the completion suggester in Elasticsearch.
The three-character check itself has to be done on the client side.
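A client-side length check could be as simple as the sketch below. It assumes the title_edge_ngram field from the mapping above; the helper name and the choice of a match query with the "and" operator are illustrative, not the only option.

```python
MIN_CHARS = 3  # do not query Elasticsearch before this many characters are typed


def autocomplete_query(typed, field="title_edge_ngram"):
    """Return a search body for the typed text, or None if it is too short."""
    text = typed.strip()
    if len(text) < MIN_CHARS:
        return None  # caller skips the request entirely
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}
```

The caller only sends a request when the function returns a body, so "sa" triggers nothing while "sam" produces a query that can match "samsung galaxy s4".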

Elasticsearch "simple_query_string" vs. "query_string" field analysis bug?

Recently we discovered that, since we aren't sanitizing search terms as they come into our system, we would get occasional parsing exceptions in Elasticsearch when special characters such as / (forward slash) , etc. were used w/ "query_string". So, we decided to switch to "simple_query_string". However, we discovered that the same analyzers do not appear to be used for each. I reviewed When Analyzers Are Used to see if it indicated there would be a difference between simple and regular query string but it did not, so I'm wondering if this is a bug. For example:
"query_string": { "query": "sales", "fields": [ "title" ] }
will use the analyzer for the "title" field which is our "en_analyzer" (see definition below) and properly stem "sales" to "sale" and find the matching documents. Simply changing "query_string" to "simple_query_string" will not. We have to search for "sale" or add an analyzer to the query, like so:
"simple_query_string": { "query": "sales", "fields": [ "title" ], "analyzer": "en_analyzer" }
Of course, not all our fields are analyzed the same way and so the default behavior described in the documentation I referenced above makes perfect sense and that's what we desire. Is this a bug or does "simple_query_string" just not behave the same way w/ respect to field analysis during a query? We are using ES 1.7.2.
The relevant parts of our definition for "en_analyzer" are:
"en_analyzer": { "type": "custom", "tokenizer": "icu_tokenizer", "filter": [ "icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding", "shingle_filter" ], "char_filter": [ "html_strip" ] }
with:
"en_stop_filter": { "type": "stop", "stopwords": [ "_english_" ] }, "en_stem_filter": { "type": "stemmer", "name": "minimal_english" }
Link to my same question on Github ... though I edited this one better after I asked on Github first. So far no response there.
In 1.7.2, simple_query_string uses the standard analyzer by default when none is specified, and it won't use any search analyzer defined on the field being searched. When the documentation doesn't say, one should turn to the ultimate source of knowledge, i.e. the source code. In SimpleQueryStringParser.java, the class comment states:
analyzer: analyzer to be used for analyzing tokens to determine which kind of query they should be converted into, defaults to "standard"
And a bit further down in the same class, we can read:
Use standard analyzer by default
And that behavior hasn't changed in the ES 2.x releases. As can be seen in the source code for SimpleQueryStringBuilder.java, if no analyzer is specified in the query, then the standard analyzer is used.
Quoting a comment from the source linked above:
Use standard analyzer by default if none specified
So to answer your question, that's not a bug, but the intended behavior.
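Since the analyzer has to be passed explicitly and not all fields are analyzed the same way, one possible workaround is to group fields by analyzer and emit one simple_query_string clause per group. This is only a sketch: the field-to-analyzer map is something you would maintain yourself, and the names "body" and "fr_analyzer" in the usage note are hypothetical.

```python
from collections import defaultdict


def simple_query(text, field_analyzers):
    """Build simple_query_string clauses, one per analyzer (hypothetical helper).

    `field_analyzers` maps field name -> analyzer name.
    """
    by_analyzer = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        by_analyzer[analyzer].append(field)
    clauses = [
        {
            "simple_query_string": {
                "query": text,
                "fields": sorted(fields),
                "analyzer": analyzer,
            }
        }
        for analyzer, fields in sorted(by_analyzer.items())
    ]
    if len(clauses) == 1:
        return clauses[0]
    # Any one of the per-analyzer clauses matching is enough.
    return {"bool": {"should": clauses, "minimum_should_match": 1}}
```

For instance, simple_query("sales", {"title": "en_analyzer", "body": "fr_analyzer"}) yields a bool query with two should clauses, each carrying its own analyzer.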

Resources