Multi-Term Auto Completion Elasticsearch

I am using Elasticsearch 7.2.0 and I want to create search suggestions.
For example, I have these 3 movie titles:
Avengers: Infinity War
Avengers: Infinity War Part 2
Avengers: Age of Ultron
When I type "aven", it should return suggestions like:
avengers
avengers infinity
avengers age
When I type "avengers inf":
avengers infinity war
avengers infinity war part 2
After lots of Elasticsearch tutorials, I have ended up with this:
PUT movies
{
"settings": {
"index": {
"analysis": {
"filter": {},
"analyzer": {
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
},
"completion_analyzer": {
"tokenizer": "keyword",
"filter": "lowercase"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
}
},
"analyzer": "standard"
},
"completion_terms": {
"type": "text",
"fielddata": true,
"analyzer": "completion_analyzer"
}
}
}
}
with the following docs:
POST movies/_doc/1
{
"name": "Spider-Man: Homecoming",
"completion_terms": [
"spider-man",
"homecomming"
]
}
POST movies/_doc/2
{
"name": "Ant-man and the Wasp",
"completion_terms": [
"ant-man",
"and",
"the",
"wasp"
]
}
POST movies/_doc/3
{
"name": "Avengers: Infinity War Part 2",
"completion_terms": [
"avangers",
"infinity",
"war",
"part",
"2"
]
}
POST movies/_doc/4
{
"name": "Captain Marvel",
"completion_terms": [
"captain",
"marvel"
]
}
POST movies/_doc/5
{
"name": "Black Panther",
"completion_terms": [
"black",
"panther"
]
}
POST movies/_doc/6
{
"name": "Avengers: Infinity War",
"completion_terms": [
"avangers",
"infinity",
"war"
]
}
POST movies/_doc/7
{
"name": "Thor: Ragnarok",
"completion_terms": [
"thor",
"ragnarok"
]
}
POST movies/_doc/8
{
"name": "Guardians of the Galaxy Vol 2",
"completion_terms": [
"guardians",
"of",
"the",
"galaxy",
"vol",
"2"
]
}
POST movies/_doc/9
{
"name": "Doctor Strange",
"completion_terms": [
"doctor",
"strange"
]
}
POST movies/_doc/10
{
"name": "Captain America: Civil War",
"completion_terms": [
"captain",
"america",
"civil",
"war"
]
}
POST movies/_doc/11
{
"name": "Ant-Man",
"completion_terms": [
"ant-man"
]
}
POST movies/_doc/12
{
"name": "Avengers: Age of Ultron",
"completion_terms": [
"avangers",
"age",
"of",
"ultron"
]
}
POST movies/_doc/13
{
"name": "Guardians of the Galaxy",
"completion_terms": [
"guardians",
"of",
"the",
"galaxy"
]
}
POST movies/_doc/14
{
"name": "Captain America: The Winter Soldier",
"completion_terms": [
"captain",
"america",
"the",
"winter",
"solider"
]
}
POST movies/_doc/15
{
"name": "Thor: The Dark World",
"completion_terms": [
"thor",
"the",
"dark",
"world"
]
}
POST movies/_doc/16
{
"name": "Iron Man 3",
"completion_terms": [
"iron",
"man",
"3"
]
}
POST movies/_doc/17
{
"name": "Marvel’s The Avengers",
"completion_terms": [
"marvels",
"the",
"avangers"
]
}
POST movies/_doc/18
{
"name": "Captain America: The First Avenger",
"completion_terms": [
"captain",
"america",
"the",
"first",
"avanger"
]
}
POST movies/_doc/19
{
"name": "Thor",
"completion_terms": [
"thor"
]
}
POST movies/_doc/20
{
"name": "Iron Man 2",
"completion_terms": [
"iron",
"man",
"2"
]
}
POST movies/_doc/21
{
"name": "The Incredible Hulk",
"completion_terms": [
"the",
"incredible",
"hulk"
]
}
POST movies/_doc/22
{
"name": "Iron Man",
"completion_terms": [
"iron",
"man"
]
}
And the query:
POST movies/_search
{
"suggest": {
"movie-suggest-fuzzy": {
"prefix": "avan",
"completion": {
"field": "name.completion",
"fuzzy": {
"fuzziness": 1
}
}
}
}
}
My query returns the full title, not pieces.

It's a great question, and it shows you have done a lot of research to get it to work, but you are unnecessarily making it complex by trying to handle it completely in ES. I had exactly the same use case and solved it with a combination of application-side logic and ES.
What you actually need is a match query on the first (n-1) terms and a prefix query on the nth search term. In the case of aven, as it is both the first and the nth term, the prefix query runs on it; in the case of the search term avengers inf, avengers goes into the match query and the prefix query runs on inf.
I indexed the documents you provided and tried both of the search terms mentioned, and it works:
Index creation
{
"mappings": {
"properties": {
"name": {
"type": "text"
}
}
}
}
Index 3 docs
{
"name" : "Avengers: Age of Ultron"
},
{
"name" : "Avengers: Infinity War Part 2"
},
{
"name" : "Avengers: Infinity War"
}
Search query
{
"query": {
"bool": {
"must": [
{
"match": { --> Note match queries on (n-1) terms
"name": "avengers"
}
},
{
"prefix": { --> Prefix query on nth term
"name": "ag"
}
}
]
}
}
}
Basically, in your application code you need to split the search term on whitespace and then construct the bool query with a match clause on the first (n-1) terms and a prefix query on the nth term.
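For example, for the search term avengers inf, the query your application would build looks like this:
POST movies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "avengers"
          }
        },
        {
          "prefix": {
            "name": "inf"
          }
        }
      ]
    }
  }
}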
Note that you don't even need the edge n-gram analyzer and other complex things at index time, which saves a lot of space in your index. You might, however, want to put a character limit on the prefix query, as it can be costly when searching millions of docs: unlike a match query, it is not a token-to-token match.

Related

Elasticsearch - nested types vs collapse/aggs

I have a use case where I need to find the latest data based on some fields.
The fields are:
category.name
category.type
createdAt
For example: search for the newest data where category.name = 'John G.' AND category.type = 'A'. I expect the data with ID = 1 where it matches the criteria and is the newest one based on createdAt field ("createdAt": "2022-04-18 19:09:27.527+0200")
The problem is that category.* is a nested field and I can't aggs/collapse these fields because ES doesn't support it.
Mapping:
PUT data
{
"mappings": {
"properties": {
"createdAt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZ"
},
"category": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"analyzer": "keyword"
}
}
},
"approved": {
"type": "text",
"analyzer": "keyword"
}
}
}
}
Data:
POST data/_create/1
{
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "no"
}
POST data/_create/2
{
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "Max",
"createdAt": "2022-04-10 10:09:27.527+0200",
"approved": "no"
}
POST data/_create/3
{
"category": [
{
"name": "Rick J.",
"level": "B"
}
],
"createdBy": "Rick",
"createdAt": "2022-03-02 02:09:27.527+0200",
"approved": "no"
}
I'm looking for either a search query that can handle that in an acceptable performant way, or a new object design without nested type where I could take advantage of aggs/collapse feature.
Any suggestion will be really appreciated.
About your first question,
For example: search for the newest data where category.name = 'John G.' AND category.type = 'A'. I expect the data with ID = 1 where it matches the criteria and is the newest one based on createdAt field ("createdAt": "2022-04-18 19:09:27.527+0200")
I believe you can do something along those lines:
GET /72088168/_search
{
"query": {
"nested": {
"path": "category",
"query": {
"bool": {
"must": [
{
"match": {
"category.name": "John G."
}
},
{
"match": {
"category.level": "A"
}
}
]
}
}
}
},
"sort": [
{
"createdAt": {
"order": "desc"
}
}
],
"size":1
}
For the 2nd matter, it really depends on what you are aiming to do. You could merge category.name and category.level into the same field, so that your document would look like:
{
"category": ["John G. A","Chris T. A"],
"createdBy": "Max",
"createdAt": "2022-04-10 10:09:27.527+0200",
"approved": "no"
}
No more nested type needed, although I agree it feels like using tape to fix your issue.
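For example (a minimal sketch, assuming the merged category field is remapped as keyword):
GET data/_search
{
  "query": {
    "term": {
      "category": "John G. A"
    }
  },
  "sort": [
    {
      "createdAt": {
        "order": "desc"
      }
    }
  ],
  "size": 1
}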

ElasticSearch Keyword usage with a prefix search

I have a requirement to be able to search a sentence as complete or with prefix. The UI library (reactive search) I am using is generating the query in this way:
"simple_query_string": {
"query": "\"Louis George Maurice Adolphe\"",
"fields": [
"field1",
"field2",
"field3"
],
"default_operator": "or"
}
I am expecting it to return results for e.g.
Louis George Maurice Adolphe (Roche)
but NOT records containing only partial terms like Louis or George.
Currently, I have code like this, but it only returns the record if I search with the complete phrase Louis George Maurice Adolphe (Roche), not with a prefix like Louis George Maurice Adolphe.
{
"settings": {
"analysis": {
"char_filter": {
"space_remover": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
},
"normalizer": {
"lower_case_normalizer": {
"type": "custom",
"char_filter": [
"space_remover"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"field3": {
"type": "keyword",
"normalizer": "lower_case_normalizer"
}
}
}
}
}
Any guidance on the above is appreciated. Thanks.
You are not using a prefix query, hence you are not getting results for prefix search terms. I used the same mapping and sample doc but changed the search query, which gives the expected results.
Index mapping
{
"settings": {
"analysis": {
"char_filter": {
"space_remover": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
},
"normalizer": {
"lower_case_normalizer": {
"type": "custom",
"char_filter": [
"space_remover"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"field3": {
"type": "keyword",
"normalizer": "lower_case_normalizer"
}
}
}
}
Indexed sample doc
{
"field3" : "Louis George Maurice Adolphe (Roche)"
}
Search query
{
"query": {
"prefix": {
"field3": {
"value": "Louis George Maurice Adolphe"
}
}
}
}
Search result
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"field3": "Louis George Maurice Adolphe (Roche)"
}
}
]
The underlying issue stems from the fact that you're applying a whitespace remover. What this practically means is that when you ingest your docs:
GET your_index_name/_analyze
{
"text": "Louis George Maurice Adolphe (Roche)",
"field": "field3"
}
they're indexed as
{
"tokens" : [
{
"token" : "louisgeorgemauriceadolphe(roche)",
"start_offset" : 0,
"end_offset" : 36,
"type" : "word",
"position" : 0
}
]
}
So if you intend to use simple_query_string, you may want to rethink your normalizers.
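For instance, here is a variant of the same normalizer without the space remover, if you'd rather keep whitespace in the indexed keyword (a sketch, not a drop-in fix for the phrase query):
"normalizer": {
  "lower_case_normalizer": {
    "type": "custom",
    "filter": [
      "lowercase"
    ]
  }
}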
@Ninja's answer fails when you search for George Maurice Adolphe, i.e. when there is no prefix intersection.

Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with Edge NGram Tokenizer:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
"query": {
"match": {
"name": {
"query": "HI"
}
}
}
}
yields all three with the same score, or TRENDING HI with a higher score than the others.
How can it be configured to give a higher score to entries that actually start with the searched n-gram? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.
Highlighter is also used, so the given solution shouldn't mess it up.
The highlighter used in query is:
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {}
}
}
Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
You must understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html will show you what Elasticsearch stores; in your case:
T / TR / TRE / ... / TRENDING and H / HI
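You can reproduce this token list with the Analyze API against the mapping above (substitute your index name):
GET your_index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "TRENDING HI"
}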
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex queries for particular use cases. Use must to filter documents, then should to score. A common pattern is to use different analyzers on the same field (with multi-fields in the mapping, you can analyze the same field in different ways).
3. Don't mess up the highlighting
According to the doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra query:
{
"query": {
"bool": {
"must" : [
{
"match": {
"name": "HI"
}
}
],
"should": [
{
"prefix": {
"name": "HI"
}
}
]
}
},
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {
"highlight_query": {
"match": {
"name": "HI"
}
}
}
}
}
}
In this particular case you could add a match_phrase_prefix clause to your query, which performs a prefix match on the last term in the text:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": "HI"
}
},
{
"match_phrase_prefix": {
"name": "HI"
}
}
]
}
}
}
The match term will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
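For example, making that option explicit in the query above (note that in a bool query with only should clauses, minimum_should_match already defaults to 1):
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}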
A possible solution for this problem is to use multifields. They allow for indexing of the same data from your source document in different ways. In your case you could index the name field as default text, then as ngrams and also as edgengrams. Then the query would have to be a bool query comparing with all those different fields.
The final score of documents is composed of the match value for each one. Those matches are also called signals, signalling that there is a match between the query and the document. The document with most signals matching gets the highest score.
In your case all documents would match the ngram HI. But only the HITS FIND SOME and the HITS OTHER document would get the edgengram additional score. This would give those two documents a boost and put them on top. The complication with this is that you have to make sure that the edgengram doesn't split on whitespaces, because then the HI at the end would get the same score as in the beginning of the document.
Here is an example mapping and query for your case:
PUT /tag/
{
"settings": {
"analysis": {
"analyzer": {
"edge_analyzer": {
"tokenizer": "edge_tokenizer"
},
"kw_analyzer": {
"tokenizer": "kw_tokenizer"
},
"ngram_analyzer": {
"tokenizer": "ngram_tokenizer"
},
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"kw_tokenizer": {
"type": "keyword"
},
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
},
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
},
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"edge": {
"type": "text",
"analyzer": "edge_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
And a query:
POST /tag/_search
{
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"match": {
"name.edge": {
"query": "HI"
}
}
},
"boost": "5",
"boost_mode": "multiply"
}
},
{
"match": {
"name.ngram": {
"query": "HI"
}
}
},
{
"match": {
"name": {
"query": "HI"
}
}
}
]
}
}
}

Query in ElasticSearch to match part of the word

I am trying to write a query in ElasticSearch which matches contiguous characters in the words. So, if my index has "John Doe", I should still see "John Doe" returned by Elasticsearch for the below searches.
john doe
john do
ohn do
john
n doe
I have tried the below query so far.
{
"query": {
"multi_match": {
"query": "term",
"operator": "OR",
"type": "phrase_prefix",
"max_expansions": 50,
"fields": [
"Field1",
"Field2"
]
}
}
}
But this also returns unnecessary matches, e.g. I still get "John Doe" when I type john x.
As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows, since they force ES to do full index scans. I'm still convinced that ngrams (more precisely edge-ngrams) are the way to go, so I'm taking a stab at it below.
The idea is to index all the suffixes of the input and then use a prefix query, since searching by prefix doesn't suffer the same performance issues as searching by suffix. So john doe would be indexed as follows:
john doe
ohn doe
hn doe
n doe
doe
oe
e
That way, using a prefix query we can match any sub-part of those tokens which effectively achieves the goal of matching partial contiguous words while at the same time ensuring good performance.
The definition of the index would go like this:
PUT my_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"reverse",
"suffixes",
"reverse"
]
}
},
"filter": {
"suffixes": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
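To verify what gets indexed, you can run the analyzer against a sample input with the Analyze API:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john doe"
}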
Then we can index a sample document:
PUT my_index/doc/1
{
"name": "john doe"
}
And finally all of the following searches will return the john doe document:
POST my_index/_search
{
"query": {
"prefix": {
"name": "john doe"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "john do"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "ohn do"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "john"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "n doe"
}
}
}
This is what worked for me. Instead of an ngram, index your data as keyword, for example:
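A minimal mapping sketch (index and field names assumed from the question):
PUT my_index
{
  "mappings": {
    "properties": {
      "Field1": { "type": "keyword" },
      "Field2": { "type": "keyword" }
    }
  }
}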
Then use a wildcard query to match the words:
"query": {
"bool": {
"should": [
{
"wildcard": { "Field1": "*" + term + "*" }
},
{
"wildcard": { "Field2": "*" + term + "*" }
}
],
"minimum_should_match": 1
}
}
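For the search term ohn do, the resulting query would be as below. Note that wildcard queries on a plain keyword field are case-sensitive, so you may want to lowercase both the indexed values (e.g. with a lowercase normalizer) and the search term:
POST my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "Field1": "*ohn do*" } },
        { "wildcard": { "Field2": "*ohn do*" } }
      ],
      "minimum_should_match": 1
    }
  }
}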
Here is an updated fix, with more tokenizer options. Create the index with:
body = {
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
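Assuming the index is created under the name my_index, indexing and searching would then look like this (in Kibana console syntax):
PUT my_index/_doc/1
{
  "title": "John Doe"
}

POST my_index/_search
{
  "query": {
    "match": {
      "title": "john do"
    }
  }
}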

Using Shingles and Stop words with Elasticsearch and Lucene 4.4

In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:
{
"settings": {
"analysis": {
"analyzer": {
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer"
]
}
},
"filter": {
"custom_stemmer" : {
"type": "stemmer",
"name": "english"
},
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "3"
}
}
}
}
}
The major issue is that, with Lucene 4.4, stop filters no longer support the enable_position_increments parameter to eliminate shingles that contain stop words. Instead, I'd get results like:
"red and yellow"
"terms": [
{
"term": "red",
"count": 43
},
{
"term": "red _",
"count": 43
},
{
"term": "red _ yellow",
"count": 43
},
{
"term": "_ yellow",
"count": 42
},
{
"term": "yellow",
"count": 42
}
]
Naturally this GREATLY skews the number of shingles returned. Is there a way post-Lucene 4.4 to manage this without doing post-processing on the results?
Probably not the optimal solution, but the bluntest one would be to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer",
"kill_fillers"
],
...
Add "kill_fillers" filter to your list of filters:
"filters":{
...
"kill_fillers": {
"type": "pattern_replace",
"pattern": ".*_.*",
"replace": "",
},
...
}
I'm not sure if this helps, but in the Elasticsearch definition of shingles you can use the parameter filler_token, which is "_" by default. Set it to, for example, an empty string:
$indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = "";
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html
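In JSON form, for this question's settings, that corresponds to setting filler_token directly on the shingle filter (a sketch based on the linked docs):
"custom_shingle": {
  "type": "shingle",
  "min_shingle_size": "2",
  "max_shingle_size": "3",
  "filler_token": ""
}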
