Auto Suggestions in Elastic Search after 3 letters - elasticsearch

I have a search query which does a basic search after a complete word is typed in. I'm looking for auto suggestions after 3 letters.
For example:
Title: samsung galaxy s4
I want to see auto suggestions after "sam" instead of the complete word "samsung".

While the ngram filter works, there is a dedicated suggester for this use case, called the completion suggester. It uses another data structure internally, which allows you to execute suggestions in the millisecond range, thus being much faster than a regular query using edge n-grams. Check out the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-suggesters-completion.html
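As a rough sketch of that approach (the index name products, type product, and field title_suggest are made-up names for illustration), you map a completion field and then query it through the suggest API:

PUT /products
{
  "mappings": {
    "product": {
      "properties": {
        "title_suggest": {
          "type": "completion"
        }
      }
    }
  }
}

POST /products/product/1
{"title_suggest": ["samsung galaxy s4"]}

POST /products/_search
{
  "suggest": {
    "title-suggest": {
      "prefix": "sam",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}

The suggester matches prefixes of the stored inputs, so typing "sam" already returns "samsung galaxy s4"; the three-letter minimum itself still has to be enforced in the client before sending the request.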

You need to use an edgeNGram tokenizer for this.
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edgeNGram",
        "min_gram": "3",
        "max_gram": "20"
      }
    },
    "analyzer": {
      "autocomplete_edge_ngram": {
        "filter": ["lowercase"],
        "type": "custom",
        "tokenizer": "autocomplete_tokenizer"
      }
    }
  }
}
and the mapping will be:
{
  "title_edge_ngram": {
    "type": "text",
    "analyzer": "autocomplete_edge_ngram",
    "search_analyzer": "standard"
  }
}
Or you can use the completion suggester in Elasticsearch.
The three-character minimum check has to be done on the client side itself.

Related

Elasticsearch with Java: exclude matches with random leading characters in a word

I am new to using Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration:
{
  "analysis": {
    "filter": {
      "shingle_filter": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 3,
        "output_unigrams": true,
        "token_separator": ""
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "shingle_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      },
      "shingle_index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "shingle_filter",
          "autocomplete_filter"
        ]
      }
    }
  }
}
I have this applied over multiple fields and doing a multi match query.
Following is the Java code:
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(QueryBuilders.multiMatchQuery(i)
        .field("title")
        .field("alias")
        .fuzziness(Fuzziness.ONE)
        .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
    .build();
The problem is that it also matches fields where the letters appear after some leading characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure that I only match terms that have no leading characters before the search input?
Update-1
Turning off fuzzy transposition seems to improve search results. But I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match on the complete field "ron" even higher, so the best option here would be to use a few subfields with the standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with fuzziness. It might make sense to run 2 queries in a should clause, with one being fuzzy and the other boosted, so that exact matches are ranked higher.
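As a hedged sketch of that suggestion (the index name, the title.std and title.keyword subfields, and the boost values are assumptions for illustration, not part of the original mapping), the query could look roughly like this:

GET /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title", "alias"],
            "fuzziness": 1
          }
        },
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title.std^3", "title.keyword^5"]
          }
        }
      ]
    }
  }
}

The first clause keeps recall through fuzziness, while the second, non-fuzzy clause boosts exact and whole-field matches so that "ron" and "ron mathews" rank above fuzzy hits such as "iron".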

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer will keep brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality over the standard one, so I want to keep the standard as much as possible.
The docs only show how to add filters and make other non-tokenizer changes, but I want to keep all of the standard tokenizer's behaviour while additionally tokenizing on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        },
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping char_filter and define it along with your preferred standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ => \\u0020" // replace underscore with whitespace
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
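To get the same behaviour at index time rather than only in _analyze, the mapping char filter can be attached to a custom analyzer that otherwise keeps the standard tokenizer (the filter name underscore_to_space is made up for illustration; the index name mine comes from the question):

PUT /mine
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => \\u0020"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

With this in place, brown_fox_has is tokenized into [brown, fox, has] while everything else behaves like the standard analyzer.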

How to handle auto completion on multi word text?

My input text is multiword English text, and I have the requirement to implement an autocompletion feature for that text.
I initially looked at completion suggesters, only to figure out that those can only match the first characters of the input. This is fine for auto completion of product names or addresses, but not very useful when requiring auto completion on any word in the input text.
After that I set up an edge_ngram analyzer and query to locate the documents which contain the input string. That works just fine, but I don't know how to use this information to provide options for my auto completion.
I could use a highlighter in order to show the words which match the query. That data could in turn be used to build a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved?
I'm unfortunately not able to maintain another field which could include the auto completion options for the documents.
I'm currently using highlight information of the query in order to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": ["uuid", "language"]
}
My auto field used the autocomplete analyzer.
"auto": {
"type": "string",
"analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all found tokens. These tokens are then used to construct the potential autocomplete phrase, also comparing it to the phrase that the user has already entered. The neat thing is that a stop word filter can be applied, so stopwords will never be highlighted and in turn never be used for autocomplete suggestions.
A PoC Java code of this processor can be found here
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
I think your best option is to create a dedicated index for storing just the suggestions, using the edge_ngram analyzer. If you use the completion suggester you need to explicitly define your actual suggestions anyway. The completion suggester is also document-centric in ES 5.x, so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but that has only just been released.
If you have a dedicated suggestion index you can use a hash of the suggestion as a document ID to avoid duplicates. You can start by indexing document titles and other useful metadata as suggestions. Later on you could include historical searches entered by users that are considered successful because the user ultimately clicked on or purchased the returned results.
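A rough sketch of such a dedicated suggestion index, reusing the edge_ngram analyzer from the question; the index name suggestions, the type and field names, the sample phrase, and the example document ID (standing in for a hash of the suggestion text) are all assumptions for illustration:

PUT /suggestions
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "suggestion": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

PUT /suggestions/suggestion/6a8f2c41
{"text": "content corrections"}

Because the document ID is derived from the suggestion text, re-indexing the same phrase overwrites the existing document instead of creating a duplicate.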

Tokenize a big word into combination of words

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter token filter, but neither seems to solve my problem. Basically I want to be able to split a big concatenated word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html
I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
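A minimal sketch of that idea using index-time synonyms on the title field (the index, analyzer, and synonym entries are examples only):

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "bowl_synonyms": {
          "type": "synonym",
          "synonyms": ["super bowl => superbowl"]
        }
      },
      "analyzer": {
        "title_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "bowl_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "title_synonyms"
        }
      }
    }
  }
}

With this analyzer, Super Bowl is indexed as the single token superbowl, so a term or match query for superbowl finds it; the drawback is that every such equivalence has to be listed explicitly.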
There are ways to do this without changing what you actually index. For example, if you are on at least 5.2 (where normalizers were introduced; earlier versions can do something similar, but 5.x makes it easier), you can define a normalizer to lowercase your text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. My solution, though, is specific to the example you have given. As is usually the case with Elasticsearch, one needs to think about what kind of data goes into Elasticsearch and what is required at search time.
In any case, if you are interested in this approach, here it is:
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}

POST test/test/1
{"title": "Super Bowl"}

GET /test/_search
{
  "query": {
    "fuzzy": {
      "title.keyword": "superbowl"
    }
  }
}

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "ngram"]
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": "double_metaphone_filter"
        }
      },
      "filter": {
        "double_metaphone_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or that I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, e.g. any index can have 1..n analyzers, only 1 filter, a query must use one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in this settings map object; when you were creating your index, the second key was overriding the first. Providing all of your filters under a single filter key (and both analyzers under a single analyzer key) works just fine.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
  "analyzer": "autocomplete_analyzer",
  "text": "Jonnie Smythe"
}
