Custom sorting in Elasticsearch

I have some documents in Elasticsearch with a completion suggester. When I search for a value like Stack, the results are shown in the order below:
Stack Overflow
Stack-Overflow
Stack
StackOver
StackOverflow
I want the results to be displayed in the order:
Stack
StackOver
StackOverflow
Stack Overflow
Stack-Overflow
i.e., exact matches should come first, before results containing spaces or special characters.
TIA

It all depends on how you analyze the string you are querying on. I suggest applying more than one analyzer to the same string field. Below is an example mapping for the "name" field over which you want the autocomplete/suggester feature:
"name": {
"type": "string",
"analyzer": "keyword_analyzer",
"fields": {
"name_ac": {
"type": "string",
"index_analyzer": "string_autocomplete_analyzer",
"search_analyzer": "keyword_analyzer"
}
}
}
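(Note: index_analyzer was removed in Elasticsearch 2.0, and the string type was replaced by text in 5.x; on recent versions the equivalent sub-field mapping would be roughly the sketch below.)
"name": {
  "type": "text",
  "analyzer": "keyword_analyzer",
  "fields": {
    "name_ac": {
      "type": "text",
      "analyzer": "string_autocomplete_analyzer",
      "search_analyzer": "keyword_analyzer"
    }
  }
}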
Here, keyword_analyzer and string_autocomplete_analyzer are analyzers defined in your index settings. Below is an example:
"keyword_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
"string_autocomplete_analyzer": {
"type": "custom",
"filter": [
"lowercase"
,
"autocomplete"
],
"tokenizer": "whitespace"
}
Here, autocomplete is an analysis filter:
"autocomplete": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "10"
}
Once this is set up, when querying Elasticsearch for suggestions you can use multi_match queries instead of plain match queries and boost the individual fields of the multi_match. Below is an example in Java:
QueryBuilders.multiMatchQuery(yourSearchString, "name^3", "name_ac");
You may need to alter the boost (^3) as per your needs.
If even this does not satisfy your requirements, you can look at adding one more analyzer that analyzes the string based on its first word, and include that field in the multi_match (see the example after the filter definitions below). Below is an example of such an analyzer:
"first_word_name_analyzer": {
"type": "custom",
"filter": [
"lowercase"
,
"whitespace_merge"
,
"edgengram"
],
"tokenizer": "keyword"
}
With these analysis filters:
"whitespace_merge": {
"pattern": "\s+",
"type": "pattern_replace",
"replacement": " "
},
"edgengram": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "32"
}
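For instance, assuming the first-word analyzer is applied to a sub-field named name_fw (a hypothetical name), the Java query could become:
QueryBuilders.multiMatchQuery(yourSearchString, "name^3", "name_fw^2", "name_ac");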
You may have to experiment with the boost values to reach the best results for your requirements. Hope this helps.

Related

Elasticsearch autocomplete on large text field

I am using ES 7.x, and the basic requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.
I have explored multiple ways, and all of them return the entire source document (or a specific field if I restrict it using _source). I have tried the edge n-gram and n-gram tokenizers, prefix queries, and the completion suggester.
Below is a sample document (the content field might contain thousands of sentences):
{
  "content": "Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}
Below is the expected output:
Search query: el
Output: ["elasticsearch","elastic","elk"]
Search Query: analytics e
Output: ["analytics engine"]
Currently I am not able to achieve the above output using the OOTB functionality, so I have used the highlight functionality of Elasticsearch, applied a regex to the result, and built a unique list of suggestions in Java.
Below is my current implementation using the highlight functionality.
Index Mapping:
PUT index
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "12",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "text_english",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        }
      }
    }
  }
}
Below is the Elasticsearch query, executed from Java:
POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}
We then apply a regex pattern to the highlight fragments of the above query's result.
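For illustration, a minimal sketch of that post-processing step in Java (class and method names are hypothetical; it assumes the default <em> highlight tags):
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuggestionExtractor {

    // Matches the text between the default <em>...</em> highlight tags.
    private static final Pattern HIGHLIGHT = Pattern.compile("<em>(.+?)</em>");

    // Builds a unique, order-preserving set of suggestions from one fragment.
    public static Set<String> extract(String highlightFragment) {
        Set<String> suggestions = new LinkedHashSet<>();
        Matcher m = HIGHLIGHT.matcher(highlightFragment);
        while (m.find()) {
            suggestions.add(m.group(1).toLowerCase());
        }
        return suggestions;
    }
}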
Please let me know if there is any way to achieve this without the above workaround.
The completion suggester is not the right tool for your use case, and ngrams/edge ngrams might be overkill depending on how much content you have.
Have you tried the match_phrase_prefix query, which matches full tokens next to one another and the last one as a prefix?
The query below is very simple and should work the way you expect.
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "analytics e"
    }
  }
}
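If the last prefix expands to too many terms, you can also cap the expansion (max_expansions defaults to 50), for example:
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "analytics e",
        "max_expansions": 10
      }
    }
  }
}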

Elasticsearch typeahead query optimization

I am currently working on typeahead support (contains, not just starts-with) for over 100,000,000 entries (and that number could grow arbitrarily) using Elasticsearch.
The current setup works, but I was wondering if there is a better approach.
I'm using AWS Elasticsearch, so I don't have full control over the cluster.
My index is defined as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": "lowercase"
        },
        "search_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        },
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        }
      }
    }
  },
  "mappings": {
    "account": {
      "properties": {
        "tags": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tags_prefix": {
          "type": "text",
          "analyzer": "edge_ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tenantId": {
          "type": "text",
          "analyzer": "keyword"
        },
        "referenceId": {
          "type": "text",
          "analyzer": "keyword"
        }
      }
    }
  }
}
The structure of the documents is:
{
  "tenantId": "1234",
  "name": "A NAME",
  "referenceId": "1234567",
  "tags": [
    "1234567",
    "A NAME"
  ],
  "tags_prefix": [
    "1234567",
    "A NAME"
  ]
}
The idea behind this structure is that documents have dedicated searchable fields over which typeahead works; it does not cover everything in the document, so those fields can even hold values that don't appear elsewhere in the document.
The search query is:
{
  "from": 0,
  "size": 10,
  "highlight": {
    "fields": {
      "tags": {}
    }
  },
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "a nam",
          "fields": ["tags_prefix^100", "tags"]
        }
      },
      "filter": {
        "term": {
          "tenantId": "1234"
        }
      }
    }
  }
}
I'm doing a multi_match because, while I need typeahead, results that match at the start need to come back first, so I followed the recommendation here.
The current setup is 10 shards, 3 master nodes (t2.mediums), 2 data/ingestion nodes (t2.mediums) with 35GB EBS disk on each, which I know is tiny given the final needs of the system, but useful enough for experimenting.
I have ~6,000,000 records inserted, and the response time with a cold cache is around 300 ms.
I was wondering whether this is the right approach, or whether there are optimizations I could apply to the index/query to make it more performant.
First, I think the solution you built is good, and the optimizations you are looking for should only be considered if you actually have an issue with the current solution, i.e. the queries are too slow. No need for premature optimization.
Second, you don't need to provide tags_prefix in your docs. All you need is to use the edge_ngram_tokenizer on the tags field, which will create the desired prefix tokens for the search to work. You can use multi-fields to apply multiple tokenizers to the same tags field, as sketched below.
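A hedged sketch of that multi-field mapping, reusing the analyzer names from the question (the prefix sub-field name is my own choice):
"tags": {
  "type": "text",
  "analyzer": "ngram_analyzer",
  "search_analyzer": "search_analyzer",
  "fields": {
    "prefix": {
      "type": "text",
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "search_analyzer"
    }
  }
}
The multi_match would then target tags.prefix^100 and tags, and you no longer have to index the same data twice.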
Third, use the edge_ngram_tokenizer settings carefully, especially min_gram and max_gram. Having too high a max_gram will:
a. create too many prefix tokens and use too much space;
b. decrease the indexing rate, as indexing takes longer;
c. not be useful: you don't expect autocomplete to take 300 prefix characters into account. A better max prefix token setting is (in my opinion) in the range of 10-20 characters, or even less; see the sketch below.
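For example, a more conservative tokenizer definition (the exact bounds are a judgment call) might be:
"edge_ngram_tokenizer": {
  "type": "edge_ngram",
  "min_gram": 3,
  "max_gram": 15,
  "token_chars": [
    "letter",
    "digit",
    "symbol",
    "punctuation",
    "whitespace"
  ]
}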
Good luck!

Why is my Elasticsearch prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using Elasticsearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query: it appears not to lowercase the search string despite my search analyzer having the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below.
ES Index Mapping
{
  "myIndex": {
    "mappings": {
      "pages": {
        "properties": {
          "Id": {},
          "Name": {
            "type": "text",
            "analyzer": "pageSearchAnalyzer",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "keywordAnalyzer",
                "search_analyzer": "pageSearchAnalyzer"
              }
            }
          },
          "Tokens": {} // Other fields not important for this question
        }
      }
    }
  }
}
ES Index Settings
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "ngram": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "analyzer": {
            "keywordAnalyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ]
            },
            "pageSearchAnalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ]
            },
            "pageIndexAnalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "trim",
                "lowercase",
                "asciifolding",
                "ngram"
              ]
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "l2AXoENGRqafm42OSWWTAg",
        "version": {}
      }
    }
  }
}
Prefix queries don't analyze the search term, so the text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer). Harry is therefore evaluated as-is, directly against the keyword-tokenized, custom-filtered harry potter that the keywordAnalyzer produced at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
I think it is better to use the match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
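For example, against the Name field from the question, that would look roughly like:
POST myIndex/_search
{
  "query": {
    "match_phrase_prefix": {
      "Name": "Harry Pot"
    }
  }
}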

Implementation of an autosuggestion feature like Alibaba's using Elasticsearch

I am trying to implement an "autosuggestion" feature in my application: when a user types a few letters, she should see a list of suggestions for the given input. I would like the feature to work similarly to Alibaba and similar sites.
I am using Elasticsearch and Java. Can anyone help me or give any suggestions on how to implement this functionality?
Suggestions in Elasticsearch can be implemented using 1) prefix matching, 2) completion suggesters, or 3) edge n-grams.
In this approach I chose an n-gram analyzer for autosuggestion: start by defining an nGram filter, then link the filter to a custom analyzer, and apply the analyzer to the fields chosen to provide suggestions.
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
Once the settings are in place, we can assign the analyzer (nGram_analyzer) to the fields that take part in suggestions, as in the sketch below.
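A hedged example of such a mapping (the type and field names are hypothetical; the whitespace_analyzer is used at search time so query terms are not n-grammed):
"mappings": {
  "product": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "nGram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      }
    }
  }
}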
The next step is to write a query that extracts suggestions, as follows.
"query": {
"match": {
"_all": {
"query": "gates",
"operator": "and"(optional , needed when we have multiple words or a senetence)
}
}
}
The material below helps to understand the ngram tokenizer in depth:
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/analysis-ngram-tokenizer.html
P.S.: each of the approaches listed above has its own advantages and drawbacks.

How can I get results that don't fully match using ElasticSearch?

If a user types
jewelr
I want to get results for
jewelry
I am using a multi_match query.
You could use the EdgeNGram tokenizer:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer/
Specify an index-time analyzer using it:
"analysis": {
"filter": {
"fulltext_ngrams": {
"side": "front",
"max_gram": 15,
"min_gram": 3,
"type": "edgeNGram"
}
},
"analyzer": {
"fulltext_index": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"asciifolding",
"fulltext_ngrams"
],
"type": "custom",
"tokenizer": "standard"
}
}
Then either specify it as the default index analyzer, or set it on a specific field mapping.
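For instance, a field mapping applying it at index time might look like this (the field name and search analyzer are assumptions; index_analyzer is the pre-2.0 parameter matching the era of the linked docs):
"name": {
  "type": "string",
  "index_analyzer": "fulltext_index",
  "search_analyzer": "standard"
}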
When indexing a field with the value jewelry using a 3/15 EdgeNGram, all front n-grams are stored:
jew
jewe
jewel
jewelr
jewelry
Then a search for jewelr will get a match in that document.
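Since the question uses a multi_match query, a hedged sketch of wiring it up (field names are assumptions):
{
  "query": {
    "multi_match": {
      "query": "jewelr",
      "fields": ["name", "description"]
    }
  }
}
With the ngram-analyzed fields in place, the partial input jewelr matches the stored grams directly.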
