Elasticsearch sort by text field keyword - sorting

I have an index with these settings:
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"asciifolding",
"elision",
"standard"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "32"
}
}
}
and this mapping for the name field:
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
Now I have several example names in documents. The name is a single field containing both the first name and the last name.
--макс---
-макс -
{something} макс
макс {something}
I am using this query to find the documents with that name, sorted alphabetically:
{
"query": {
"match": {
"name": {
"query": "макс",
"operator" : "and"
}
}
},
"sort": [
{"name.keyword" : "asc"}
]
}
It returns the results in the order I listed above, but I expect макс {something} to come first, ahead of the others, because it starts with the query text I typed.
Can somebody help me here?

By default, the query scores documents by how well they match, and that score is used to rank the best matches first. As soon as you define a sort, you are telling Elasticsearch to ignore the query score and rank the results by that field only. The results are still restricted to documents matching the query, but the notion of best match is lost unless you keep the special value _score somewhere in your sort clause.
Like this:
"sort": [
{
"productLine.keyword": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
}
]
Maybe you can just remove the sort and get the results you want from the default score ordering. If you want more support from the SO community, include a few example documents to make this fully reproducible.
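To see why the plain keyword sort returns the order shown in the question, here is a minimal Python sketch. It assumes that a keyword field sorts by the raw UTF-8 bytes of the value, which orders strings the same way as comparing Unicode code points. The `prefix_first` ordering is hypothetical application-side logic added for illustration, not an Elasticsearch feature.

```python
# Names taken from the question; "{something}" is kept as a literal placeholder.
names = ["макс {something}", "{something} макс", "-макс -", "--макс---"]

# Stand-in for sorting on name.keyword ascending: plain code-point order.
# "-" (U+002D) and "{" (U+007B) both sort before Cyrillic "м" (U+043C).
by_keyword = sorted(names)

# Hypothetical alternative: names starting with the query text come first,
# then everything else alphabetically.
query = "макс"
prefix_first = sorted(names, key=lambda n: (not n.startswith(query), n))

print(by_keyword)
print(prefix_first)
```

Punctuation sorts ahead of letters, so the name that actually starts with the query text ends up last under a pure keyword sort.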

Related

Configure highlighted part in the elasticsearch

Main question
The user is looking for a name and enters part of it, let's say au, and the document with the text paul is found.
I would like to have the doc highlighted like p<em>au</em>l.
How can I achieve it if I have a complex search query (combination of match, prefix, wildcard to rule relevance)?
Sub question
When do highlight settings from documentation for type, boundary_scanner and boundary_chars come into play? As per my tests described below, these settings don't change highlighted part.
Try 1: Wildcard query with default analyzer
PUT myindex
{
"mappings": {
"properties": {
"name": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindex/_doc/1
{
"name": "paul"
}
GET myindex/_search
{
"query": {
"wildcard": {"name": "*au*"}
},
"highlight": {
"fields": {
"name": {}
},
"type": "fvh",
"boundary_scanner": "chars",
"boundary_chars": "abcdefghijklmnopqrstuvwxyz.,!? \t\n"
}
}
This kind of search returns highlight <em>paul</em> but I need to get p<em>au</em>l.
Try 2: Match query with NGRAM analyzer
This one works as described in SO question: Highlighting part of word in elasticsearch
PUT myindexngram
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "index_ngram_analyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindexngram/_doc/1
{
"name": "paul"
}
GET myindexngram/_search
{
"query": {
"match": {"name": "au"}
},
"highlight": {
"fields": {
"name": {}
}
}
}
This highlights p<em>au</em>l as desired but:
Highlighting depends on the query type, so combining match and wildcard will again result in <em>paul</em>.
Highlighting is not affected at all by the type, boundary_scanner and boundary_chars settings.
Elastic version 7.13.4
Response from Elasticsearch team:
A highlighter works on terms, so only full terms can be highlighted, whatever the terms in your index are. In your second example, au could be highlighted because it is a term in the index, which is not the case in your first example.
There is also an option to define your own highlight_query that could be different from the main query, but this could lead to unpredictable highlights.
https://discuss.elastic.co/t/configure-highlighted-part/295164
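The point about terms can be illustrated with a toy model in Python. This is only a sketch of the idea, not how the Elasticsearch highlighters are implemented: `ngram_terms` and `highlight` are hypothetical helpers, and the term lists are simplified to (term, start offset, end offset) tuples.

```python
def ngram_terms(text, min_gram=2, max_gram=3):
    """Return (term, start, end) for every character n-gram of `text`."""
    terms = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end <= len(text):
                terms.append((text[start:end], start, end))
    return terms


def highlight(text, terms, query_term):
    """Wrap the first indexed term equal to `query_term` in <em> tags."""
    for term, start, end in terms:
        if term == query_term:
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return None  # no indexed term matches, so nothing can be highlighted


whole_word_terms = [("paul", 0, 4)]  # what a word-level analyzer would index
ngram_index = ngram_terms("paul")    # pa, pau, au, aul, ul

print(highlight("paul", ngram_index, "au"))         # p<em>au</em>l
print(highlight("paul", whole_word_terms, "au"))    # None
print(highlight("paul", whole_word_terms, "paul"))  # <em>paul</em>
```

With a word-level index the only term that can be wrapped is the whole word, which matches what the wildcard attempt returned.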

Elasticsearch typeahead query optimization

I am currently working on typeahead support (with contains, not just starts-with) for over 100.000.000 entries (and that number could grow arbitrarily) using ElasticSearch.
The current setup works, but I was wondering if there is a better approach to it.
I'm using AWS Elasticsearch, so I don't have full control over the cluster.
My index is defined as follows:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"edge_ngram_analyzer": {
"tokenizer": "edge_ngram_tokenizer",
"filter": "lowercase"
},
"search_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 300,
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation",
"whitespace"
]
},
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 300,
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation",
"whitespace"
]
}
}
}
},
"mappings": {
"account": {
"properties": {
"tags": {
"type": "text",
"analyzer": "ngram_analyzer",
"search_analyzer": "search_analyzer"
},
"tags_prefix": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "search_analyzer"
},
"tenantId": {
"type": "text",
"analyzer": "keyword"
},
"referenceId": {
"type": "text",
"analyzer": "keyword"
}
}
}
}
}
The structure of the documents is:
{
"tenantId": "1234",
"name": "A NAME",
"referenceId": "1234567",
"tags": [
"1234567",
"A NAME"
],
"tags_prefix": [
"1234567",
"A NAME"
]
}
The point behind the structure is that documents have searchable fields over which typeahead works. It is not over everything in the document, so the tags could be things that are not even in the document itself.
The search query is:
{
"from": 0,
"size": 10,
"highlight": {
"fields": {
"tags": {}
}
},
"query": {
"bool": {
"must": {
"multi_match": {
"query": "a nam",
"fields": ["tags_prefix^100", "tags"]
}
},
"filter": {
"term": {
"tenantId": "1234"
}
}
}
}
}
I'm doing a multi_match because, while I need typeahead, the results that have the match at the start need to come back first, so I followed the recommendation in here
The current setup is 10 shards, 3 master nodes (t2.mediums), 2 data/ingestion nodes (t2.mediums) with 35GB EBS disk on each, which I know is tiny given the final needs of the system, but useful enough for experimenting.
I have ~6000000 records inserted, and the response time with a cold cache is around 300ms.
I was wondering if this is the right approach or are there some optimizations I could implement to the index/query to make this more performant?
First, I think the solution you built is good, and the optimisations you are looking for should only be considered if you have an actual problem with the current solution, meaning the queries are too slow. There is no need for premature optimisation.
Second, I think you don't need to provide tags_prefix in your docs. All you need is to use the edge_ngram_tokenizer on the tags field, which will create the prefix tokens the search needs. You can use multi-fields to apply multiple tokenizers to the same 'tags' field.
Third, choose the edge_ngram_tokenizer settings carefully, especially 'min_gram' and 'max_gram'. A max_gram that is too high will:
a. create too many prefix tokens, which uses too much space
b. decrease the indexing rate, as indexing takes longer
c. not be useful: you don't expect auto-complete to take 300 prefix characters into account. A better maximum prefix length would be (in my opinion) in the range of 10-20 characters, or even less.
Good luck!
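As a rough way to judge the effect of max_gram, the number of tokens produced for one value can be counted in Python. This is a back-of-the-envelope sketch: it assumes the whole tag is a single run of allowed characters (as with the token_chars in the question) and ignores how Elasticsearch actually stores the terms. The function names are made up for this example.

```python
def edge_ngram_count(length, min_gram, max_gram):
    """Number of prefix tokens emitted for one string of `length` characters."""
    longest = min(length, max_gram)
    return max(0, longest - min_gram + 1)


def ngram_count(length, min_gram, max_gram):
    """Number of substring tokens emitted for one string of `length` characters."""
    longest = min(length, max_gram)
    return sum(length - size + 1 for size in range(min_gram, longest + 1))


tag_length = 40  # hypothetical tag of 40 characters
for max_gram in (300, 20):
    print(
        max_gram,
        edge_ngram_count(tag_length, 3, max_gram),
        ngram_count(tag_length, 3, max_gram),
    )
```

For a 40-character tag this gives 38 prefix tokens with max_gram 300 versus 18 with max_gram 20, and 741 versus 531 tokens for the full ngram tokenizer, so the contains-style field is where most of the index size comes from.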

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using a ngram tokenizer ... so I assume cannot be relevant to my problem but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass into it bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer) and evaluates Harry as-is directly against the keyword-tokenized, custom-filtered harry potter that was the result of the keywordAnalyzer applied at index time.
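A small Python model of that behaviour, under simplifying assumptions: `keyword_analyzer` is a hypothetical stand-in for the keywordAnalyzer (it trims and lowercases but omits asciifolding), and `prefix_query` just compares the raw input against the stored term the way a term-level query would.

```python
def keyword_analyzer(text):
    """Rough stand-in for keyword tokenizer + trim + lowercase (no asciifolding)."""
    return text.strip().lower()


def prefix_query(indexed_term, raw_input):
    """Term-level comparison: the input is used as-is, with no analysis."""
    return indexed_term.startswith(raw_input)


indexed_term = keyword_analyzer("Harry Potter")  # "harry potter" at index time

print(prefix_query(indexed_term, "Harry"))          # False: "H" != "h"
print(prefix_query(indexed_term, "harry"))          # True
print(prefix_query(indexed_term, "Harry".lower()))  # True: lowercased by the app
```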
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query like described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"name.ngram": {
"query": "Har",
"operator": "and"
}
}
}
}
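Why the match against the ngram field finds "Har" can be sketched in Python. This is an approximation of the pageIndexAnalyzer above (keyword tokenizer, then trim, lowercase and an edge n-gram filter with min_gram 2 and max_gram 15, with asciifolding left out); `edge_ngrams` is a helper written for this example.

```python
def edge_ngrams(text, min_gram=2, max_gram=15):
    """Prefixes of `text` from min_gram up to max_gram characters."""
    longest = min(len(text), max_gram)
    return [text[:size] for size in range(min_gram, longest + 1)]


# Index time: the whole value is one token, lowercased, then expanded to prefixes.
index_terms = edge_ngrams("Harry Potter".strip().lower())

# Search time: the standard analyzer lowercases the input but adds no n-grams.
search_term = "Har".lower()

print(index_terms)
print(search_term in index_terms)  # True
print("pot" in index_terms)        # False
```

Because the keyword tokenizer keeps the whole name as a single token, only prefixes of the full value are indexed, so "Pot" would not match "Harry Potter" on this field.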
I think it is better to use a match_phrase_prefix query, without the .keyword suffix. See the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

elasticsearch: giving score to ngrams

I am trying to use Elasticsearch to do name matching using ngrams.
The technique I am trying to implement is as follows:
input: a name that needs to be matched against the db.
output: all potential matching names from my db of names.
The way I try to do that is as follows: I split the name into ngrams of length 3-5.
I then collect all the names from the db that match those ngrams.
Then I go over the ngrams and sort them by their inverse frequency,
meaning that common ngrams get the lowest score.
For example, if I use it on a company name like "my company inc", I will give the "inc" ngram the lowest score, because inc appears in a lot of company names.
The way I calculate the score is 1/(number of appearances of the ngram in my whole db), so that the "strongest" ngrams are the ones that appear the least.
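A minimal Python sketch of the scoring described above. This is not the original script, only an illustration under assumptions: the names are made-up sample data, "appearances" is interpreted as the number of names containing the ngram, and splitting on whitespace approximates the letter/digit token_chars of the tokenizer.

```python
from collections import Counter


def ngrams(name, min_gram=3, max_gram=5):
    """Set of character n-grams taken from each whitespace-separated word."""
    grams = set()
    for word in name.lower().split():
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams


def inverse_frequency_scores(query, names):
    """Score each n-gram of `query` as 1 / (number of names containing it)."""
    doc_freq = Counter(g for name in names for g in ngrams(name))
    return {g: 1 / doc_freq[g] for g in ngrams(query) if doc_freq[g]}


names = ["my company inc", "other firm inc", "acme inc"]  # made-up sample data
scores = inverse_frequency_scores("my company inc", names)

print(scores["inc"])  # lowest score: "inc" appears in all three names
print(scores["com"])  # 1.0: "com" appears in one name only
```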
I implemented this in a Python script, but I want to use the power of Elasticsearch to do the same for me.
I know about the ngram tokenizer, but is there a way to tell it to compute the score the way I do?
As far as I know, when I run a match now, it scores the result by how many of the ngrams in the query match the ngrams of the name stored in the db.
This is the mapping I use:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"names": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "my_analyzer"
},
"id": {
"type": "long"
}
}
}
}
}
this is the query I do:
GET /names/_search
{
"query": {
"match" : { "name" : "my company inc"}
}
}
The query that you would want to use is this:
{
"query": {
"common": {
"name": {
"query": "my company inc",
"cutoff_frequency": 0.001
}
}
}
}
Common terms query returns the relevance score based only on important terms (important nGrams) i.e. terms with less frequency. Here, the words that have a document frequency greater than 0.1% will be considered as common words and will not affect the relevance score.
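The split into common and important terms can be pictured with a short Python sketch. It models only the relative cutoff (a fraction of all documents), uses invented document frequencies, and does not reproduce how the query then scores each group; `split_by_cutoff` is a hypothetical helper.

```python
def split_by_cutoff(doc_freq, total_docs, cutoff_frequency=0.001):
    """Separate terms into low-frequency and high-frequency groups."""
    low, high = [], []
    for term, freq in doc_freq.items():
        if freq / total_docs > cutoff_frequency:
            high.append(term)  # treated as a common term
        else:
            low.append(term)   # treated as an important term
    return low, high


# Invented document frequencies for three n-grams in an index of 100,000 names.
doc_freq = {"inc": 4200, "com": 35, "mpa": 7}
low, high = split_by_cutoff(doc_freq, total_docs=100_000)

print(low)   # ['com', 'mpa']
print(high)  # ['inc']
```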
Alternatively, if you already have a predefined list of stopwords (inc, pvt, ltd), then you can always use a custom stop words filter in your analyzer to filter them out for generating hits.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [
"custom_stop_token_filter"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": ["letter", "digit"]
}
},
"filter": {
"custom_stop_token_filter": {
"type": "stop",
"stopwords": [
"inc",
"pvt",
"ltd"
]
}
}
}
}
}
For more info:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-common-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

ngrams in Elasticsearch are not working

I use an Elasticsearch ngram filter:
"analysis": {
"filter": {
"desc_ngram": {
"type": "ngram",
"min_gram": 3,
"max_gram": 8
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "desc_ngram", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
And I have 2 documents here:
{
"name": "Shana Calandra",
"username": "shacalandra"
},
{
"name": "Shana Launer",
"username": "shalauner"
}
And I am using this query:
{
query: {
match: {
_all: "Shana"
}
}
}
When I search with this query, it returns both documents, but I can't search by part of a word. For example, I can't use "Shan" instead of "Shana" in the query, because it doesn't return anything.
Maybe my mapping is wrong. I can't tell whether the problem is in the mapping or in the query.
If you specify
"mappings": {
"test": {
"_all": {
"index_analyzer": "index_ngram",
"search_analyzer": "search_ngram"
},
for the _all field in your mapping, then it will work. _all has its own analyzers, and I suspect you applied the analyzers only to name and username and not to _all.

Resources