Elasticsearch sort settings on index giving strange results

I have an index set up like so:
PUT items
{
"settings": {
"index": {
"sort.field": ["popularity", "title_keyword"],
"sort.order": ["desc", "asc"]
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
},
"title_keyword": {
"type": "keyword"
},
"popularity": {
"type": "integer"
},
"visibility": {
"type": "keyword"
}
}
}
}
With the following data:
POST items/_doc/1
{
"title": "The Arbor",
"popularity": 5,
"title_keyword": "The Arbor",
"visibility": "public"
}
POST items/_doc/2
{
"title": "The Canon",
"popularity": 10,
"title_keyword": "The Canon",
"visibility": "public"
}
POST items/_doc/3
{
"title": "The Brew",
"popularity": 15,
"title_keyword": "The Brew",
"visibility": "public"
}
I run this query on the data:
GET items/_search
{
"size": 3,
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "the",
"operator": "and"
}
}
},
{
"match": {
"visibility": "public"
}
}
]
}
},
"highlight": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fields": {
"title": {}
}
}
}
It seems to match the records correctly on the word "the", but the sorting does not seem to work. I would expect the results to be sorted by popularity as defined in the index settings, i.e. The Brew, The Canon, The Arbor, but the results I get are as follows:
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.27381438,
"hits" : [
{
"_index" : "items",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.27381438,
"_source" : {
"title" : "The Brew",
"popularity" : 15,
"title_keyword" : "The Brew",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Brew"
]
}
},
{
"_index" : "items",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.26392496,
"_source" : {
"title" : "The Arbor",
"popularity" : 5,
"title_keyword" : "The Arbor",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Arbor"
]
}
},
{
"_index" : "items",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.26392496,
"_source" : {
"title" : "The Canon",
"popularity" : 10,
"title_keyword" : "The Canon",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Canon"
]
}
}
]
}
}
Does defining the sort fields and orders when creating the index, under the settings, automatically sort the results? It seems to be sorting by score and not by popularity. If I include the sort options in the query itself, I get the correct results back:
GET items/_search
{
"size": 3,
"sort": [
{
"popularity": {
"order": "desc"
}
},
{
"title_keyword": {
"order": "asc"
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "the",
"operator": "and"
}
}
},
{
"match": {
"visibility": "public"
}
}
]
}
},
"highlight": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fields": {
"title": {}
}
}
}
I read that including the sort in the query like this is inefficient and that it should be defined in the index settings instead. Am I missing a step when creating the index to make it sort by popularity by default? Does including the sort options in the query make queries inefficient? Or do I actually need to include the sort in every query?
Hopefully this makes sense! Thanks

Index sorting defines how segments are sorted within a shard; it is not related to the sorting of search results. A sorted index is useful if you often run searches that sort by the same criteria, in which case the index sort speeds up those searches.
If your search sorts differently from the index, or doesn't sort at all, the index sort is not relevant.
Please see the documentation for index sorting and especially the part that explains how index sorting is used.
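To make this concrete, here is a sketch of a query that can benefit from the index sort defined above (field names taken from the question's index). Because the query sort matches sort.field/sort.order, Elasticsearch can stop collecting hits in each segment early; this works best when exact hit counting is disabled via track_total_hits. Note you still include the sort clause in every query; the index sort just makes such queries cheaper.

```json
GET items/_search
{
  "size": 3,
  "track_total_hits": false,
  "sort": [
    { "popularity": { "order": "desc" } },
    { "title_keyword": { "order": "asc" } }
  ],
  "query": {
    "match": { "visibility": "public" }
  }
}
```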

Related

How highlight partial word when using edge_ngram filter

I am using the edge_ngram filter in my analyzer. For example, I index the word "EVA京", and it is mapped to the array [E, EV, EVA, 京]. When I then search for "EV", "EVA京" is found as expected, but the highlighting is wrong: the result is "<em>EVA</em>京" instead of "<em>EV</em>A京".
Can someone give me a hint how to correct the highlight result?
My index settings and mappings:
PUT my-index-001
{
"settings": {
"analysis": {
"filter": {
"edge_ngram_1_100": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 100
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "edge_ngram_1_100"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
And then add a sentence:
PUT my-index-001/_doc/1
{
"name": "EVA新世纪福音战士"
}
And then I search:
GET my-index-001/_search
{
"query": {
"match": {
"name": {
"query": "EV"
, "operator": "and"
}
}
}
, "highlight": {
"fields": {
"name": {
}
}
}
}
The result of searching is:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.3133171,
"hits" : [
{
"_index" : "my-index-001",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.3133171,
"_source" : {
"name" : "EVA新世纪福音战士"
},
"highlight" : {
"name" : [
"<em>EVA</em>新世纪福音战士"
]
}
}
]
}
}
If you need to highlight partial word matches in a document, you can achieve it in the following way.
Below is a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"EVA新世纪福音战士"
}
Search Query:
{
"query": {
"match": {
"name": {
"query": "EV"
}
}
},
"highlight": {
"fields": {
"*": {
"type": "plain",
"fragment_size": 20,
"pre_tags": "<span class='bold'>",
"post_tags": "</span>",
"number_of_fragments": 1
}
}
}
}
Search Result:
"hits": [
{
"_index": "65821975",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"name": "EVA新世纪福音战士"
},
"highlight": {
"name": [
"<span class='bold'>EV</span>A新世纪福音战士"
]
}
}
]

Two filters (RANGE) in different fields in elasticsearch

I am a beginner in Elasticsearch and I want the query below to work with two filters, each a range on a different field, but only the first range is being applied.
This filter is working normally:
"range" : {"pgrk" : { "gte" : 1, "lte" : 10} }
Could someone tell me why this second filter below doesn't work?
"should" : {
"range" : {"url_length" : { "lte" : 100 } }
My query with the two filters:
{
"from" : 0, "size" : 10,
"sort" : [
{ "pgrk" : {"order" : "desc"} },
{ "url_length" : {"order" : "asc"} }
],
"query": {
"bool": {
"must": {
"multi_match" : {
"query": "netflix",
"type": "cross_fields",
"fields": [ "titulo", "descricao", "url" ],
"operator": "and"
}
},
"filter": {
"range" : {"pgrk" : { "gte" : 1, "lte" : 10} }
},
"should" : {
"range" : {"url_length" : { "lte" : 100 } }
}
}
}
}
I'm not sure what your exact requirement is, as the index mapping and sample documents are not provided, so I created my own mapping and sample documents to show how to combine multiple range queries in filter context.
Please comment so that I can modify the answer if its results don't match your requirements.
Index Def
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
}
}
}
}
Index sample docs
{
"title": "netflix",
"url" : "www.netflix.com", --> this shouldn't match as `pgrk > 10`
"pgrk": 12,
"url_length" : 50
}
{
"title": "Netflix", --> this should match both filetrs
"url" : "www.netflix.com",
"pgrk": 8,
"url_length" : 50
}
{
"title": "Netflix", --> this should match both filetrs
"url" : "www.netflix",
"pgrk": 5,
"url_length" : 50
}
{
"title": "netflix",
"url" : "www.netflix",
"pgrk": 5,
"url_length" : 80. --> note pgrk has same 5 as prev and url_length is diff
}
Search query
{
"from": 0,
"size": 10,
"sort": [
{
"pgrk": {
"order": "desc"
}
},
{
"url_length": {
"order": "asc"
}
}
],
"query": {
"bool": {
"must": {
"multi_match": {
"query": "netflix",
"type": "cross_fields",
"fields": [
"title",
"url"
],
"operator": "and"
}
},
"filter": [ --> note filter array to have multiple range queries in filter context
{
"range": {
"pgrk": {
"gte": 1,
"lte" : 10
}
}
},
{
"range": {
"url_length": {
"lte": 100
}
}
}
]
}
}
}
And the search result, which brings back only three docs (two of them share the same pgrk value):
"hits": [
{
"_index": "so_range",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"title": "netflix",
"url": "www.netflix.com",
"pgrk": 8,
"url_length": 50
},
"sort": [
8,
50
]
},
{
"_index": "so_range",
"_type": "_doc",
"_id": "3",
"_score": null,
"_source": {
"title": "netflix",
"url": "www.netflix",
"pgrk": 5,
"url_length": 50
},
"sort": [
5,
50
]
},
{
"_index": "so_range",
"_type": "_doc",
"_id": "4",
"_score": null,
"_source": {
"title": "netflix",
"url": "www.netflix",
"pgrk": 5,
"url_length": 80
},
"sort": [
5,
80
]
}
]

How do I get Elasticsearch to highlight a partial word from a search_as_you_type field?

I'm having trouble setting up a search_as_you_type field with highlighting following the guide here https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-as-you-type.html
I'll leave a series of commands to reproduce what I'm seeing. Hopefully somebody can weigh in on what I'm missing :)
create mapping
PUT /test_index
{
"mappings": {
"properties": {
"plain_text": {
"type": "search_as_you_type",
"index_options": "offsets",
"term_vector": "with_positions_offsets"
}
}
}
}
insert document
POST /test_index/_doc
{
"plain_text": "This is some random text"
}
search for document
GET /test_index/_search
{
"query": {
"multi_match": {
"query": "rand",
"type": "bool_prefix",
"fields": [
"plain_text",
"plain_text._2gram",
"plain_text._3gram",
"plain_text._index_prefix"
]
}
},
"highlight" : {
"fields" : [
{
"plain_text": {
"number_of_fragments": 1,
"no_match_size": 100
}
}
]
}
}
response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "rLZkjm8BDC17cLikXRbY",
"_score" : 1.0,
"_source" : {
"plain_text" : "This is some random text"
},
"highlight" : {
"plain_text" : [
"This is some random text"
]
}
}
]
}
}
The response I get back does not have the highlighting I expect.
Ideally the highlight would be: This is some <em>ran</em>dom text
In order to highlight partial-word (character) n-gram matches you'll need:
a custom ngram tokenizer. By default the maximum difference between min_gram and max_gram is 1, so in my example highlighting will only work for search terms of length 3 or 4. You can allow a wider range, and thereby more n-grams, by setting a higher value for index.max_ngram_diff.
a custom analyzer based on the custom tokenizer
a "plain_text.ngrams" sub-field in the mapping
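As an aside on the first point: if you widen the gram range beyond a difference of 1, you must also raise index.max_ngram_diff in the same settings block, otherwise index creation is rejected on 7.x. A minimal settings fragment (the values here are illustrative, not from the question):

```json
{
  "settings": {
    "index": {
      "max_ngram_diff": 17
    },
    "analysis": {
      "tokenizer": {
        "ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      }
    }
  }
}
```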
Here's the configuration:
{
"settings": {
"analysis": {
"analyzer": {
"partial_words" : {
"type": "custom",
"tokenizer": "ngrams",
"filter": ["lowercase"]
}
},
"tokenizer": {
"ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4
}
}
}
},
"mappings": {
"properties": {
"plain_text": {
"type": "text",
"fields": {
"shingles": {
"type": "search_as_you_type"
},
"ngrams": {
"type": "text",
"analyzer": "partial_words",
"search_analyzer": "standard",
"term_vector": "with_positions_offsets"
}
}
}
}
}
}
the query:
{
"query": {
"multi_match": {
"query": "rand",
"type": "bool_prefix",
"fields": [
"plain_text.shingles",
"plain_text.shingles._2gram",
"plain_text.shingles._3gram",
"plain_text.shingles._index_prefix",
"plain_text.ngrams"
]
}
},
"highlight" : {
"fields" : [
{
"plain_text.ngrams": { }
}
]
}
}
and the result:
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "FkHLVHABd_SGa-E-2FKI",
"_score": 2,
"_source": {
"plain_text": "This is some random text"
},
"highlight": {
"plain_text.ngrams": [
"This is some <em>rand</em>om text"
]
}
}
]
Note: in some cases, this config might be expensive for memory usage and storage.
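You can sanity-check what the custom analyzer emits with the _analyze API (assuming the settings above were applied to test_index). For "random" you should see its lowercased 3- and 4-character grams, e.g. ran, rand, and, ando, and so on:

```json
GET test_index/_analyze
{
  "analyzer": "partial_words",
  "text": "random"
}
```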

Elastic terms query with emails not working

I want to pass a list of emails to an Elasticsearch terms query, so I tried the query below, but I got no results.
{
"query": {
"terms": {
"email": [ "andrew#gmail.com", "michel#gmail.com" ]
}
}
}
When I used id instead of emails, it worked!
{
"query": {
"terms": {
"id": [ 43, 67 ]
}
}
}
Could you please explain what's wrong with my email query and how to make it work?
If you want email addresses to be recognized as single tokens, you should use the uax_url_email tokenizer.
UAX URL Email Tokenizer
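To see why the original terms query found nothing, compare how the two tokenizers treat an address with the _analyze API (the address is illustrative): the standard tokenizer splits it at the @, roughly into andrew and gmail.com, while uax_url_email keeps the whole address as one token.

```json
GET _analyze
{
  "tokenizer": "standard",
  "text": "andrew@gmail.com"
}

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "andrew@gmail.com"
}
```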
A working example:
Mappings
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"filter": ["lowercase", "stop"]
}
},
"tokenizer": {
"my_tokenizer":{
"type": "uax_url_email"
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "my_email_analyzer",
"search_analyzer": "my_email_analyzer",
"fields": {
"keyword":{
"type":"keyword"
}
}
}
}
}
}
POST few documents
POST my_index/_doc/1
{
"email":"andrew#gmail.com"
}
POST my_index/_doc/2
{
"email":"michel#gmail.com"
}
Search Query
GET my_index/_search
{
"query": {
"multi_match": {
"query": "andrew#gmail.com michel#gmail.com",
"fields": ["email"]
}
}
}
Results
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6931472,
"_source" : {
"email" : "andrew#gmail.com"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"email" : "michel#gmail.com"
}
}
]
}
Another option is to use the keyword sub-field defined in the mapping above.
Search Query
GET my_index/_search
{
"query": {
"terms": {
"email.keyword": [
"andrew#gmail.com",
"michel#gmail.com"
]
}
}
}
In my opinion using the uax_url_email tokenizer is a better solution.
Hope this helps

How to filter by the size of an array in nested type?

Let's say I have the following type:
{
"2019-11-04": {
"mappings": {
"_doc": {
"properties": {
"labels": {
"type": "nested",
"properties": {
"confidence": {
"type": "float"
},
"created_at": {
"type": "date",
"format": "strict_date_optional_time||date_time||epoch_millis"
},
"label": {
"type": "keyword"
},
"updated_at": {
"type": "date",
"format": "strict_date_optional_time||date_time||epoch_millis"
},
"value": {
"type": "keyword",
"fields": {
"numeric": {
"type": "float",
"ignore_malformed": true
}
}
}
}
},
"params": {
"type": "object"
},
"type": {
"type": "keyword"
}
}
}
}
}
}
And I want to filter by the size/length of the labels array. I've tried the following (as the official docs suggest):
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "doc['labels'].size > 10"
}
}
}
}
}
}
but I keep getting:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['labels'].size > 10",
" ^---- HERE"
],
"script": "doc['labels'].size > 10",
"lang": "painless"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "2019-11-04",
"node": "kk5MNRPoR4SYeQpLk2By3A",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['labels'].size > 10",
" ^---- HERE"
],
"script": "doc['labels'].size > 10",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "No field found for [labels] in mapping with types []"
}
}
}
]
},
"status": 500
}
I'm afraid that is not directly possible, because labels itself is not a field on which Elasticsearch builds an inverted index.
doc['fieldname'] is only applicable to fields for which index structures (inverted index/doc values) are created, and Elasticsearch's Query DSL likewise only works on such fields; unfortunately a nested field as a whole is not such a field.
That said, below are two ways of doing this.
For the sake of simplicity, I've created a sample mapping, sample documents, and two possible solutions which may help you.
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"myfield": {
"type": "nested",
"properties": {
"label": {
"type": "keyword"
}
}
}
}
}
}
Sample Documents:
// single field inside 'myfield'
POST my_sample_index/_doc/1
{
"myfield": {
"label": ["New York", "LA", "Austin"]
}
}
// two fields inside 'myfield'
POST my_sample_index/_doc/2
{
"myfield": {
"label": ["London", "Leicester", "Newcastle", "Liverpool"],
"country": "England"
}
}
Solution 1: Using Script Fields (Managing at Application Level)
I have a workaround to get what you want, well not exactly but would help you filter out on your service layer or application.
POST my_sample_index/_search
{
"_source": "*",
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"script_fields": {
"label_size": {
"script": {
"lang": "painless",
"source": "params['_source']['labels'].size() > 1"
}
}
}
}
You will notice that in the response a separate field label_size is returned with a true or false value.
A sample response is something like below:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_sample_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"myfield" : {
"label" : [
"New York",
"LA",
"Austin"
]
}
},
"fields" : {
"label_size" : [ <---- Scripted Field
false
]
}
},
{
"_index" : "my_sample_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"myfield" : {
"country" : "England",
"label" : [
"London",
"Leicester",
"Newcastle",
"Liverpool"
]
}
},
"fields" : { <---- Scripted Field
"label_size" : [
true <---- True because 'myfield' has two fields, 'label' and 'country'
]
}
}
]
}
}
Note that only the second document qualifies, as its myfield has two fields, i.e. country and label. However, if you only want the docs whose label_size is true, that filtering would have to be managed at your application layer.
Solution 2: Reindexing with labels.size using Script Processor
Create a new index as below:
PUT my_sample_index_temp
{
"mappings": {
"properties": {
"myfield": {
"type": "nested",
"properties": {
"label": {
"type": "keyword"
}
}
},
"labels_size":{ <---- New Field where we'd store the size
"type": "integer"
}
}
}
}
Create the below pipeline:
PUT _ingest/pipeline/set_labels_size
{
"description": "sets the value of labels size",
"processors": [
{
"script": {
"source": """
ctx.labels_size = ctx.myfield.size();
"""
}
}
]
}
Use the Reindex API to reindex from the my_sample_index index:
POST _reindex
{
"source": {
"index": "my_sample_index"
},
"dest": {
"index": "my_sample_index_temp",
"pipeline": "set_labels_size"
}
}
Verify the documents in my_sample_index_temp using GET my_sample_index_temp/_search
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_sample_index_temp",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"labels_size" : 1, <---- New Field Created
"myfield" : {
"label" : [
"New York",
"LA",
"Austin"
]
}
}
},
{
"_index" : "my_sample_index_temp",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"labels_size" : 2, <----- New Field Created
"myfield" : {
"country" : "England",
"label" : [
"London",
"Leicester",
"Newcastle",
"Liverpool"
]
}
}
}
]
}
}
Now you can simply use the labels_size field in your query, which is both easier and more efficient.
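For instance, the "more than 10 labels" condition from the question becomes a plain range filter on the new field (a sketch; the threshold of 10 is taken from the question):

```json
GET my_sample_index_temp/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "labels_size": { "gt": 10 }
        }
      }
    }
  }
}
```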
Hope this helps!
You can solve it with a custom score approach:
GET 2019-11-04/_search
{
"min_score": 0.1,
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": "params['_source']['labels'].length > 10 ? 1 : 0"
}
}
}
]
}
}
}
