Why is ascending geo distance sorting faster than descending geo distance sorting?

I'm using Elasticsearch 6.6 and have an index (1 shard, 1 replica) with the GeoNames (https://www.geonames.org/) dataset indexed (index size = 1.3 GB, 11.8 million geopoints).
I was playing around a bit with the geo distance sort, sorting the whole index relative to some origin points. After some testing I noticed that ascending sorting is always faster than descending sorting. Here is an example query (I also tested with larger values for the "size" parameter):
POST /geonames/_search?request_cache=false
{
"size":1,
"sort" : [
{
"_geo_distance" : {
"location" : [8, 49],
"order" : "asc",
"unit" : "m",
"mode" : "min",
"distance_type" : "arc",
"ignore_unmapped": true
}
}
]
}
Here is the response for ascending sorting (with explain and profile set to true):
{
"took" : 1374,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 11858060,
"max_score" : null,
"hits" : [
{
"_shard" : "[geonames][0]",
"_node" : "qXTymyB9QLmxhPtGEtA_mA",
"_index" : "geonames",
"_type" : "doc",
"_id" : "L781LmkBrQo0YN4qP48D",
"_score" : null,
"_source" : {
"id" : "3034701",
"name" : "ForĂȘt de Wissembourg",
"location" : {
"lat" : "49.00924",
"lon" : "8.01542"
}
},
"sort" : [
1523.4121312414704
],
"_explanation" : {
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
}
]
},
"profile" : {
"shards" : [
{
"id" : "[qXTymyB9QLmxhPtGEtA_mA][geonames][0]",
"searches" : [
{
"query" : [
{
"type" : "MatchAllDocsQuery",
"description" : "*:*",
"time_in_nanos" : 265223567,
"breakdown" : {
"score" : 0,
"build_scorer_count" : 54,
"match_count" : 0,
"create_weight" : 10209,
"next_doc" : 253091268,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 11858087,
"score_count" : 0,
"build_scorer" : 263948,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 1097,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 1044167746,
"children" : [
{
"name" : "SimpleFieldCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 508296683
}
]
}
]
}
],
"aggregations" : [ ]
}
]
}
}
And here is the response for descending sorting, with the order parameter switched from asc to desc (also with profile and explain):
{
"took" : 2226,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 11858060,
"max_score" : null,
"hits" : [
{
"_shard" : "[geonames][0]",
"_node" : "qXTymyB9QLmxhPtGEtA_mA",
"_index" : "geonames",
"_type" : "doc",
"_id" : "Mq80LmkBrQo0YN4q11bA",
"_score" : null,
"_source" : {
"id" : "4036351",
"name" : "Bollons Seamount",
"location" : {
"lat" : "-49.66667",
"lon" : "-176.16667"
}
},
"sort" : [
1.970427111052182E7
],
"_explanation" : {
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
}
]
},
"profile" : {
"shards" : [
{
"id" : "[qXTymyB9QLmxhPtGEtA_mA][geonames][0]",
"searches" : [
{
"query" : [
{
"type" : "MatchAllDocsQuery",
"description" : "*:*",
"time_in_nanos" : 268521404,
"breakdown" : {
"score" : 0,
"build_scorer_count" : 54,
"match_count" : 0,
"create_weight" : 9333,
"next_doc" : 256458664,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 11858087,
"score_count" : 0,
"build_scorer" : 195265,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 1142,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 1898324618,
"children" : [
{
"name" : "SimpleFieldCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 1368306442
}
]
}
]
}
],
"aggregations" : [ ]
}
]
}
}
So my question is: why is this the case? As I understood it, Elasticsearch calculates the distance from the origin point to every other point and then sorts them. So why is descending sorting so much slower?

I asked the same question on the Elasticsearch discussion board and got an answer.
Apparently Elasticsearch uses different search strategies/algorithms for ascending and descending distance sorting.
For descending sorting it calculates the distance from the origin to every point and then sorts.
For ascending sorting it uses bounding boxes to filter points near the origin and only calculates distances for points inside those bounding boxes.
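One way to sanity-check that explanation (a sketch, not from the board answer): a script-based sort has to evaluate the distance for every document and therefore cannot use the bounding-box shortcut, so an ascending Painless script sort should take roughly as long as the descending _geo_distance sort. Assuming the same index and origin:
POST /geonames/_search?request_cache=false
{
  "size": 1,
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "source": "doc['location'].arcDistance(params.lat, params.lon)",
          "params": { "lat": 49.0, "lon": 8.0 }
        }
      }
    }
  ]
}
If the timing of this query is in the same ballpark as the descending _geo_distance sort, that is consistent with the bounding-box optimization only applying to the ascending case.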

Related

Why is the total query time so much longer than the single shard time in Elasticsearch?

ES version: 7.3.2.
The total query time is much longer than the single shard time.
This problem only occurs when a piece of data has not been requested for a long time and is then requested again.
The problem also disappears when data is no longer written to the index in real time.
I would like to know how to troubleshoot this problem.
Thanks!
request:
GET friend_relation_realtime_v2/_search?human=true
{
"query": {
"bool": {
"filter": {
"term": {
"user_id": "544799000"
}
}
}
}
}
result:
{
"took" : 1277,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 233,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [
.........
]
},
"profile" : {
"shards" : [
{
"id" : "[2mYeMFE1RO2Uu2pi63sMNQ][friend_relation_realtime_v2][3]",
"searches" : [
{
"query" : [
{
"type" : "BoostQuery",
"description" : "(ConstantScore(user_id:544799000))^0.0",
"time" : "315.2micros",
"time_in_nanos" : 315291,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 4262,
"match" : 0,
"next_doc_count" : 19,
"score_count" : 19,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 10573,
"advance_count" : 3,
"score" : 1339,
"build_scorer_count" : 26,
"create_weight" : 5623,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 293426
},
"children" : [
{
"type" : "TermQuery",
"description" : "user_id:544799000",
"time" : "301.4micros",
"time_in_nanos" : 301455,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 1582,
"match" : 0,
"next_doc_count" : 19,
"score_count" : 0,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 9888,
"advance_count" : 3,
"score" : 0,
"build_scorer_count" : 26,
"create_weight" : 2994,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 286942
}
}
]
}
],
"rewrite_time" : 2381,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "19micros",
"time_in_nanos" : 19029,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "9.1micros",
"time_in_nanos" : 9134
}
]
}
]
}
],
"aggregations" : [ ]
},
{
"id" : "[2mYeMFE1RO2Uu2pi63sMNQ][friend_relation_realtime_v2][4]",
"searches" : [
{
"query" : [
{
"type" : "BoostQuery",
"description" : "(ConstantScore(user_id:544799000))^0.0",
"time" : "320.9micros",
"time_in_nanos" : 320910,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 4158,
"match" : 0,
"next_doc_count" : 24,
"score_count" : 24,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 9283,
"advance_count" : 2,
"score" : 1345,
"build_scorer_count" : 31,
"create_weight" : 10394,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 295648
},
"children" : [
{
"type" : "TermQuery",
"description" : "user_id:544799000",
"time" : "298.3micros",
"time_in_nanos" : 298395,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 1811,
"match" : 0,
"next_doc_count" : 24,
"score_count" : 0,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 8764,
"advance_count" : 2,
"score" : 0,
"build_scorer_count" : 31,
"create_weight" : 3754,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 284008
}
}
]
}
],
"rewrite_time" : 4100,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "33.7micros",
"time_in_nanos" : 33781,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "10.2micros",
"time_in_nanos" : 10214
}
]
}
]
}
],
"aggregations" : [ ]
},
{
"id" : "[I0cNQW50Q3C_kt28KVSVsQ][friend_relation_realtime_v2][1]",
"searches" : [
{
"query" : [
{
"type" : "BoostQuery",
"description" : "(ConstantScore(user_id:544799000))^0.0",
"time" : "384.6micros",
"time_in_nanos" : 384608,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 5840,
"match" : 0,
"next_doc_count" : 33,
"score_count" : 31,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 27664,
"advance_count" : 4,
"score" : 1749,
"build_scorer_count" : 26,
"create_weight" : 19208,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 330052
},
"children" : [
{
"type" : "TermQuery",
"description" : "user_id:544799000",
"time" : "338.5micros",
"time_in_nanos" : 338550,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 2227,
"match" : 0,
"next_doc_count" : 33,
"score_count" : 0,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 24780,
"advance_count" : 4,
"score" : 0,
"build_scorer_count" : 26,
"create_weight" : 3957,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 307522
}
}
]
}
],
"rewrite_time" : 7897,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "45.1micros",
"time_in_nanos" : 45124,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "22.1micros",
"time_in_nanos" : 22110
}
]
}
]
}
],
"aggregations" : [ ]
},
.............
]
}
}
When you run the profiler with your search, the total time of the query will be much higher than in a normal run. The best way to check the total time is to run the query without the profiler and compare the time on each node.
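For example (a sketch of that comparison), run the query once with profiling enabled:
GET friend_relation_realtime_v2/_search?human=true
{
  "profile": true,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "user_id": "544799000"
        }
      }
    }
  }
}
then remove the "profile": true line and run it again; the difference in the took values is the profiling overhead plus normal run-to-run variance.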

Elasticsearch: reducing the result to one column - return only 1 value for each document

I am trying to reduce the JSON result of Elasticsearch to only the column or columns I asked for. Is there any way?
When I use the following command, the result is nested inside "_source":
{
"from": "0", "size":"2",
"_source":["id"],
"query": {
"match_all": {}
}
}
which I don't need for my use case.
I get this result:
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "indexer_1",
"_type" : "type_indexer_1",
"_id" : "38142",
"_score" : 1.0,
"_source" : {
"id" : 38142
}
},
{
"_index" : "indexer_1",
"_type" : "type_indexer_1",
"_id" : "38147",
"_score" : 1.0,
"_source" : {
"id" : 38147
}
}
]
}
}
What I would like to have:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 1.0,
"hits" : [
{
"id" : 38142
},
{
"id" : 38147
}
]
}
}
And this JSON result would be even better:
{
{
"id" : 38142
},
{
"id" : 38147
}
}
Is there any way out of the box in ES to reduce the result set?
You can filter the output JSON; look at the documentation on response filtering:
GET /index/_search?filter_path=hits.hits._id
{
"from": "0",
"size":"2",
"_source":["id"],
"query": {
"match_all": {}
}
}
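If what you actually want are the id values stored in _source rather than the _id metadata field (an assumption on my part), you can point filter_path at _source instead; everything else stays the same:
GET /index/_search?filter_path=hits.hits._source
{
  "from": "0",
  "size": "2",
  "_source": ["id"],
  "query": {
    "match_all": {}
  }
}
This returns a hits.hits array whose entries contain only the _source object, i.e. { "_source": { "id": 38142 } }, which is as close to your desired output as the search API gets out of the box.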

Elasticsearch match_phrase query with a single token

So I am trying to understand how the match_phrase query works under certain circumstances with Elasticsearch (we have version 6.8 set up as of now). When I give it a string with multiple tokens, profiling shows it runs a phrase query; but when I run it with a single token, profiling shows it runs a TermQuery internally. I am trying to understand: shouldn't this be independent of the input, and fail to return a match if the positioning of the terms is not correct? Attaching queries and output -
Query with multiple tokens -
GET potato_testv3/_search
{"profile": "true",
"query": {
"bool": {
"must": [
{ "match_phrase": { "skill_set": {"query":"potato farmer"} }}
]
}
}
}
Output of the above -
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "potato_testv3",
"_type" : "recruiterinsightsv11",
"_id" : "4RShdnkBc8OOeUFVkncD",
"_score" : 0.5753642,
"_source" : {
"skill_set" : [
"silly webdriver",
"uft",
"uft/qtp",
"potato farmer"
]
}
}
]
},
"profile" : {
"shards" : [
{
"id" : "[5QVxJbTCSU-ruYT9EHsujA][potato_testv3][0]",
"searches" : [
{
"query" : [
{
"type" : "PhraseQuery",
"description" : """skill_set:"potato farmer"""",
"time_in_nanos" : 338986,
"breakdown" : {
"score" : 15362,
"build_scorer_count" : 2,
"match_count" : 1,
"create_weight" : 55661,
"next_doc" : 74248,
"match" : 39624,
"create_weight_count" : 1,
"next_doc_count" : 2,
"score_count" : 1,
"build_scorer" : 154084,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 3932,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 48431,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 19840
}
]
}
]
}
],
"aggregations" : [ ]
}
]
}
}
Query with single token -
GET potato_testv3/_search
{"profile": "true",
"query": {
"bool": {
"must": [
{ "match_phrase": { "skill_set": {"query":"potato"} }}
]
}
}
}
Output of above -
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "potato_testv3",
"_type" : "recruiterinsightsv11",
"_id" : "4RShdnkBc8OOeUFVkncD",
"_score" : 0.2876821,
"_source" : {
"skill_set" : [
"silly webdriver",
"uft",
"uft/qtp",
"potato farmer"
]
}
}
]
},
"profile" : {
"shards" : [
{
"id" : "[TeKxvYLJQfG_GVtD3bmpiw][potato_testv3][0]",
"searches" : [
{
"query" : [
{
"type" : "TermQuery",
"description" : "skill_set:potato",
"time_in_nanos" : 52214,
"breakdown" : {
"score" : 11310,
"build_scorer_count" : 2,
"match_count" : 0,
"create_weight" : 30974,
"next_doc" : 1314,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 2,
"score_count" : 1,
"build_scorer" : 8610,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 3761,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 20912,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 15758
}
]
}
]
}
],
"aggregations" : [ ]
}
]
}
}
In case it helps, here is the schema of the index used -
{
"potato_testv3" : {
"mappings" : {
"recruiterinsightsv11" : {
"dynamic" : "false",
"properties" : {
"skill_set" : {
"type" : "text",
"norms" : false,
"fielddata" : true
}
}
}
}
}
}
You are executing the same match_phrase query, once with a search string made up of multiple terms and once with a search string consisting of a single token.
When executing an Elasticsearch query, Elasticsearch will optimise the query and translate it into the relevant queries at the Lucene level. A phrase query is more expensive to execute because
all terms of the search string need to match, and on top of that
the positions of the terms in a matching document need to be in the very same order as in the search string.
If your search string only consists of a single term, Elasticsearch can skip all of that extra effort and simply query for documents matching that single search term. What you observe therefore makes perfect sense: it shows you how Elasticsearch optimises the query while executing it.
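You can observe the same rewrite from the other direction (a sketch, not run against your index): a plain match query with the same single token should profile to an identical TermQuery, because there is no phrase information left to verify:
GET potato_testv3/_search
{
  "profile": "true",
  "query": {
    "match": {
      "skill_set": "potato"
    }
  }
}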

Elasticsearch partitioned indices skipped versus match no docs query

We have indices that are partitioned by year, e.g.:
items-2019
items-2020
Consider the following data:
POST items-2019/_doc
{
"#timestamp": "2019-01-01"
}
POST items-2020/_doc
{
"#timestamp": "2020-01-01"
}
POST /_aliases
{
"actions": [
{
"add": {
"index": "items-*",
"alias": "items"
}
}
]
}
Now when I query the data and explicitly sort the results, it skips the items-2020 shard:
GET items/_search
{
"query": {
"range": {
"#timestamp": {
"lt": "2020-01-01"
}
}
},
"sort": {
"#timestamp": "desc"
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 1, <--- skipped
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "items-2019",
"_type" : "_doc",
"_id" : "BTdSb3UBRFH0Yqe1vm_W",
"_score" : null,
"_source" : {
"#timestamp" : "2019-01-01"
},
"sort" : [
1546300800000
]
}
]
}
}
However, when I don't sort the results explicitly, it doesn't skip the shard; instead ES issues a MatchNoDocsQuery:
GET items/_search
{
"profile": "true",
"query": {
"range": {
"#timestamp": {
"lt": "2020-01-01"
}
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0, <--- nothing skipped
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "items-2019",
"_type" : "_doc",
"_id" : "BTdSb3UBRFH0Yqe1vm_W",
"_score" : 1.0,
"_source" : {
"#timestamp" : "2019-01-01"
}
}
]
},
"profile" : {
"shards" : [
{
"id" : "[Axyv60mYQEGAREa2TwbgMQ][items-2019][0]",
"searches" : [
{
"query" : [
{
"type" : "ConstantScoreQuery",
"description" : "ConstantScore(DocValuesFieldExistsQuery [field=#timestamp])",
"time_in_nanos" : 69525,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 3766,
"match" : 0,
"next_doc_count" : 1,
"score_count" : 1,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 4123,
"advance_count" : 1,
"score" : 1123,
"build_scorer_count" : 2,
"create_weight" : 29745,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 30768
},
"children" : [
{
"type" : "DocValuesFieldExistsQuery",
"description" : "DocValuesFieldExistsQuery [field=#timestamp]",
"time_in_nanos" : 18317,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 1474,
"match" : 0,
"next_doc_count" : 1,
"score_count" : 0,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 1541,
"advance_count" : 1,
"score" : 0,
"build_scorer_count" : 2,
"create_weight" : 1184,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 14118
}
}
]
}
],
"rewrite_time" : 4660,
"collector" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 22374
}
]
}
],
"aggregations" : [ ]
},
{
"id" : "[Axyv60mYQEGAREa2TwbgMQ][items-2020][0]",
"searches" : [
{
"query" : [
{
"type" : "MatchNoDocsQuery",
"description" : """MatchNoDocsQuery("User requested "match_none" query.")""", <-- here
"time_in_nanos" : 4166,
"breakdown" : {
"set_min_competitive_score_count" : 0,
"match_count" : 0,
"shallow_advance_count" : 0,
"set_min_competitive_score" : 0,
"next_doc" : 0,
"match" : 0,
"next_doc_count" : 0,
"score_count" : 0,
"compute_max_score_count" : 0,
"compute_max_score" : 0,
"advance" : 0,
"advance_count" : 0,
"score" : 0,
"build_scorer_count" : 1,
"create_weight" : 1791,
"shallow_advance" : 0,
"create_weight_count" : 1,
"build_scorer" : 2375
}
}
],
"rewrite_time" : 4353,
"collector" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 12887
}
]
}
],
"aggregations" : [ ]
}
]
}
}
So there are a couple of questions here:
Does skipping truly skip shards?
How are skipped shards and MatchNoDocsQuery different?
What's the cost of MatchNoDocsQuery?
How does sorting allow shards to be skipped?
If we sort results, do we really completely skip shards and not even touch them during search?
That's a great deal of questions bundled into one, but here's my attempt:
Does skipping truly skip shards?
How does sorting allow shards to be skipped?
If we sort results, do we really completely skip shards and not even touch them during search?
Yes, ES tries to be smart enough to figure out which shards to hit before actually sending the query to those shards. The _search_shards API helps here, but it is not the only mechanism, as can be seen from the explanation in this issue.
If you search issues for the keywords can_match, skip and shard you'll find plenty of other optimizations implemented all over the place that aim at making ES execution plan smarter and faster.
If you want to see how this is coded, you can start in the SearchService.canMatch() method. That's where the service can decide whether the query can be rewritten to MatchNoDocsQuery. If you add a suggest or global aggregation (which must visit all documents no matter what), you'll see that shards are not skipped any more, even with the sort present.
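As a sketch of that experiment (the aggregation names are made up), adding a global aggregation to the sorted search from the question should bring skipped back to 0, because every shard has to be visited to compute it:
GET items/_search
{
  "query": {
    "range": {
      "#timestamp": {
        "lt": "2020-01-01"
      }
    }
  },
  "sort": {
    "#timestamp": "desc"
  },
  "aggs": {
    "all_docs": {
      "global": {},
      "aggs": {
        "count": {
          "value_count": {
            "field": "#timestamp"
          }
        }
      }
    }
  }
}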
What's the cost of MatchNoDocsQuery?
I wouldn't worry about it, as it's not only negligible, but out of your hands.
How does sorting allow shards to be skipped?
As explained in issue #51852 linked above: "This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard." In other words, ES is smart enough to know which shards will or won't contain valid hits depending on the sort value. In your case, since the sort on the timestamp excludes all values from 2020, ES knows that the shard(s) from the 2020 index can be excluded since none will match.
Another possibility is to leverage index sorting so that terms are sorted at indexing time. Terms are sorted in each segment of the index, but every time segments are merged the new merged set of terms needs to be re-sorted, so this can have performance implications. Test before use!
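For reference, index sorting has to be configured in the index settings at creation time and cannot be added to an existing index. A minimal sketch (the new index name is hypothetical, the field comes from your data):
PUT items-2021
{
  "settings": {
    "index.sort.field": "#timestamp",
    "index.sort.order": "desc"
  },
  "mappings": {
    "properties": {
      "#timestamp": {
        "type": "date"
      }
    }
  }
}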

Elasticsearch query spends all its time in create_weight

I have two Elasticsearch clusters with the same indexes and the same data on each cluster.
The same simple query takes milliseconds on cluster A but more than 10 seconds on cluster B.
I used the profile API, and on cluster B I can see that Elasticsearch spends a huge amount of time on the create_weight operation.
{
"id" : "[dj3LJZL1RNuPEP7S0ZXFVQ][index_2018_12][3]",
"searches" : [
{
"query" : [
{
"type" : "TermQuery",
"description" : "n:8096344531",
"time" : "441.2ms",
"time_in_nanos" : 441271696,
"breakdown" : {
"score" : 0,
"build_scorer_count" : 20,
"match_count" : 0,
"create_weight" : 441255457,
"next_doc" : 0,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 0,
"score_count" : 0,
"build_scorer" : 16218,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 3967,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "79.4micros",
"time_in_nanos" : 79420,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "42.1micros",
"time_in_nanos" : 42166
}
]
}
]
}
],
"aggregations" : [ ]
},
{
"id" : "[dj3LJZL1RNuPEP7S0ZXFVQ][index_2018_12][4]",
"searches" : [
{
"query" : [
{
"type" : "TermQuery",
"description" : "n:8096344531",
"time" : "296.7ms",
"time_in_nanos" : 296795143,
"breakdown" : {
"score" : 0,
"build_scorer_count" : 15,
"match_count" : 0,
"create_weight" : 296779276,
"next_doc" : 0,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 0,
"score_count" : 0,
"build_scorer" : 15851,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 2947,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "54.7micros",
"time_in_nanos" : 54776,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "18.6micros",
"time_in_nanos" : 18642
}
]
}
]
}
],
"aggregations" : [ ]
},
{
"id" : "[dj3LJZL1RNuPEP7S0ZXFVQ][index_2019_01][3]",
"searches" : [
{
"query" : [
{
"type" : "TermQuery",
"description" : "n:8096344531",
"time" : "173.2ms",
"time_in_nanos" : 173260750,
"breakdown" : {
"score" : 0,
"build_scorer_count" : 17,
"match_count" : 0,
"create_weight" : 173247380,
"next_doc" : 0,
"match" : 0,
"create_weight_count" : 1,
"next_doc_count" : 0,
"score_count" : 0,
"build_scorer" : 13352,
"advance" : 0,
"advance_count" : 0
}
}
],
"rewrite_time" : 4606,
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time" : "47.5micros",
"time_in_nanos" : 47584,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time" : "15.8micros",
"time_in_nanos" : 15809
}
]
}
]
}
],
"aggregations" : [ ]
},
...
On cluster B the TermQuery takes between 100 ms and 500 ms on each shard, whereas it takes only a few microseconds on cluster A.
What can I do to fix this?
I solved my own problem, so I am posting the solution here!
Cluster B's indexes were created by restoring a snapshot of cluster A's indexes (that's why I have exactly the same data in each cluster). I think that's why the indexes were so heavily segmented.
To solve the slowness issue I had to run a force merge on each index:
POST /index_*/_forcemerge?max_num_segments=1
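To see how fragmented the restored indices were and to verify the result of the force merge (a sketch using the cat segments API, which prints one row per segment), you can run:
GET _cat/segments/index_*?v
Before the merge you should see many segments per shard; afterwards each shard should be down to a single segment. As a rule of thumb, only force merge indices that are no longer being written to, since the operation is expensive and new writes will create fresh segments again.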
