I'm having an issue with a Kibana dashboard, which displays multiple "Courier Fetch: xxx of 345 shards failed." warning messages every time I reload it.
I'm only asking for data from the last 15 minutes, and I have one index per day. There is no way today's index contains 345 shards, so why does my query span so many shards?
Things I have checked:
Number of indices and shards per index:
I checked this using the _cat/indices endpoint: after filtering out indices I didn't create myself (such as Kibana's indices, basically everything that starts with a dot), I have 69 indices, each containing 5 shards (adding up to a total of 345 shards). That's what I was expecting.
This basically means that my search is executed on all of my indices.
I'm not writing new data to old indices:
Here is a query for the last hour's records on today's index1:
GET 20181027_logs/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1543326215000,
"lte": 1543329815000,
"format": "epoch_millis"
}
}
}
]
}
}
}
Answer (truncated):
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1557,
Same query without restricting the index:
GET *_logs/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1543326215000,
"lte": 1543329815000,
"format": "epoch_millis"
}
}
}
]
}
}
}
Answer (truncated):
{
"took": 24,
"timed_out": false,
"_shards": {
"total": 345,
"successful": 345,
"failed": 0
},
"hits": {
"total": 1557,
We can see that the second query returns exactly the same results as the first one, but searches through every index.
My timestamp field is indexed:
By default, every field in Elasticsearch is indexed, but I still double-checked it:
GET 20181027_logs/_mapping
{
"20181027_logs": {
"mappings": {
"logs": {
"properties": {
[…]
"timestamp": {
"type": "date"
}
[…]
While a non-indexed field would give2:
"timestamp": {
"type": "date",
"index": false
}
Remaining leads
At this point, I have really no idea what could be the issue.
Just as a side note: the timestamp field is not the insertion date of the event, but the date at which the event actually happened. Regardless of this timestamp, the events are inserted in the latest index.
This means that every index can have events corresponding to past dates, but no future dates.
In this precise case, I don't see how this could matter : since we're only querying for the last 15 minutes, the data can only be in the last index no matter what happens.
Elasticsearch and Kibana version : 5.4.3
Thanks for reading this far, and any help would be greatly appreciated!
1: There's a mistake in index naming, causing an offset between the index name and the actual corresponding date, but it should not matter here.
2: This was checked on another Elastic cluster of the same version, with some fields explicitly opted out of indexing.
TL;DR
I finally solved the issue simply by reducing the number of shards.
Full disclosure
When using the dev tools in Kibana, I could find many errors on the _msearch endpoint:
{
"shard": 2,
"index": "20180909_logs",
"node": "FCv8yvbyRhC9EPGLcT_k2w",
"reason": {
"type": "es_rejected_execution_exception",
"reason": "rejected execution of org.elasticsearch.transport.TransportService$7#754fe283 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#16a14433[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 16646]]"
}
},
Which basically proves that I'm flooding my ES server with too many parallel requests on too many shards.
From what I could understand, it's apparently normal for Kibana to query every single index of my index pattern, even if some of them don't contain any fresh data (ES is supposed to query them anyway, and conclude almost instantly that they contain no matching data, since the timestamp field is indexed).
From there, I had a few options:
1: Reduce the data retention
2: Reduce the number of parallel requests I am doing
3: Add nodes to my cluster
4: Restructure my data to use fewer shards
5: Increase the size of the search queue
1 and 2 are not an option in my case.
5 would probably work, but is apparently strongly discouraged (from what I could understand, in most cases this error is only the symptom of deeper issues that should be fixed instead).
This is a 160GB single-node cluster, with (now) more than 350 shards. This makes an extremely low average size per shard, so I decided to first try number 4 : Reindex my data to use fewer shards.
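To put a number on "extremely low", a quick back-of-the-envelope check (160 GB and 350 shards are this cluster's figures from above; the tens-of-GB-per-shard target is general Elastic sizing advice, not something stated in this post):

```python
def avg_shard_size_gb(total_gb: float, shard_count: int) -> float:
    """Average data per shard. Elastic's sizing guidance is generally
    on the order of tens of GB per shard, far above this value."""
    return total_gb / shard_count

# 160 GB spread over 350 shards: well under 1 GB per shard
print(round(avg_shard_size_gb(160, 350), 2))
```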
How I did it
Use a single shard per index:
I created the following index template:
PUT _template/logs
{
"template": "*_logs",
"settings": {
"number_of_shards": 1
}
}
Now, all my future indices will have a single shard.
I still need to reindex or merge the existing indices, but this has to be done with the next point anyway.
Switch to monthly indices (instead of daily)
I modified the code that inserts data into ES to use a month-based index name (such as 201901_monthly_logs), and then reindexed every old index to the corresponding one in the new pattern:
POST _reindex
{
"source": {
"index": "20181024_logs"
},
"dest": {
"index": "201810_monthly_logs"
}
}
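Driving one such _reindex call per old index is easy to script; here is a minimal Python sketch of the name mapping, assuming the YYYYMMDD_logs / YYYYMM_monthly_logs naming scheme from this post (the function names are mine):

```python
def monthly_target(daily_index: str) -> str:
    """Map a daily index name like '20181024_logs' to its
    monthly counterpart, e.g. '201810_monthly_logs'."""
    date_part = daily_index.split("_")[0]   # '20181024'
    return f"{date_part[:6]}_monthly_logs"  # keep only YYYYMM

def reindex_body(daily_index: str) -> dict:
    """Build the _reindex request body for one daily index."""
    return {
        "source": {"index": daily_index},
        "dest": {"index": monthly_target(daily_index)},
    }
```

Each body would then be POSTed to _reindex with whatever HTTP client you prefer.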
Enjoy!
This being done, I was down to 7 indices (and 7 shards as well).
All that was left was changing the index pattern from *_logs to *_monthly_logs in my Kibana visualisations.
I haven't had any issues since. I'll wait a bit longer, then delete my old indices.
Related
I have a cluster with thousands of indices, with 5 shards per index.
I would like to reindex them with only 1 shard per index.
Is there a built-in solution in Elastic to reindex, for instance, all the indices, adding "-reindexed" to each index name?
Looks like you want to dynamically change the index names while reindexing.
Let's understand this with an example:
1) Add some indices:
POST sample/_doc/1
{
"test" : "sample"
}
POST sample1/_doc/1
{
"test" : "sample"
}
POST sample2/_doc/1
{
"test" : "sample"
}
2) Use Reindex API to dynamically change the index names while reindexing multiple indices:
POST _reindex
{
"source": {
"index": "sample*"
},
"dest": {
"index": ""
},
"script": {
"inline": "ctx._index = ctx._index + '-reindexed'"
}
}
The above request will reindex all the indices starting with sample and add -reindexed to their index names. That means sample, sample1 and sample2 will be reindexed as sample-reindexed, sample1-reindexed and sample2-reindexed, all with this one request.
In order to set up the destination indices with one shard, you need to create those indices before reindexing.
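That pre-creation step can be scripted; a small Python sketch that only builds the create-index requests (the -reindexed suffix matches the example above; sending the requests is left to your HTTP client of choice):

```python
def create_dest_requests(source_indices, suffix="-reindexed"):
    """Yield (method, path, body) tuples that pre-create each
    destination index with a single primary shard, so the scripted
    _reindex writes into 1-shard indices."""
    for idx in source_indices:
        yield ("PUT", f"/{idx}{suffix}",
               {"settings": {"number_of_shards": 1}})
```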
Hope that helps.
You could do a simple reindex but I'd also recommend you take a look at the Shrink Index API:
https://www.elastic.co/guide/en/elasticsearch/reference/7.0/indices-shrink-index.html
The documentation above links to v7.0, but this API has been around for many versions.
In your example, you would do something similar to the following:
First, reallocate copies of all primary or replica shards to a single node and prevent any future write access while the shrink operation is performed.
PUT /my_source_index/_settings
{
"settings": {
"index.routing.allocation.require._name": "shrink_node_name",
"index.blocks.write": true
}
}
Initiate the shrink operation, clear the index settings set in the previous command, and update your primary and replica settings on the target index:
POST my_source_index/_shrink/my_target_index-reindexed
{
"settings": {
"index.routing.allocation.require._name": null,
"index.blocks.write": null,
"index.number_of_replicas": 1,
"index.number_of_shards": 1,
"index.codec": "best_compression"
}
}
Note that the above also allocates a replica shard; if you don't want this, set index.number_of_replicas to 0.
You would want to set up a script of some sort to iterate through the list of source indices one by one.
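A sketch of such a loop, reusing the settings from the two requests above with a hypothetical -reindexed target suffix (it only builds the request sequence; sending them is left to your HTTP client):

```python
def shrink_requests(source_index, node_name="shrink_node_name"):
    """Yield the two requests needed to shrink one index:
    1) pin all shard copies to one node and block writes,
    2) shrink into a single-shard target, clearing the
       temporary settings on the target index."""
    target = f"{source_index}-reindexed"
    yield ("PUT", f"/{source_index}/_settings", {
        "settings": {
            "index.routing.allocation.require._name": node_name,
            "index.blocks.write": True,
        },
    })
    yield ("POST", f"/{source_index}/_shrink/{target}", {
        "settings": {
            "index.routing.allocation.require._name": None,
            "index.blocks.write": None,
            "index.number_of_replicas": 1,
            "index.number_of_shards": 1,
            "index.codec": "best_compression",
        },
    })

# Iterate through the list of source indices one by one:
for idx in ["20180909_logs", "20180910_logs"]:
    for method, path, body in shrink_requests(idx):
        pass  # send each request with your HTTP client of choice
```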
My company is using Elasticsearch 2.3.4.
We have a cluster of 38 ES nodes, and we've been having a problem with reindexing some of our data lately...
We've reindexed very large indices before with no problems, but recently, when trying to reindex much smaller indices (less than 10 GB), we get: "SearchContextMissingException [No search context found for id [XXX]]".
We have no idea what's causing this problem or how to fix it, and we'd like some guidance.
Has anyone seen this exception before?
From GitHub comments on issues related to this, I think it can be avoided by changing the batch size.
From the documentation:
By default _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:
POST _reindex
{
"source": {
"index": "source",
"size": 100
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
I had the same problem with an index that holds many huge documents, and I had to reduce the batch size down to 10 (neither 100 nor 50 worked).
This was the request that worked in the end:
POST _reindex?slices=5&refresh
{
"source": {
"index": "source_index",
"size": 10
},
"dest": {
"index": "dest_index"
}
}
You should also set slices to the number of shards in your index.
I have created an Elasticsearch cluster with 3 nodes, with 3 shards and 2 replicas.
The same query fetches different results when run against the same index with the same data.
Right now the results are sorted by the _score field in descending order (I think that's the default sort), and the requirement is also that results be sorted in descending order of score.
So my question is: why does the same query yield different results, and how can this be corrected so that the same query returns the same results every time?
Query attached:
{
"from": 0,
"size": 10,
"query": {
"bool": {
"must": {
"bool": {
"must": {
"terms": {
"context": [
"my name"
]
}
},
"should": {
"multi_match": {
"query": "test",
"fields": [
"field1^2",
"field2^2",
"field3^3"
]
}
},
"minimum_should_match": "1"
}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"audiencecomb": [
"1235"
]
}
},
{
"terms": {
"consumablestatus": [
"1"
]
}
}
],
"minimum_should_match": "1"
}
}
}
}
}
One possible reason is distributed IDF: by default, Elasticsearch uses the local IDF on each shard to save some performance, which can lead to different IDFs across the cluster. So you should try ?search_type=dfs_query_then_fetch, which explicitly asks Elasticsearch to compute the global IDF.
However, for performance reasons, Elasticsearch doesn’t calculate the
IDF across all documents in the index. Instead, each shard calculates
a local IDF for the documents contained in that shard.
Because our documents are well distributed, the IDF for both shards
will be the same. Now imagine instead that five of the foo documents
are on shard 1, and the sixth document is on shard 2. In this
scenario, the term foo is very common on one shard (and so of little
importance), but rare on the other shard (and so much more important).
These differences in IDF can produce incorrect results.
In practice, this is not a problem. The differences between local and
global IDF diminish the more documents that you add to the index. With
real-world volumes of data, the local IDFs soon even out. The problem
is not that relevance is broken but that there is too little data.
For testing purposes, there are two ways we can work around this
issue. The first is to create an index with one primary shard, as we
did in the section introducing the match query. If you have only one
shard, then the local IDF is the global IDF.
The second workaround is to add ?search_type=dfs_query_then_fetch to
your search requests. The dfs stands for Distributed Frequency Search,
and it tells Elasticsearch to first retrieve the local IDF from each
shard in order to calculate the global IDF across the whole index.
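As a concrete illustration, the parameter just goes on the search URL; a small Python sketch of building it (host and index names are placeholders, and the function is mine):

```python
from urllib.parse import urlencode

def search_url(host, index, dfs=True):
    """Build a _search URL, optionally adding
    search_type=dfs_query_then_fetch to force a global IDF."""
    params = urlencode({"search_type": "dfs_query_then_fetch"}) if dfs else ""
    return f"{host}/{index}/_search" + (f"?{params}" if params else "")
```

The request body (the bool query from the question) stays unchanged; only the query string differs.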
For more information take a look here
I want to know the time taken by a count query in Elasticsearch, just like the search query, whose response contains took (the time taken).
My query looks like:
curl -XGET "http://localhost:9200/index1/type1/_count"
And the result for that query:
{
"count": 136,
"_shards": {
"total": 15,
"successful": 15,
"failed": 0
}
}
Is there any way I can get the time taken for the count query, just like with the search API?
Documentation for the count API: Count API
At the time of writing this answer, it's still not supported by Elastic. I have raised a feature request and will most likely work on adding support for it.
A trick that can help with that is to use _search
with size zero (so no results will be returned);
track_total_hits set to true (so it will count all hits, not only the ones in the result window); and
filter_path equal to took,hits.total.value.
For example, I executed the following query in a cluster of mine...
GET viagens-*/_search?filter_path=took,hits.total.value
{
"size": 0,
"track_total_hits": true,
"query": {
"match_all": {}
}
}
...and got this result:
{
"took": 2,
"hits": {
"total": {
"value": 2589552
}
}
}
It does not profile the count API itself, unfortunately, but has a similar result. It can be a very useful alternative in some situations!
My use case is as follows:
Execute a search against Products and boost each product's score by its salesRank relative to the other documents in the results. The top 10% of sellers should be boosted by a factor of 1.5 and the top 25-10% should be boosted by a factor of 1.25. The percentiles are calculated on the results of the query, not the entire data set. This feature is being used for on-the-fly instant results as the user types, so single-character queries would still return results.
So for example, if I search for "Widget" and get back 100 results, the top 10 sellers returned will get boosted by 1.5 and the top 10-25 will get boosted by 1.25.
I immediately thought of using the percentiles aggregation feature to calculate the 75th and 90th percentiles of the result set.
POST /catalog/product/_search?_source_include=name,salesRank
{
"query": {
"match_phrase_prefix": {
"name": "N"
}
},
"aggs": {
"sales_rank_percentiles": {
"percentiles": {
"field" : "salesRank",
"percents" : [75, 90]
}
}
}
}
This gets me the following:
{
"hits": {
"total": 142,
"max_score": 1.6653868,
"hits": [
{
"_score": 1.6653868,
"_source": {
"name": "nylon",
"salesRank": 46
}
},
{
"_score": 1.6643861,
"_source": {
"name": "neon",
"salesRank": 358
}
},
..... <SNIP> .....
]
},
"aggregations": {
"sales_rank_percentiles": {
"values": {
"75.0": 83.25,
"90.0": 304
}
}
}
}
So great, that gives me the results and the percentiles. But I would like to boost "neon" above "nylon" because it's a top 10% seller in the results (note: in our system, the salesRank value is descending in precedence, higher value = more sales). The text relevancy is very low since only one character was supplied, so sales rank should have a big effect.
It seems that a function score query could be used here, but all of the examples in the documentation use doc[] to access values from the document. There aren't any examples that use other information from the top level of the response, e.g. "aggs" {}. I would basically like to boost a document if its sales rank falls within the 100-90th or 89th-75th percentiles, by 1.5 and 1.25 respectively.
Is this something Elasticsearch supports or am I going to have to roll my own with a custom script or plugin? Or try a different approach entirely? My preference would be to pre-calculate percentiles, index them, and do a constant score boost, but stakeholder prefers the run-time calculation.
I'm using Elasticsearch 1.2.0.
What if you keep sellers as parent documents and periodically update their stars (and some boosting factor), say via some worker? Then you could match products using a has_parent query, and use a combination of score mode and custom score query to match top products from top sellers.
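For what it's worth, the two-pass approach from the question (run the percentiles aggregation first, then apply the boost at query or client side) boils down to a small threshold function. A sketch, where p75 and p90 would come from the aggregation response shown above and the 1.5/1.25 factors are the ones the question specifies (the function name is mine):

```python
def sales_boost(sales_rank: float, p75: float, p90: float) -> float:
    """Boost factor for a document given the 75th/90th salesRank
    percentiles of the current result set (higher rank = more
    sales, as in the question)."""
    if sales_rank >= p90:   # top 10% of sellers in the results
        return 1.5
    if sales_rank >= p75:   # top 25-10%
        return 1.25
    return 1.0
```

With the percentiles from the example response (75.0: 83.25, 90.0: 304), "neon" (salesRank 358) would get a 1.5x boost while "nylon" (salesRank 46) would get none.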