Frequent circuit breaking exceptions occurring in ES cluster v7.3.1? - elasticsearch

We are using ES v7.3.1, and for the past few days the shards in our ES cluster have been getting unassigned because of a circuit breaking exception. I am not able to understand the exact reason that leads to this exception; any help would be really appreciated.
This is the detailed info that I get using the command GET _cluster/allocation/explain:
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2021-02-12T08:05:40.154Z",
"failed_allocation_attempts" : 1,
"details" : "failed shard on node [WbHklo7iSf6jGj90cP9Y-A]: failed to perform indices:data/write/bulk[s] on replica [segment_index_573179789d2572f27bc73e6b][6], node[WbHklo7iSf6jGj90cP9Y-A], [R], s[STARTED], a[id=SNoWjhhYRXClfVqa6lsDAQ], failure RemoteTransportException[[ip-1-0-104-220][1.0.104.220:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [14311120802/13.3gb], which is larger than the limit of [14247644364/13.2gb], real usage: [14311004440/13.3gb], new bytes reserved: [116362/113.6kb], usages [request=0/0b, fielddata=808714408/771.2mb, in_flight_requests=15836072/15.1mb, accounting=683535018/651.8mb]]; ",
"last_allocation_status" : "no_attempt"
}

Your shard allocation explain API output provides all the details you need to resolve the issue:
indices:data/write/bulk is the operation tripping the circuit breaker.
WbHklo7iSf6jGj90cP9Y-A is the node on which the issue is happening.
It's happening on the replica shard of segment_index_573179789d2572f27bc73e6b.
Below are the actual limit and the usage figures that are tripping the circuit breaker.
The parent circuit breaker is tripping on your node: as the log line below shows, the parent limit is 13.2 GB, and with your bulk request the total estimated heap usage would be 13.3 GB, which pushes it over the limit.
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [14311120802/13.3gb], which is larger than the limit of [14247644364/13.2gb], real usage: [14311004440/13.3gb], new bytes reserved: [116362/113.6kb]
Solution
You can reduce the size of your bulk requests to avoid this circuit breaker.
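To see how close each node is running to its breaker limits before and after that change, you can also inspect the breaker statistics from the node stats API; the sketch below assumes the same localhost endpoint used elsewhere in this thread. The "parent" section shows the configured limit and the estimated usage per node.
curl -XGET "http://localhost:9200/_nodes/stats/breaker?pretty"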

Can you provide the result of cluster health?
curl -XGET "http://localhost:9200/_cluster/health?pretty"
Can you provide the JVM memory pressure metrics?
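If you don't have a monitoring dashboard handy, per-node heap usage can be pulled straight from the cluster (a sketch, assuming a local node):
curl -XGET "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"
curl -XGET "http://localhost:9200/_nodes/stats/jvm?pretty"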
As there is not much information, below I explain what happened to me in a similar circuit-breaking case and what I did about it.
I am fairly confident the bulk size itself is small; it is the sum of several things (search + bulk + the assignment process of new shards) that causes the memory to fill up.
In the cases I know of, "Data too large" refers to the JVM heap. Perhaps you have reached the memory limit, triggering the circuit breaker. I dare to think (with a high probability of being wrong :/ ) that your RAM per node is proportional to the configured circuit breaker percentage.
For example, if you have a circuit breaker set to 80% and it trips at 13.3 GB, you probably have 16 GB of RAM (100%) per node.
Depending on your circuit breaker configuration, you can increase the limit temporarily as a workaround. If your circuit breaker is already set high, it is not advisable to increase it further.
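For reference, the parent breaker limit is the dynamic cluster setting indices.breaker.total.limit; a temporary bump might look like the sketch below. The 90% value is only an illustration -- on 7.x with the real-memory breaker the default is already 95%, so check your current value first, and reset the setting to null afterwards.
curl -XPUT "http://localhost:9200/_cluster/settings" \
-H "Content-type: application/json" \
-d $'{
"transient" : {
"indices.breaker.total.limit" : "90%"
}
}'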
Maybe your bulk is creating multiple indices, and multiple indices create multiple shards. Too many shards fill up memory, because each shard consumes heap and file descriptors. There have been cases where the logic/pattern for creating new indices was wrong and generated many new indices (and therefore many shards) inadvertently, instead of creating just one index.
Maybe you were already at the limit without knowing it, and the problem only surfaced now that you created new indices (and therefore new shards).
How many shards do you have? curl -XGET "http://localhost:9200/_cat/shards" | wc -l
How many shards does each of your nodes hold?
When you hit the limit because of shard overhead, you can reduce the number of replica shards as a temporary workaround, in order to let the bulk complete.
An option that worked for me was temporarily disabling replicas for less important indices.
Check your shards: curl -XGET "http://localhost:9200/_cat/shards"
curl -X "PUT" "http://localhost:9200/mylessimportant_indices/_settings" \
-H "Content-type: application/json" \
-d $'{
"index" : {
"number_of_replicas":0
}
}' | jq

Related

Elasticsearch application latency investigation

We have an Elasticsearch setup with [data, master, client] nodes. The client receives only query traffic, passes the query to data nodes, gets the response, and sends it back to the caller (based on my general understanding).
We are seeing that the 'took' latency in the query response is around ~16ms, but the latency our application measures when calling into the client is around ~90ms. Here are some numbers on our setup:
ES Setup: 3 client nodes (60GB/3 cpu/30GB heap each), 3 data nodes (80GB/16 cpu/30GB heap each), 3 master nodes. It's a k8s-based Helm chart setup.
client/data nodes have enough cpu/mem (based on k8s pod-level cpu/mem usage)
QPS - 20 req/sec
Shard size ~ 24GB, 0 replicas. Each shard is on a separate data node. Indices are using mmapfs with preload "*"
Query types: bool query w/ 3 match clauses and 3 should clauses for boosting on a few fields. We have "_source=true".
Our documents are quite big, with (mean, p90, p99) sizes of (200kb, 400kb, 800kb)
Our response sizes are on the order of (mean, p99) = (164kb, 840kb). We also observed that latencies for bigger responses are much higher than the baseline.
Can someone comment on the following questions:
How can we know more exactly where this extra latency is introduced? Reading about "took" here, it covers the querying and response-forming stages, but something happens after that which makes our application-measured latency jump to ~90ms. Where else can I look to dig into this increase? I have access to a Prometheus ES dashboard and k8s pod usage, but they all look normal with no spikes.
Are there any ES settings we can play with to optimize this latency? We feel it's mostly due to the bigger response sizes. Can some compression be introduced in ES to help with this?
Your question is very broad, with very little information on your deployment, like the types of search queries, index mapping, cluster/node/index specifications, and QPS. It's generally very difficult to suggest anything without looking at the system that has the performance issues.
Regarding the client nodes: yes, they receive the traffic, but they also compute the global result from the local result sets received from each shard involved in the search request. So they do heavy processing and should have enough capacity, otherwise they become a bottleneck; even if your data nodes compute the local results quickly, processing at the client node would take more time and the overall time would increase.
You can also check whether you have room to apply some of the suggestions I wrote in this blog post.
Hope this helps.
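Regarding the compression question specifically, two things that often help with large responses (a sketch, not specific to this cluster): ask Elasticsearch for compressed HTTP responses, and trim _source down to the fields the application actually needs so less data crosses the wire. The index and field names below are made-up placeholders.
# elasticsearch.yml -- http.compression is enabled by default on recent versions
http.compression: true
# the client must request compression, e.g. with curl:
curl --compressed -XGET "http://localhost:9200/products/_search" \
-H "Content-type: application/json" \
-d $'{
"_source": ["name", "city", "price"],
"query": { "match_all": {} }
}'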

optimization on old indexes collecting logs from my apps

I have an Elastic cluster with 3 nodes (each 6 CPUs, 31GB heap, 64GB RAM) collecting 25GB of logs per day, but after 3 months I realized my dashboards have become very slow when checking stats for past weeks. Please advise whether there is an option to improve index read performance so it becomes faster when calculating my dashboard stats.
Thanks!
I would suggest you try to increase the number of shards.
When you have more shards, Elasticsearch splits your data across them, so it can run multiple parallel requests that each search a smaller slice of the data.
For the shard count, you could size it based on your heap memory.
No matter what actual JVM heap size you have, a commonly cited upper bound is about 20 shards per 1 GB of heap configured on the node; with your 31 GB heaps that works out to roughly 620 shards per node at most.
ElasticSearch - Optimal number of Shards per node
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
https://opster.com/elasticsearch-glossary/elasticsearch-choose-number-of-shards/
It seems that the amount of data you have accumulated and use for your dashboards is causing the performance problems.
A straightforward option is to increase your cluster's resources, but then you're bound to hit the same problem again, so you should rather rethink your data retention policy.
Chances are that you are really only interested in the most recent data. You need to answer the question of what "recent" means in your use case and simply discard anything older than that.
Elasticsearch has tools to automate this; look into Index Lifecycle Management (ILM).
What you probably need is to create an index template and apply a lifecycle policy to it. Elasticsearch will then handle automatic rollover of indices, eviction of old data, and even migration through data tiers in a hot-warm-cold architecture if you really want very long retention periods.
All this will lead to more predictable performance of your cluster.
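As a rough sketch of what that could look like, here is a minimal ILM policy plus a legacy index template that applies it; the policy name, index pattern, rollover size, and 30-day retention are illustrative assumptions, not recommendations for this cluster.
curl -XPUT "http://localhost:9200/_ilm/policy/logs_policy" \
-H "Content-type: application/json" \
-d $'{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_size": "25gb", "max_age": "1d" } } },
"delete": { "min_age": "30d", "actions": { "delete": {} } }
}
}
}'
curl -XPUT "http://localhost:9200/_template/logs_template" \
-H "Content-type: application/json" \
-d $'{
"index_patterns": ["applogs-*"],
"settings": {
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "applogs"
}
}'
The first index (e.g. applogs-000001) then has to be created with applogs as its write alias so that rollover has something to act on.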

Elasticsearch: QueueResizingEsThreadPoolExecutor exception

At some point during a 35k endurance load test of my Java web app, which fetches static data from Elasticsearch, I start getting the following Elasticsearch exception:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable#1a25fe82 on QueueResizingEsThreadPoolExecutor[name = search4/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 10.7ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor#6312a0bb[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 34575035]]
Elasticsearch details:
Elasticsearch version 6.2.4.
The cluster consists of 5 nodes. The JVM heap size setting for each node is Xms16g and Xmx16g. Each node machine has 16 processors.
NOTE: When I got this exception for the first time, I decided to increase the thread_pool.search.queue_size parameter in elasticsearch.yml and set it to 10000. Yes, I understand that I just postponed the problem.
Elasticsearch indices details:
Currently there are about 20 indices, and only 6 of them are being used. The unused ones are old indices that were not deleted after newer ones were created. The indices themselves are really small:
The index within the red rectangle is the one used by my web app. Its shard and replica settings are "number_of_shards": "5" and "number_of_replicas": "2" respectively.
Its shard details:
In this article I found that
Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
As you can see from the screenshot above, my shard size is much smaller than the mentioned sizes.
Hence, Q: what is the right number of shards in my case? Is it 1 or 2?
The index won't grow much over time.
ES queries issued during the test: the load test simulates a scenario where a user navigates to a page to search for products. The user can filter the products using corresponding filters (e.g. name, city, etc.). The unique filter values are fetched from the ES index using a composite query; this is the first query type. The other query fetches products from ES; it consists of must, must_not, filter, and has_child queries, with the size attribute set to 100.
I set up slow search logging, but nothing has been logged:
"index": {
"search": {
"slowlog": {
"level": "info",
"threshold": {
"fetch": {
"debug": "500ms",
"info": "500ms"
},
"query": {
"debug": "2s",
"info": "1s"
}
}
}
}
I feel like I am missing something simple that would finally let the cluster handle my load. I'd appreciate it if anyone could help me solve this issue.
For such a small index you are using 5 primary shards, which I suspect is because of your ES version 6.x (where the default was 5) and you never changed it. In short, having a high number of primary shards for a small index carries a severe performance penalty; please refer to a very similar use-case (I also had 5 primary shards 😀) which I covered in my blog.
As you already mentioned that your index size will not grow significantly in the future, I would suggest having 1 primary shard and 4 replica shards.
1 primary shard means that for a single search only one thread and one request is created in Elasticsearch, which provides better utilisation of resources.
As you have 5 data nodes, having 4 replicas means the shard copies are distributed across every data node, so your throughput and performance will be optimal.
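Since the number of primary shards cannot be changed on an existing index, getting to 1 primary / 4 replicas usually means reindexing into a new index (or using the _shrink API). A minimal sketch with placeholder index names:
curl -XPUT "http://localhost:9200/products_v2" \
-H "Content-type: application/json" \
-d $'{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 4
}
}'
curl -XPOST "http://localhost:9200/_reindex" \
-H "Content-type: application/json" \
-d $'{
"source": { "index": "products_v1" },
"dest": { "index": "products_v2" }
}'
After verifying that the document counts match, point your application (or an alias) at the new index.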
After this change, measure the performance; I am sure you can then reduce the search queue size back to 1k, since, as you know, a high queue size just delays the problem and does not address the issue at hand.
Coming to your search slow log: I feel your thresholds are very high. For the query phase, 1 second per query is really high for a user-facing application; try lowering it to ~100ms, note down the slow queries, and optimize them further.
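Lowering the thresholds is a dynamic index-settings change, for example (using the placeholder index name from the sketch above; the 100ms/200ms values are just illustrative starting points):
curl -XPUT "http://localhost:9200/products_v2/_settings" \
-H "Content-type: application/json" \
-d $'{
"index.search.slowlog.threshold.query.info": "100ms",
"index.search.slowlog.threshold.fetch.info": "200ms"
}'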

Elasticsearch: High CPU usage by Lucene Merge Thread

I have an ES 2.4.1 cluster with 3 master and 18 data nodes which collects log data, with a new index being created every day. In a day the index size grows to about 2TB. Indices older than 7 days get deleted. Very few searches are performed on the cluster, so the main goal is to increase indexing throughput.
I see a lot of the following exceptions, which is another symptom of what I am going to describe next:
EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#5a7d8a24 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#5f9ef44f[Running, pool size = 8, active threads = 8, queued tasks = 50, completed tasks = 68888704]]];]];
The nodes in the cluster are constantly pegging the CPU. I increased the index refresh interval to 30s, but that had little effect. When I check hot threads, I see multiple "Lucene Merge Thread" threads per node using 100% CPU. I also noticed that the segment count is constantly around 1000 per shard, which seems like a lot. The following is an example of a segment's stats:
"_2zo5": {
"generation": 139541,
"num_docs": 5206661,
"deleted_docs": 123023,
"size_in_bytes": 5423948035,
"memory_in_bytes": 7393758,
"committed": true,
"search": true,
"version": "5.5.2",
"compound": false
}
Extremely high "generation" number worries me and I'd like to optimize segment creation and merge to reduce CPU load on the nodes.
Details about indexing and cluster configuration:
Each node is an i2.2xl AWS instance with 8 CPU cores and 1.6T SSD drives
Documents are indexed constantly by 6 client threads with bulk size 1000
Each index has 30 shards with 1 replica
It takes about 25 sec per batch of 1000 documents
/_cat/thread_pool?h=bulk*&v shows that bulk.completed are equally spread out across nodes
Index buffer size and transaction durability are left at default
_all is disabled, but dynamic mappings are enabled
The number of merge threads is left at default, which should be OK given that I am using SSDs
What's the best way to go about it?
Thanks!
Here are the optimizations I made to the cluster to increase indexing throughput:
Increased threadpool.bulk.queue_size to 500 because index requests were frequently overloading the queues
Increased disk watermarks, because default settings were too aggressive for the large SSDs that we were using. I set "cluster.routing.allocation.disk.watermark.low": "100gb" and "cluster.routing.allocation.disk.watermark.high": "10gb"
Deleted unused indexes to free up resources ES uses to manage their shards
Increased the number of primary shards to 175, with the goal of keeping shard size under 50GB and having approximately one shard per processor
Set the client index batch size to 10MB, which seemed to work very well for us because the size of the indexed documents varied drastically (from KBs to MBs)
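In case it is useful, here is a sketch of how the first two of those changes might be applied on an ES 2.x cluster, using the values listed above; threadpool.bulk.queue_size is a node-level setting, so it goes into elasticsearch.yml and needs a restart, while the disk watermarks can be updated dynamically.
# elasticsearch.yml on each node
threadpool.bulk.queue_size: 500
# dynamic cluster settings
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "100gb",
"cluster.routing.allocation.disk.watermark.high": "10gb"
}
}'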
Hope this helps others
I have run similar workloads and your best bet is to run hourly indices and run optimize on older indices to keep segments in check.

How to prevent Elasticsearch from index throttling?

I have a 40-node Elasticsearch cluster which is hammered by a high index request rate. Each of these nodes uses an SSD for best performance. As suggested by several sources, I have tried to prevent index throttling with the following configuration:
indices.store.throttle.type: none
Unfortunately, I'm still seeing performance issues, as the cluster still periodically throttles indexing. This is confirmed by the following logs:
[2015-03-13 00:03:12,803][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:03:12,829][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:03:13,804][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:03:13,818][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:05:00,791][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:05:00,808][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:06:00,861][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:06:00,879][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
The throttling occurs after one of the 40 nodes dies for various expected reasons. The cluster immediately enters a yellow state, in which a number of shards will begin initializing on the remaining nodes.
Any idea why the cluster continues to throttle after explicitly configuring it not to? Any other suggestions to have the cluster more quickly return to a green state after a node failure?
The setting that actually corresponds to the maxNumMerges in the log file is called index.merge.scheduler.max_merge_count. Increasing this along with index.merge.scheduler.max_thread_count (where max_thread_count <= max_merge_count) will increase the number of simultaneous merges which are allowed for segments within an individual index's shards.
If you have a very high indexing rate that results in many GBs in a single index, you probably want to revisit some of the other assumptions that the Elasticsearch default settings make about segment size, too. Try raising floor_segment (the minimum size before a segment will be considered for merging), max_merged_segment (the maximum size of a single segment), and segments_per_tier (the number of segments of roughly equivalent size before they start getting merged into a new tier). On an application that has a high indexing rate and finished index sizes of roughly 120GB with 10 shards per index, we use the following settings:
curl -XPUT "http://localhost:9200/index_name/_settings" -d '{
  "settings": {
    "index.merge.policy.max_merge_at_once": 10,
    "index.merge.scheduler.max_thread_count": 10,
    "index.merge.scheduler.max_merge_count": 10,
    "index.merge.policy.floor_segment": "100mb",
    "index.merge.policy.segments_per_tier": 25,
    "index.merge.policy.max_merged_segment": "10gb"
  }
}'
Also, one important thing you can do to improve loss-of-node / node-restart recovery time on applications with high indexing rates is to take advantage of index recovery prioritization (in ES >= 1.7). Tune this setting so that the indices receiving the most indexing activity are recovered first. As you may know, the "normal" shard initialization process just copies the already-indexed segment files between nodes. However, if indexing activity occurs against a shard before or during initialization, the translog with the new documents can become very large. In the scenario where merging goes through the roof during recovery, it is the replay of this translog against the shard that is almost always the culprit. Thus, by using index recovery prioritization to recover those shards first and delaying shards with less indexing activity, you minimize the eventual size of the translog, which will dramatically improve recovery time.
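Recovery prioritization is controlled by the dynamic index.priority setting; a sketch, using one of the index names from your logs purely as an example and an arbitrary priority value (higher numbers are recovered first):
curl -XPUT "http://localhost:9200/siphon_20150313/_settings" -d '{
"index.priority": 10
}'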
We are using 1.7 and noticed a similar problem: indexing gets throttled even when the IO is not saturated (Fusion-io in our case).
After increasing "index.merge.scheduler.max_thread_count" the problem seems to be gone -- we have not seen any more throttling being logged so far.
I would try setting "index.merge.scheduler.max_thread_count" to at least the maximum reported numMergesInFlight (6 in the logs above).
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/index-modules-merge.html#scheduling
Hope this helps!
Have you looked into increasing the shard allocation delay to give the node time to recover before the master starts promoting replicas?
https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html
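For reference, the delayed allocation timeout described on that page is a dynamic index setting; if your version supports it, bumping it cluster-wide looks roughly like this (the 5m value is the documentation's example, not a recommendation):
curl -XPUT "http://localhost:9200/_all/_settings" -d '{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m"
}
}'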
try setting index.merge.scheduler.max_thread_count to 1
https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing
