circuit_breaking_exception in kibana - elasticsearch

{
statusCode: 429,
error: "Too Many Requests",
message: "[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [2047736072/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2047736072/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=854525953/814.9mb, in_flight_requests=0/0b, accounting=79344850/75.6mb], with { bytes_wanted=2047736072 & bytes_limit=2040109465 & durability="PERMANENT" }"
}

Circuit breakers are used to prevent the Elasticsearch process from dying, and there are various types of circuit breakers. Your logs show that the parent circuit breaker is the one tripping. To solve this, either increase the Elasticsearch JVM heap size (recommended) or increase the circuit breaker limit.

As Elasticsearch Ninja alluded to, this error is generally produced by Elasticsearch, even though Kibana is the one displaying it. Adjusting the heap size for Elasticsearch should generally resolve this error.
This should be done with the -Xms and -Xmx options in the jvm.options file for Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html#heap-size-settings
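To see where the heap is going before resizing it, the numbers in the breaker message can be pulled apart. The following is only a rough sketch: parse_breaker_message is a hypothetical helper, not part of any Elasticsearch client, and the heap estimate assumes the parent breaker limit is at its 7.x default of 95% of the heap.

```python
import re


def parse_breaker_message(message: str) -> dict:
    """Extract byte counts from a '[parent] Data too large' message."""

    def first_int(pattern: str) -> int:
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"pattern not found: {pattern}")
        return int(match.group(1))

    # Per-breaker usages, e.g. "fielddata=854525953/814.9mb".
    usages = {}
    usages_match = re.search(r"usages \[(.*?)\]", message)
    if usages_match:
        for name, value in re.findall(r"(\w+)=(\d+)/", usages_match.group(1)):
            usages[name] = int(value)

    real_usage = first_int(r"real usage: \[(\d+)/")
    limit = first_int(r"limit of \[(\d+)/")
    return {
        "limit": limit,
        "real_usage": real_usage,
        "usages": usages,
        # Heap in use that no child breaker accounts for.
        "untracked": real_usage - sum(usages.values()),
        # Assumption: limit is 95% of the heap (the 7.x default).
        "estimated_heap": round(limit / 0.95),
    }


MESSAGE = (
    "[parent] Data too large, data for [<http_request>] would be "
    "[2047736072/1.9gb], which is larger than the limit of "
    "[2040109465/1.8gb], real usage: [2047736072/1.9gb], "
    "new bytes reserved: [0/0b], usages [request=0/0b, "
    "fielddata=854525953/814.9mb, in_flight_requests=0/0b, "
    "accounting=79344850/75.6mb]"
)

info = parse_breaker_message(MESSAGE)
print(info["usages"])
print(info["untracked"], info["estimated_heap"])
```

For the message in the question, this suggests a heap of about 2 GiB, with fielddata (about 815 MB) as the largest tracked consumer, so reducing fielddata usage may be worth looking at alongside a larger heap.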

Related

Data too large ElasticSearch issue along with Readiness probe failed

We have set up an EFK stack for our project, and since yesterday Kibana seems to be down. When we initially troubleshot it, we found the following errors:
Readiness probe failed: Error: Got HTTP code 503 but expected a 200 & Readiness probe failed: Error: Got HTTP code 000 but expected a 200
Later we found the same issue with the Elasticsearch pod as well. Along with this, we found the following issue with the data request limit:
FATAL
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"},"status":429}
We have tried changing the REDINESS_PROBE_TIMEOUT, Initial Delay, Timeout, Probe Period, Success Threshold, and Failure Threshold. We also tried increasing the indices breaker limit, but the change is not taking effect: the error still shows the old limits. We also tried fixing the circuit_breaking_exception by adding ES_JAVA_OPTS values.
Nothing seems to be working, any help would be appreciated.
The same phenomenon occurred for us while the service was in operation. We identified the issue as a memory shortage, so there are several ways to approach it:
1. Physical memory expansion (scale out): add equipment, because there is not enough memory available.
2. Lower the load through monitoring: if circuit_breaking_exception keeps appearing in the log, build monitoring that reduces the load.
3. Set java_opts: you can set the memory usage, but it is meaningless if you don't have enough hardware memory.
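The monitoring idea can be sketched as a check over the parent breaker statistics that Elasticsearch reports under GET _nodes/stats/breaker. This is an illustration only: nodes_near_parent_limit and the 0.85 threshold are hypothetical, and the sample dictionary is made up (it borrows the byte counts from the error above and invents the node names and tripped counts) rather than being output from a real cluster.

```python
def nodes_near_parent_limit(stats: dict, threshold: float = 0.85) -> list:
    """Return (node_name, ratio) for nodes whose parent breaker usage
    is at or above `threshold` of its limit."""
    hot = []
    for node_id, node in stats.get("nodes", {}).items():
        parent = node.get("breakers", {}).get("parent")
        if not parent:
            continue
        limit = parent["limit_size_in_bytes"]
        used = parent["estimated_size_in_bytes"]
        if limit > 0 and used / limit >= threshold:
            hot.append((node.get("name", node_id), round(used / limit, 3)))
    return hot


# Made-up sample shaped like a _nodes/stats/breaker response.
sample = {
    "nodes": {
        "node-a": {"name": "es-0", "breakers": {"parent": {
            "limit_size_in_bytes": 1020054732,
            "estimated_size_in_bytes": 1036909056,
            "tripped": 12}}},
        "node-b": {"name": "es-1", "breakers": {"parent": {
            "limit_size_in_bytes": 1020054732,
            "estimated_size_in_bytes": 408021892,
            "tripped": 0}}},
    }
}

hot_nodes = nodes_near_parent_limit(sample)
print(hot_nodes)
```

A caller could pause or slow down indexing while this list is non-empty; how the load is actually reduced depends on the application.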

Increase queue capacity in ElasticSearch

Elastic version 7.8
I'm getting an error when running this code for thousands of records:
var bulkIndexResponse = await _client.BulkAsync(i => i
    .Index(indexName)
    .IndexMany(bases));
if (!bulkIndexResponse.IsValid)
{
    throw bulkIndexResponse.OriginalException;
}
It eventually crashes with the following error:
Invalid NEST response built from a successful (200) low level call on POST: /indexname/_bulk
# Invalid Bulk items:
operation[1159]: index returned 429 _index: indexname _type: _doc _id: _version: 0 error: Type:
es_rejected_execution_exception Reason: "Could not perform enrichment, enrich coordination queue at
capacity [1024/1024]"
I would like to know how this enrich coordination queue capacity can be increased to accommodate continuous calls of BulkAsync with around a thousand records on each call.
You can check which thread pool is filling up with /_cat/thread_pool?v and increase its queue (as Ninja said) in elasticsearch.yml on each node.
However, increasing the queue size increases heap consumption and may in turn affect performance.
This error can have two causes. First, you may be sending bulk requests that are too large; try reducing each bulk request to fewer than 500 items. Second, you may have a performance problem; try to find and resolve it. You may need to add more nodes to your cluster.
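Splitting a large set of records into smaller bulk requests can be sketched as follows. This is an illustration in Python rather than the asker's C#: chunked and index_in_batches are hypothetical helpers, send_bulk stands in for whatever call actually performs the bulk request, and the batch size of 500 is only the upper bound suggested above, not a tuned value.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_in_batches(docs: Iterable[T],
                     send_bulk: Callable[[List[T]], None],
                     batch_size: int = 500) -> int:
    """Send docs one batch at a time; return the number of requests."""
    requests = 0
    for batch in chunked(docs, batch_size):
        send_bulk(batch)  # hypothetical: one bulk request per batch
        requests += 1
    return requests


# Demo with a fake sender that only records batch sizes.
sizes: List[int] = []
count = index_in_batches(range(1200), lambda b: sizes.append(len(b)))
print(count, sizes)
```

Sending the batches one after another, instead of issuing them all concurrently, also keeps fewer operations queued on the cluster at any one time.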
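Not sure what version you are on, but this enrich coordination queue seems to be the bulk queue, and you can increase the queue size (it is node-specific) by changing the elasticsearch.yml of that node. Note, though, that the capacity of 1024 in the error matches the default of the enrich.coordinator_proxy.queue_capacity node setting rather than a thread pool queue, so that setting may be the one to raise.
Refer to the thread pool documentation in ES for more info.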
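To find which pool is rejecting work, the text output of /_cat/thread_pool?v can be filtered for non-zero rejection counts. The sketch below assumes the default columns (node_name, name, active, queue, rejected); pools_with_rejections is a hypothetical helper and the sample rows are invented for illustration.

```python
def pools_with_rejections(cat_output: str) -> list:
    """Return (node, pool, queue, rejected) for pools that rejected work.

    Assumes whitespace-separated columns with a header row, and node
    names that contain no spaces.
    """
    lines = [line for line in cat_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    return [
        (row["node_name"], row["name"], int(row["queue"]), int(row["rejected"]))
        for row in rows
        if int(row["rejected"]) > 0
    ]


# Invented sample in the shape of /_cat/thread_pool?v output.
SAMPLE = """
node_name name   active queue rejected
es-0      search      0     0        0
es-0      write       2   200      318
es-1      write       1     3        0
"""

rejecting = pools_with_rejections(SAMPLE)
print(rejecting)
```

A pool whose queue sits at its capacity with a growing rejected count is the one the queue-size advice applies to.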

Elasticsearch 7.x circuit breaker - data too large - troubleshoot

The problem:
Since upgrading from ES-5.4 to ES-7.2, I have been getting "data too large" errors when trying to write concurrent bulk requests (and/or search requests) from my multi-threaded Java application (using the elasticsearch-rest-high-level-client-7.2.0.jar Java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size = 20000
thread_pool.write.queue_size = 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
    "bytes_wanted": 3144831050,
    "bytes_limit": 3060164198,
    "durability": "PERMANENT"
  },
  "status": 429
}
Thoughts:
I'm having a hard time pinpointing the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb VM), the problem becomes very visible, so one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
I would like to understand what scenarios could have led to this error,
and what action I can take in order to handle it properly.
(change circuit-breaker values, change es.yml configuration, change/limit my ES requests)
The reason is that the heap of the node is pretty full. Being caught by the circuit breaker is a good thing, because it prevents the nodes from running into OOMs, going stale and crashing.
Elasticsearch 6.2.0 introduced the circuit breaker and improved it in 7.0.0. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.
I see 3 solutions so far:
1. Increase the heap size if possible.
2. Reduce the size of your bulk requests if feasible.
3. Scale out your cluster, as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes will help the cluster distribute the shards and requests among more nodes, which leads to a lower average heap usage on all nodes.
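A related client-side measure for a multi-threaded writer is to cap how many bulk requests are in flight at once, so that many threads cannot hit a node at the same moment. The sketch below is one possible shape for this, written in Python for brevity although the asker's application is Java: BulkLimiter, fake_send and the cap of 2 are all hypothetical, and the real cap would need to be found by testing.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BulkLimiter:
    """Allow at most `max_in_flight` concurrent send calls."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._current = 0
        self.peak = 0  # highest concurrency observed, for verification

    def run(self, send_bulk, batch):
        with self._slots:  # blocks while the cap is reached
            with self._lock:
                self._current += 1
                self.peak = max(self.peak, self._current)
            try:
                return send_bulk(batch)
            finally:
                with self._lock:
                    self._current -= 1


def fake_send(batch):
    """Stand-in for a real bulk request; returns the batch size."""
    time.sleep(0.01)
    return len(batch)


limiter = BulkLimiter(max_in_flight=2)
batches = [[1, 2, 3]] * 20
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda b: limiter.run(fake_send, b), batches))
print(limiter.peak, sum(results))
```

Even with eight worker threads, no more than two sends run at the same time, which bounds the request memory the cluster has to hold for this client.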
As an UGLY workaround (not solving the issue) one could increase the limit after reading and understanding the implications:
So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we are suddenly getting those errors.
1. The circuit breaker mechanism has existed since the very first versions.
2. We started experiencing issues with it when moving from version 5.4 to 7.2.
3. By version 7.2, ES had introduced a new way of calculating when to trip the circuit breaker, based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767).
4. In our internal upgrade of ES to version 7.2, we changed the JDK from 8 to 11.
5. Also as part of our internal upgrade, we changed the default jvm.options configuration, replacing the officially recommended CMS GC with G1GC, which Elasticsearch has only fairly recently started to support.
6. Considering all the above, I found this bug, fixed in version 7.4, concerning the use of the circuit breaker together with G1GC: https://github.com/elastic/elasticsearch/pull/46169
How to fix:
1. Change the configuration back to CMS GC.
2. Or take the fix. The fix for the bug is just a configuration change that can easily be applied and tested in your deployment.

elasticsearch es_rejected_execution_exception

I'm trying to index a 12mb log file which has 50,000 logs.
After indexing around 30,000 logs, I'm getting the following error:
[2018-04-17T05:52:48,254][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7#560f63a9 on EsThreadPoolExecutor[name = EC2AMAZ-1763048/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#7d6ae98b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 3834]]"})
I've gone through the documentation and the Elasticsearch forum, which suggested that I increase the Elasticsearch bulk queue size. I tried to do that using curl, but I'm not able to.
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"threadpool.bulk.queue_size" : 100}}'
Is increasing the queue size a good option? I can't increase the hardware because I have only a small amount of data.
Is the error I'm facing due to a problem with the queue size or something else? If it is the queue size, how do I update it in elasticsearch.yml, and do I need to restart ES after updating elasticsearch.yml?
Please let me know. Thanks for your time.
Once your indexing can't keep up with the indexing requests, Elasticsearch enqueues them in threadpool.bulk.queue and starts rejecting them if the number of requests in the queue exceeds threadpool.bulk.queue_size.
It's a good idea to consider throttling your indexing. The thread pool size defaults are generally good; while you can increase them, you may not have enough resources (memory, CPU) available.
This blog post from elastic.co explains the problem really well.
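Throttling on the client side usually means backing off when the cluster answers with 429 instead of resending immediately. The following is a minimal sketch under stated assumptions: Rejected is a stand-in exception for a 429 / es_rejected_execution_exception response, send is a hypothetical callable that performs the request, and the retry count and delays are arbitrary illustrative values.

```python
import random
import time


class Rejected(Exception):
    """Stand-in for a 429 / es_rejected_execution_exception response."""


def send_with_backoff(send, batch, max_retries=5, base_delay=0.5,
                      sleep=time.sleep):
    """Call send(batch); on rejection wait with exponential backoff
    plus jitter, then retry. Re-raises after max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except Rejected:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Demo: a fake sender that is rejected twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise Rejected()
    return len(batch)


result = send_with_backoff(flaky_send, ["a", "b"], sleep=delays.append)
print(result, calls["n"], len(delays))
```

The sleep function is injectable so the demo records the waits instead of actually sleeping; real code would leave the default in place.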
Reducing the batch size resolved my problem.
POST _reindex
{
  "source": {
    "index": "sourceIndex",
    "size": 100
  },
  "dest": {
    "index": "destIndex"
  }
}

Best practices for ElasticSearch initial bulk import

I'm running a Docker setup with ElasticSearch, Logstash, Filebeat and Kibana, inspired by the Elastic Docker Compose. I need to do an initial load of 15 GB of log files into the system (Filebeat->Logstash->ElasticSearch), but I'm having some issues with performance.
It seems that Filebeat/Logstash is outputting too much work for ElasticSearch. After some time I begin to see a bunch of errors in ElasticSearch like this:
[INFO ][o.e.i.IndexingMemoryController] [f8kc50d] now throttling indexing for shard [log-2017.06.30]: segment writing can't keep up
I've found this old documentation article on how to disable merge throttling: https://www.elastic.co/guide/en/elasticsearch/guide/master/indexing-performance.html#segments-and-merging.
PUT /_cluster/settings
{
  "transient" : {
    "indices.store.throttle.type" : "none"
  }
}
But in the current version (ElasticSearch 6) it gives me this error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
  },
  "status": 400
}
How can I solve the above issue?
The VM has 4 CPU cores (Intel Xeon E5-2650) and ElasticSearch is assigned 4GB of RAM, with Logstash and Kibana at 1GB each. Swapping is disabled using "swapoff -a". X-Pack and monitoring are enabled. I only have one ES node for this log server. Do I need multiple nodes for this initial bulk import?
EDIT1:
Changing the number_of_replicas and refresh_interval seems to make it perform better. Still testing.
PUT /log-*/_settings
{
  "index.number_of_replicas" : "0",
  "index.refresh_interval" : "-1"
}
Most likely the bottleneck is IO (you can confirm this by running iostat; it would also be useful if you posted ES monitoring screenshots), so you need to reduce the pressure on it.
The default ES configuration causes many index segments to be generated during a bulk load. To fix this, for the bulk load, increase index.refresh_interval (or set it to -1) - see doc. The default value is 1 sec, which causes a new segment to be created every second. Also try increasing the batch size and see if it helps.
Also, if you use spinning disks, set index.merge.scheduler.max_thread_count to 1. This allows only one thread to perform segment merging and reduces contention for IO between segment merging and indexing.
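The settings discussed in this thread can be collected into request bodies for PUT /log-*/_settings, one to apply before the bulk load and one to restore afterwards. This is a sketch only: both function names are hypothetical, the restore step is an addition not described above, and the restore values assume the index previously used the defaults (a 1s refresh interval, as mentioned in the answer, and 1 replica, which is an assumption about this setup).

```python
import json


def bulk_load_settings(spinning_disks: bool = False) -> dict:
    """Index settings to apply before a large initial import."""
    settings = {
        "index.number_of_replicas": 0,   # no replica writes during load
        "index.refresh_interval": "-1",  # disable periodic refresh
    }
    if spinning_disks:
        # Only one merge thread, to reduce IO contention on HDDs.
        settings["index.merge.scheduler.max_thread_count"] = 1
    return settings


def restore_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Settings to put back after the import (assumed previous values)."""
    return {
        "index.number_of_replicas": replicas,
        "index.refresh_interval": refresh_interval,
    }


before = bulk_load_settings(spinning_disks=True)
after = restore_settings()
print(json.dumps(before, indent=2))
print(json.dumps(after, indent=2))
```

Each printed object can be sent as the body of the settings request shown in EDIT1; with a single node, keeping replicas at 0 afterwards may also be reasonable.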

Resources