Data too large ElasticSearch issue along with Readiness probe failed - elasticsearch

We have set up an EFK stack for our project, and since yesterday Kibana has been down. When we started troubleshooting, we found the following errors:
Readiness probe failed: Error: Got HTTP code 503 but expected a 200 & Readiness probe failed: Error: Got HTTP code 000 but expected a 200
Later we found the same issue with the Elasticsearch pod as well. Along with this, we found the following "Data too large" error from the circuit breaker:
FATAL
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"},"status":429}
We have tried changing READINESS_PROBE_TIMEOUT, as well as the probe's initial delay, timeout, period, success threshold, and failure threshold. We also tried increasing the indices breaker limit, but the change is not taking effect: the error still reports the old limit. We also tried to fix the circuit_breaking_exception by adding ES_JAVA_OPTS values.
Nothing seems to work; any help would be appreciated.

We saw the same behaviour while running our service in production, and it turned out to be a memory shortage. There are a few ways to approach it:
Physical memory expansion (scale out)
Add hardware, because the available memory is insufficient.
Lower the load through monitoring
If circuit_breaking_exception keeps appearing in the log, set up monitoring that lowers the load when it does.
Setting java_opts
You can set the heap size there, but it is pointless if the machine does not have enough physical memory.
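As a rough sanity check, you can work backwards from the bytes_limit in the error to the heap the JVM is actually running with. This is only a sketch: it assumes the parent breaker is at the 7.x default of 95% of the heap, and the function name is made up for illustration.

```python
# Back-of-the-envelope check, not an Elasticsearch API.
# Assumption: the parent breaker limit is the 7.x default of 95% of the JVM heap.
PARENT_LIMIT_FRACTION = 0.95


def inferred_heap_mib(bytes_limit: int, fraction: float = PARENT_LIMIT_FRACTION) -> int:
    """Estimate the JVM heap (in MiB) that would produce this parent breaker limit."""
    return round(bytes_limit / fraction / 1024**2)


# bytes_limit taken from the error in the question
heap_mib = inferred_heap_mib(1020054732)
print(f"inferred heap: {heap_mib} MiB")
```

If the inferred figure is smaller than the heap you believe you configured, the heap setting is probably not reaching the Elasticsearch JVM in the pod, which would also explain why the limit in the error never changes.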

Related

circuit_breaking_exception in Kibana

{
statusCode: 429,
error: "Too Many Requests",
message: "[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [2047736072/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2047736072/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=854525953/814.9mb, in_flight_requests=0/0b, accounting=79344850/75.6mb], with { bytes_wanted=2047736072 & bytes_limit=2040109465 & durability="PERMANENT" }"
}
Circuit breakers exist to keep the Elasticsearch process from dying, and there are several types of them. Your log shows that the parent circuit breaker is the one tripping. To solve this, either increase the Elasticsearch JVM heap size (recommended) or raise the circuit breaker limit.
As Elasticsearch Ninja alluded to, this error is generally produced from Elasticsearch, despite Kibana being the one displaying the error. Adjusting the heap size for Elasticsearch should generally resolve this error.
This is done with the -Xms and -Xmx options in the jvm.options file for Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html#heap-size-settings
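Before resizing the heap, it can help to see which breaker accounts for the usage. The sketch below parses the reason string from the error above; it assumes the message keeps the format shown there, and the helper name is hypothetical.

```python
import re

# Reason string copied from the Kibana error above (trailing "with {...}" part omitted).
reason = (
    "[parent] Data too large, data for [<http_request>] would be [2047736072/1.9gb], "
    "which is larger than the limit of [2040109465/1.8gb], real usage: [2047736072/1.9gb], "
    "new bytes reserved: [0/0b], usages [request=0/0b, fielddata=854525953/814.9mb, "
    "in_flight_requests=0/0b, accounting=79344850/75.6mb]"
)


def parse_breaker_reason(text: str) -> dict:
    """Extract real usage and per-breaker usages (in bytes) from a breaker message."""
    real = int(re.search(r"real usage: \[(\d+)/", text).group(1))
    usages_block = re.search(r"usages \[(.*?)\]", text).group(1)
    usages = {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)/", usages_block)}
    return {"real_usage": real, "usages": usages}


info = parse_breaker_reason(reason)
tracked = sum(info["usages"].values())
# Largest consumers first
for name, used in sorted(info["usages"].items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {used / 1024**2:8.1f} MiB")
print(f"tracked by child breakers: {tracked / 1024**2:.1f} MiB")
print(f"real heap usage:           {info['real_usage'] / 1024**2:.1f} MiB")
```

In this error fielddata is the largest tracked consumer, and the rest of the real usage is heap that no child breaker accounts for, so a larger heap is the more direct fix.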

Elasticsearch 7.x circuit breaker - data too large - troubleshoot

The problem:
Since upgrading from ES-5.4 to ES-7.2, I have been getting "data too large" errors when sending concurrent bulk requests (and/or search requests) from my multi-threaded Java application (using the elasticsearch-rest-high-level-client-7.2.0.jar Java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size: 20000
thread_pool.write.queue_size: 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
    "bytes_wanted": 3144831050,
    "bytes_limit": 3060164198,
    "durability": "PERMANENT"
  },
  "status": 429
}
Thoughts:
I'm having a hard time pinpointing the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb VM), the problem becomes very visible, so one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
What scenarios could have led to this error?
What action can I take to handle it properly
(change circuit-breaker values, change the es.yml configuration, change/limit my ES requests)?
The reason is that the node's heap is nearly full. Being caught by the circuit breaker is actually a good thing, because it keeps the nodes from running into OOMs, going stale, and crashing.
Elasticsearch 6.2.0 introduced the circuit breaker and improved it in 7.0.0. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.
I see 3 solutions so far:
Increase heap size if possible
Reduce the size of your bulk requests if feasible
Scale out your cluster, as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes let the cluster distribute the shards and requests more widely, which leads to a lower average heap usage on all nodes.
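The second option (smaller bulk requests) can be sketched as a size-based batcher. This is a minimal illustration in plain Python, not client code: the function name and the 200-byte limit are made up, and it only counts the document lines, not the action metadata lines a real bulk body also carries.

```python
import json
from typing import Iterable, Iterator, List


def chunk_by_bytes(docs: Iterable[dict], max_bytes: int) -> Iterator[List[str]]:
    """Group JSON-serialised documents into batches no larger than max_bytes.

    A single document larger than max_bytes is emitted as its own batch.
    """
    batch: List[str] = []
    batch_size = 0
    for doc in docs:
        line = json.dumps(doc)
        size = len(line.encode("utf-8")) + 1  # +1 for the newline in the bulk body
        if batch and batch_size + size > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(line)
        batch_size += size
    if batch:
        yield batch


# Hypothetical data: 10 small documents, at most 200 bytes per batch.
docs = [{"id": i, "message": "x" * 50} for i in range(10)]
batches = list(chunk_by_bytes(docs, max_bytes=200))
print([len(b) for b in batches])
```

Capping by bytes rather than by document count keeps the per-request memory predictable even when document sizes vary a lot.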
As an UGLY workaround (not solving the issue) one could increase the limit after reading and understanding the implications:
So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we suddenly started getting these errors.
The circuit breaker mechanism has existed since the very first versions.
We started experiencing issues with it when moving from version 5.4 to 7.2.
The 7.x line (starting with 7.0, so including 7.2) introduced a new way of calculating when to trip the breaker: circuit breaking based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767)
In our internal upgrade of ES to version 7.2, we changed the JDK from 8 to 11.
Also as part of our internal upgrade, we changed the default jvm.options configuration, switching from the officially recommended CMS GC to G1GC, which Elasticsearch has only fairly recently started to support.
Considering all the above, I found this bug, fixed in version 7.4, regarding the use of the circuit breaker together with G1GC: https://github.com/elastic/elasticsearch/pull/46169
How to fix:
Change the configuration back to CMS GC.
Or take the fix. The fix for the bug is just a configuration change (as far as I can tell from the PR, setting -XX:G1ReservePercent=25 and -XX:InitiatingHeapOccupancyPercent=30 in jvm.options) that can easily be applied and tested in your deployment.

How to get the real & actual storage usage of pods on Kubernetes?

Is there any straight-forward way to get the actual storage usage of pods on Kubernetes?
I've tried to do so using Prometheus, but only the amount of storage allocated to every pod is exposed, not what is really consumed by my application (pods).
I need a way to see how much storage every pod is consuming and to report that to Prometheus or Grafana.
There is a way but it might not be a 'straight forward' one.
If the pods are running on Linux, you can execute:
kubectl exec -it <pod> -- cat /proc/1/io
It returns the I/O statistics of the container's main process (PID 1). Here is a description of the fields:
rchar
-----
I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache)
wchar
-----
I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written
to disk. Similar caveats apply here as with rchar.
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems. <please add status regarding NFS and
CIFS at a later time>
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
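If you want to feed these counters into your own monitoring, the output is easy to parse. A minimal sketch follows; the sample values are made up and the function name is hypothetical. Keep in mind that these are cumulative I/O counters for the process, not the amount of disk space currently in use.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


# Made-up sample output, only to show the format.
sample = """rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""
io = parse_proc_io(sample)
print(io["read_bytes"], io["write_bytes"])
```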
You can also get info regarding disk usage of a particular container. It was already described here.
Please let me know if that helped.
This is tricky.
Prometheus already scrapes some kubelet metrics, so I just created a Grafana dashboard with the parameters below, and it worked:
Query:
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
Grafana legend:
{{ namespace }} | {{ persistentvolumeclaim }}
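If you would rather consume the result of that query from a script than from Grafana, the instant-query response of the Prometheus HTTP API can be reduced to the same "namespace | claim" legend. This is only a sketch: the payload below is a made-up example in the shape of an instant-vector response, and the function name is hypothetical.

```python
import json

# Hypothetical response body in the shape of a Prometheus instant-vector result;
# the namespace, claim name and numbers are made up.
payload = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"namespace": "logging", "persistentvolumeclaim": "es-data-0"},
             "value": [1700000000, "73.5"]}
          ]}}
""")


def pvc_usage_percent(response: dict) -> dict:
    """Map 'namespace | claim' to the used-capacity percentage returned by the query."""
    usage = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        key = f'{labels["namespace"]} | {labels["persistentvolumeclaim"]}'
        usage[key] = float(sample["value"][1])  # value is [timestamp, "string"]
    return usage


print(pvc_usage_percent(payload))
```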

How can I resolve HTTPSConnectionPool(host='www.googleapis.com', port=443) Max retries exceeded with url (Google cloud storage)

I have created an API using Django REST Framework.
The API communicates with Google Cloud Storage to store profile images (around 1 MB per picture).
While load testing that server (around 1000 requests/s), I encountered the following error.
It seems to be a Cloud Storage request-limit issue, but I am unable to figure out the solution.
Exception Type: SSLError at /api/v1/users
Exception Value: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/<gcp-bucket-name>?projection=noAcl (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')",),))
Looks like you have the answer to your question here:
"...buckets have an initial IO capacity of around 1000 write requests
per second...As the request rate for a given bucket grows, Cloud
Storage automatically increases the IO capacity for that bucket"
So the bucket scales automatically. The only catch is that you need to increase the request rate gradually, as described here:
"If your request rate is expected to go over these thresholds, you should start with a request rate below or near the thresholds and then double the request rate no faster than every 20 minutes"
It looks like your bucket's I/O capacity should increase over time, so the same load should work later.
You are right at the edge (1000 req/s), and I guess that is what is causing your error.
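To get a feel for how long a ramp-up takes under the quoted guidance (doubling no faster than every 20 minutes), here is a small sketch. The starting rate of 500 req/s and the target of 4000 req/s are made-up numbers, and the function name is hypothetical.

```python
def ramp_schedule(start_rps: float, target_rps: float, step_minutes: int = 20):
    """Return (minute, rate) pairs, doubling the rate once per step until target is reached."""
    schedule = [(0, start_rps)]
    rate, minute = start_rps, 0
    while rate < target_rps:
        rate = min(rate * 2, target_rps)  # never overshoot the target
        minute += step_minutes
        schedule.append((minute, rate))
    return schedule


# Hypothetical load test: start below the threshold and ramp to 4000 req/s.
for minute, rate in ramp_schedule(500, 4000):
    print(f"t+{minute:3d} min: {rate:.0f} req/s")
```

A load test that jumps straight to the target rate skips this ramp, which matches the situation in the question.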

Performance issues running kafkacat over slow speed link

I have weird performance issues with the fetch.max.message.bytes parameter in the librdkafka consumer implementation (version 0.11). I ran some tests using kafkacat over a slow network link (4 Mbps) and got the following results:
1024 bytes = 1.740s
65536 bytes = 2.670s
131072 bytes = 7.070s
When I started debugging protocol messages, I noticed RTT values that are way too high:
|SEND|rdkafka| Sent FetchRequest (v4, 68 bytes # 0, CorrId 8)
|RECV|rdkafka| Received FetchResponse (v4, 131120 bytes, CorrId 8, rtt 607.68ms)
It seems that increasing the fetch.max.message.bytes value causes very high network saturation, yet each response carries only a single message.
On the other hand, when I try kafka-console-consumer, everything runs as expected (I get a throughput of 500 messages per second over the same network link).
Any ideas or suggestions on where to look?
You are most likely hitting issue #1384 which is a bug with the new v0.11.0 consumer. The bug is particularly evident on slow links or with MessageSets/batches with few messages.
A fix is on the way.
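For context, a quick calculation shows how much of the 607.68 ms RTT in the log is simply the time needed to push the response through the link. It is a rough lower bound only: it assumes 4 Mbps means 4,000,000 bits per second of usable bandwidth and ignores protocol overhead, and the function name is hypothetical.

```python
def min_transfer_seconds(payload_bytes: int, link_mbps: float) -> float:
    """Lower bound on the time to push payload_bytes through a link of link_mbps."""
    return payload_bytes * 8 / (link_mbps * 1_000_000)


# 131120-byte FetchResponse from the debug log, 4 Mbps link from the question.
t = min_transfer_seconds(131120, 4)
print(f"{t * 1000:.0f} ms")
```

So a large part of each round trip is spent transferring the response itself. If every response delivers only one message, raising fetch.max.message.bytes makes each message more expensive, which is consistent with the timings in the question.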

Resources