es_rejected_execution_exception rejected execution - elasticsearch

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running, pool size = 16, active threads = 16, queued tasks = 51, completed tasks = 407667]]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other is data-only (data: true, master: false). Both are EC2 i2.4xlarge instances (16 cores, 122 GB RAM, 320 GB instance storage). 2 shards, 1 replica.
These two nodes are fed by our aggregation server, which runs 20 separate workers. Each worker sends bulk indexing requests to our ES cluster with 50 items per request, and each item is between 1,000 and 4,000 characters.
Current server setup: 4 client-facing servers -> aggregation server -> Elasticsearch.
Now, the issue is that this error only started occurring when we introduced the second node. Before, with one machine, we had a consistent indexing throughput of 20k requests per second. With two machines, once we hit the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator that produces random documents to be indexed; the documents are generally the same size but have random parameters. We use it for stress testing and stability checks. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. The interesting part is that with the generator we are able to index around 40-45k items per second (~80% CPU usage) without getting this error. So it is puzzling why we get the error at all. Has anyone seen this error or know what could be causing it?
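One way to confirm where the rejections come from is to watch the bulk thread pool while the workers are running. Below is a minimal monitoring sketch, assuming Python with the requests library and a node reachable at http://localhost:9200 (both are placeholders for your setup); on newer versions the pool is named write rather than bulk.

import time
import requests

ES = "http://localhost:9200"  # assumption: replace with your node address

while True:
    # columns: node, pool name, active threads, queued tasks, rejection count
    resp = requests.get(
        f"{ES}/_cat/thread_pool/bulk,write",
        params={"v": "true", "h": "node_name,name,active,queue,rejected"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.text)
    time.sleep(5)

A steadily growing rejected counter on only one of the two nodes would at least tell you whether the problem is confined to a single node.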

Related

OpenSearch throws 429 errors when Fluent Bit outputs logs under heavy load

With the Fluent Bit configuration below, we are getting errors from OpenSearch under heavy load.
[Chart: HTTP bulk requests to OpenSearch from Fluent Bit, with the 429 errors showing as spikes]
Fluentbit config:
[INPUT]
    Name tail
    Tag kube.*
    Path /var/log/containers/*.log
    DB /var/log/flb_kube.db
    Mem_Buf_Limit 400M
    storage.type filesystem
    Skip_Long_Lines On
    Refresh_Interval 1
    Rotate_Wait 600

[OUTPUT]
    Name es
    Match kube.*
    Host ${ES_HOST}
    Port ${PORT}
    Buffer_Size False
    AWS_Auth Off
    AWS_Role_ARN ${ES_ARN}
    AWS_External_ID ${ES_IAMROLE}
    HTTP_User ${ES_USER}
    HTTP_Passwd ${ES_PASSWD}
    tls On
    tls.verify Off
    Trace_Output ${TRACE_OUTPUT}
    Trace_Error On
    Replace_Dots On
    Index fluentbit
    Type flb
    AWS_Region ${AWS_REGION}
    Logstash_Format On
    Logstash_Prefix ${ES_LOGSTASHPREFIX}_app_log
    Logstash_DateFormat %Y.%m.%d
    Retry_Limit 10
    storage.total_limit_size 1G
To resolve this we upgraded our OpenSearch instance type from r5.xlarge.search (4 nodes) to r5.2xlarge.search (3 nodes), but that didn't solve the issue.
We also increased the ES index refresh_interval to 60s, but that didn't help.
We read that Fluent Bit's output to ES can be controlled via buffering, so we decreased Mem_Buf_Limit to 400M, but that didn't help either.
Can someone suggest anything else we can try, or point out what we are missing?
The issue here is not with Fluent Bit but with OpenSearch/Elasticsearch.
HTTP 429 errors (es_rejected_execution_exception) occur when more requests are sent to the cluster than its thread pools can handle. Thread pools in OpenSearch are allocated differently for different tasks, with search operations getting a larger share. The option to manually modify thread pool allocation is not available for versions 5.1 and later.
You can try to resolve this in a few ways:
1. Increase the refresh interval (you already did that and it didn't help).
2. Lower the indexing rate: send logs at a longer interval than you currently do.
3. Scale up (you did, and it didn't work either).
You can estimate the thread pool sizes with the following formulas:
Threads allocated for writes = number of virtual CPUs (your case)
Threads allocated for search = ((3 * number of virtual CPUs) / 2) + 1
So my guess is that your issue is a large number of shards. You can either decrease the number of shards per index, or, if you only hit this occasionally under extra load, temporarily change the replica count to 0 and change it back to the original once the busy period is over (see the sketch below).
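If you go the replica route, the change can be made dynamically with the index settings API. A minimal sketch, assuming Python with requests, a cluster at http://localhost:9200 and an index named my-index (all placeholders):

import requests

ES = "http://localhost:9200"   # assumption: your cluster endpoint
INDEX = "my-index"             # assumption: the index under write pressure

def set_replicas(count: int) -> None:
    # number_of_replicas is a dynamic index setting, so no restart is needed
    resp = requests.put(
        f"{ES}/{INDEX}/_settings",
        json={"index": {"number_of_replicas": count}},
        timeout=10,
    )
    resp.raise_for_status()

set_replicas(0)   # before the heavy-ingest window
# ... heavy indexing happens here ...
set_replicas(1)   # restore the original replica count afterwards

Note that running with zero replicas trades durability for throughput, so only do this for data you can afford to replay.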
Check these two links to find out more about optimizing your ES domain.
indexing performance
Best practices

Increase queue capacity in ElasticSearch

Elastic version 7.8
I'm getting an error when running this code for thousands of records:
var bulkIndexResponse = await _client.BulkAsync(i => i
    .Index(indexName)
    .IndexMany(bases));

if (!bulkIndexResponse.IsValid)
{
    throw bulkIndexResponse.OriginalException;
}
It eventually crashes with the following error:
Invalid NEST response built from a successful (200) low level call on POST: /indexname/_bulk
# Invalid Bulk items:
operation[1159]: index returned 429 _index: indexname _type: _doc _id: _version: 0 error: Type:
es_rejected_execution_exception Reason: "Could not perform enrichment, enrich coordination queue at
capacity [1024/1024]"
I would like to know how this enrich coordination queue capacity can be increased to accommodate continuous calls of BulkAsync with around a thousand records on each call.
You can check which thread pool is filling up with /_cat/thread_pool?v and increase its queue size (as ninja said) in elasticsearch.yml on each node.
However, increasing the queue size also increases heap consumption and may in turn hurt performance.
When you get this error there are usually two causes: either you are sending bulk requests that are too large (try keeping them at 500 documents or fewer), or you have a performance issue that you need to find and fix, possibly by adding more nodes to your cluster.
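As a sketch of the "smaller bulk requests" advice above: split the payload into chunks of a few hundred documents and send each chunk as its own _bulk call. The example uses Python with requests against a placeholder http://localhost:9200 and index my-index rather than NEST, but the same chunking idea applies to BulkAsync.

import json
import requests

ES = "http://localhost:9200"   # assumption: your cluster endpoint
INDEX = "my-index"             # assumption: target index
CHUNK_SIZE = 500               # keep each _bulk request small, as suggested above

def bulk_in_chunks(docs):
    # send the documents in batches of CHUNK_SIZE using the _bulk NDJSON format
    for start in range(0, len(docs), CHUNK_SIZE):
        chunk = docs[start:start + CHUNK_SIZE]
        lines = []
        for doc in chunk:
            lines.append(json.dumps({"index": {"_index": INDEX}}))  # action line
            lines.append(json.dumps(doc))                           # source line
        body = "\n".join(lines) + "\n"                               # _bulk needs a trailing newline
        resp = requests.post(
            f"{ES}/_bulk",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            timeout=30,
        )
        resp.raise_for_status()
        if resp.json().get("errors"):
            raise RuntimeError("some items in this chunk were rejected")

bulk_in_chunks([{"message": f"doc {i}"} for i in range(2000)])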
Not sure what version you are on, but this enrich coordination queue seems to be the bulk queue, and you can increase the queue size (these settings are node-specific) by changing the elasticsearch.yml of that node.
Refer to the thread pools documentation in ES for more info.

elasticsearch es_rejected_execution_exception

I'm trying to index a 12 MB log file which has 50,000 logs.
After indexing around 30,000 logs, I get the following error:
[2018-04-17T05:52:48,254][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7#560f63a9 on EsThreadPoolExecutor[name = EC2AMAZ-1763048/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#7d6ae98b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 3834]]"})
I've gone through the documentation and the Elasticsearch forum, which suggested increasing the Elasticsearch bulk queue size. I tried to do that using curl but could not:
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"threadpool.bulk.queue_size" : 100}}'
Is increasing the queue size a good option? I can't add hardware because I have fairly little data.
Is the error due to the queue size, or something else? If it is the queue size, how do I update it in elasticsearch.yml, and do I need to restart ES after updating elasticsearch.yml?
Please let me know. Thanks for your time.
Once your cluster can't keep up with indexing requests, Elasticsearch enqueues them in threadpool.bulk.queue and starts rejecting them when the number of queued requests exceeds threadpool.bulk.queue_size.
It's a good idea to consider throttling your indexing instead. The thread pool size defaults are generally good; while you can increase them, you may not have enough resources (memory, CPU) to back a larger queue.
This blog post from elastic.co explains the problem really well.
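A minimal sketch of that throttling idea, assuming Python with requests, a node at http://localhost:9200 and an already-built NDJSON bulk body (all placeholders): instead of enlarging the queue, back off and retry when the cluster answers 429.

import time
import requests

ES = "http://localhost:9200"   # assumption: your node address

def send_bulk_with_backoff(ndjson_body: str, max_retries: int = 5) -> dict:
    # POST a _bulk body, backing off exponentially whenever ES answers 429
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(
            f"{ES}/_bulk",
            data=ndjson_body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            timeout=30,
        )
        if resp.status_code != 429:      # 429 is the rejected-execution case
            resp.raise_for_status()
            return resp.json()           # note: individual items inside a 200 can still carry 429s
        time.sleep(delay)                # the cluster asked us to slow down
        delay *= 2
    raise RuntimeError("bulk request still rejected after retries")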
Reducing the batch size resolved my problem:
POST _reindex
{
  "source": {
    "index": "sourceIndex",
    "size": 100
  },
  "dest": {
    "index": "destIndex"
  }
}

Elastic gives inconsistent results under stress

Our ES is fairly slow; we have not optimized it (or the query) yet, but according to this link, request rejection from Elastic is a form of feedback asking the client to slow down and adapt the bulk size.
We built a form of back pressure where the size of a blocking bulk (a list of individual requests sent at the same time; we do not use MSearch yet) depends on how many requests were rejected in the previous bulk. We wait for the current bulk to finish before starting a new one. All rejected requests are re-injected into the request queue (as the data needed to reconstruct the query). For example, if our Elastic can handle 500 simultaneous requests and we send 600, some will be rejected and the next bulk size is reduced to 480 (20% off).
What we found is that ES returns different results for the previously rejected requests once they are retried. For example, it may return something like the expected result, but with an offset of 2. We also get missing results, where an address should have one result but has none because of this bug.
If the bulk size is below the threshold ES can handle, everything goes as expected and we get the expected results.
It doesn't look like a problem with the library (elastic4s).
Elastic configuration:
2 nodes with 5 shards each
Per node:
2 CPUs, 32 GB RAM, 16 GB heap. Everything else is default.
I couldn't find any information about this on the internet. Has anyone had this problem? What was the solution?
What we tried so far:
Thread.sleep between bulks as the link above suggests.
Removing cache on query level as well as removing it from the index.
Trying same index on a different (slower) hardware.
Verified that it's not a race-condition (in our code) problem.
Update:
What the query looks like:
Thread pool for search:
"search" : {
"type" : "fixed",
"min" : 4,
"max" : 4,
"queue_size" : 1000
},
2nd UPDATE:
We also tried setting a preference on our query (thinking it was a problem with shards): .preference(Preference.Primary), with no positive result (the results were even more random than before). Two consecutive runs with this setting give different "random" results, so it is not consistent.
The reason for the inconsistent results was that Elastic replies with success if at least one shard had a result. So if only one of our 5 shards succeeded, the request would still return a successful response with only 20% of the data.
As seen here and here, this is not a bug, it's a feature. Elastic prefers to return some (albeit inconsistent) result instead of returning nothing.
The solution is either to use only one shard or to treat more than 0 failed shards as a general request failure, using the following object that is present in every ES response:
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
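A minimal sketch of the second option, assuming Python with requests, a node at http://localhost:9200 and a placeholder index name: reject any response in which not every shard answered, and retry it instead of accepting the partial hits.

import requests

ES = "http://localhost:9200"   # assumption: your cluster endpoint

def search_all_shards_or_fail(index: str, query: dict) -> dict:
    # run a search and refuse responses where any shard failed to answer
    resp = requests.post(f"{ES}/{index}/_search", json=query, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    shards = body["_shards"]
    if shards["failed"] > 0 or shards["successful"] < shards["total"]:
        raise RuntimeError(f"partial result: only {shards['successful']}/{shards['total']} shards answered")
    return body

result = search_all_shards_or_fail("addresses", {"query": {"match_all": {}}})  # "addresses" is a placeholder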

Threadpool/Queue size limitation unsolved

I am using ES to do some data indexing on Windows. However, I keep running into the following errors. It seems to be a queue size or thread pool size problem, but I could not find any documentation that explains how to change the settings on Windows to solve it.
[2016-07-20 11:11:56,343][DEBUG][action.search ] [Adaptoid] [cpu-2015.09.23][2], node[1Qp4zwR_Q5GLX-VChDOc2Q], [P], v[42], s[STARTED], a[id=KznRm9A5S0OhTMZMoED0qA]: Failed to execute [org.elasticsearch.action.search.SearchRequest#444b07] lastShard [true]
RemoteTransportException[[Adaptoid][172.16.1.238:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
Does anyone have experience with this?
The problem is not with Elasticsearch itself but with how you are driving it. By throwing that exception, ES is telling you that you are sending more search requests than it can keep up with.
If you are indexing at the same time, the pressure from the indexing process (memory, CPU, segment merging) can affect the other operations ES is performing. So if you are also indexing, do it at a lower pace, since it is affecting the search operations.
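One way to act on that advice is to watch the search pool's rejected counter while you index and pause the ingest loop when it starts to climb. A minimal sketch, assuming Python with requests, a node at http://localhost:9200 and bulk_bodies standing in for your own prepared NDJSON payloads (all placeholders):

import time
import requests

ES = "http://localhost:9200"   # assumption: your node address

def search_rejections() -> int:
    # sum the search thread pool "rejected" counter across all nodes
    resp = requests.get(f"{ES}/_nodes/stats/thread_pool", timeout=10)
    resp.raise_for_status()
    return sum(
        node["thread_pool"]["search"]["rejected"]
        for node in resp.json()["nodes"].values()
    )

def index_with_backpressure(bulk_bodies):
    # bulk_bodies: an iterable of prepared NDJSON _bulk payloads (placeholder for your pipeline)
    baseline = search_rejections()
    for body in bulk_bodies:
        resp = requests.post(
            f"{ES}/_bulk",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            timeout=30,
        )
        resp.raise_for_status()
        if search_rejections() > baseline:   # searches started getting rejected
            time.sleep(5)                    # pause indexing so they can drain
            baseline = search_rejections()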

Resources