Threadpool/Queue size limitation unsolved - elasticsearch

I am using ES to index some data on Windows. However, I keep running into the following errors. It seems to be a queue size or thread pool size problem, but I could not find any documentation explaining which settings I need to change on Windows to solve it.
[2016-07-20 11:11:56,343][DEBUG][action.search ] [Adaptoid] [cpu-2015.09.23][2], node[1Qp4zwR_Q5GLX-VChDOc2Q], [P], v[42], s[STARTED], a[id=KznRm9A5S0OhTMZMoED0qA]: Failed to execute [org.elasticsearch.action.search.SearchRequest#444b07] lastShard [true]
RemoteTransportException[[Adaptoid][172.16.1.238:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
Has anyone had experience with this?

There is no problem with Elasticsearch itself, but with your usage pattern. By throwing that exception, ES is telling you that you are sending more search requests than it can keep up with.
If you are indexing at the same time, the pressure (memory, CPU, segment merging) from the indexing process can affect the other operations ES is performing. So, if you are also indexing, do it at a slower pace, as it is affecting the search operations.
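One common way to throttle on the client side is to back off and retry whenever ES rejects a request (HTTP 429). A minimal sketch, assuming a hypothetical `send_bulk` callable that raises `BulkRejectedError` on a rejection (both names are illustrative, not part of any ES client):

```python
import time


class BulkRejectedError(Exception):
    """Raised by the (hypothetical) send_bulk helper on an ES 429 rejection."""


def send_with_backoff(send_bulk, batch, max_retries=5, base_delay=1.0):
    """Retry a rejected request with exponential backoff.

    Sleeping between attempts gives ES time to drain its queue instead of
    hammering it with immediate retries.
    """
    for attempt in range(max_retries):
        try:
            return send_bulk(batch)
        except BulkRejectedError:
            time.sleep(base_delay * (2 ** attempt))
    raise BulkRejectedError("giving up after %d retries" % max_retries)
```

The same idea applies to search requests; the point is that the client, not ES, is the right place to slow down.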

Related

Increase queue capacity in ElasticSearch

Elastic version 7.8
I'm getting an error when running this code for thousands of records:
var bulkIndexResponse = await _client.BulkAsync(i => i
    .Index(indexName)
    .IndexMany(bases));

if (!bulkIndexResponse.IsValid)
{
    throw bulkIndexResponse.OriginalException;
}
It eventually crashes with the following error:
Invalid NEST response built from a successful (200) low level call on POST: /indexname/_bulk
# Invalid Bulk items:
operation[1159]: index returned 429 _index: indexname _type: _doc _id: _version: 0 error: Type:
es_rejected_execution_exception Reason: "Could not perform enrichment, enrich coordination queue at
capacity [1024/1024]"
I would like to know how this enrich coordination queue capacity can be increased to accommodate continuous calls of BulkAsync with around a thousand records on each call.
You can check which thread pool is filling up with /_cat/thread_pool?v and increase the queue size (as ninja said) in elasticsearch.yml on each node.
But increasing the queue size affects heap consumption, and that may in turn hurt performance.
When you get this error, there are usually two causes. First, you are sending bulk requests that are too large: try decreasing the bulk size to 500 documents or fewer. Second, you have a performance problem: try to find and fix it, or consider adding more nodes to your cluster.
Not sure which version you are on, but this enrich coordination queue seems to be the bulk queue, and you can increase the queue size (these settings are node-specific) by changing the elasticsearch.yml of that node.
Refer to the thread pools section of the ES docs for more info.
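The "decrease the bulk request" advice can be handled generically on the client side before the documents ever reach the ES client. A sketch (not tied to any particular client library) that splits a large document list into batches of at most 500:

```python
def chunked(items, batch_size=500):
    """Yield successive slices of at most batch_size items.

    Each yielded batch can then be sent as its own bulk request
    (e.g. one BulkAsync / _bulk call per batch).
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For the question above, that would mean calling BulkAsync once per batch instead of once for the whole record set.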
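As a concrete illustration, the per-node override would look like this in elasticsearch.yml (the exact setting name depends on your ES version; on 7.x the bulk/write pool is `thread_pool.write.queue_size`, and the value below is illustrative, not a recommendation):

```yaml
# elasticsearch.yml (per node; static setting, requires a node restart)
thread_pool.write.queue_size: 2000
```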

Adjusting pool size of akka.net cluster pool router

Here's the scenario I am looking to achieve using Akka.Net clustering:
A user submits a load generation task to be performed, to a coordinator actor. (The task has its parameters and a 'load factor' - so a factor of 10 would create 10 parallel tasks.)
The coordinator actor creates a cluster pool router with pool size equal to the load factor, and broadcasts a message to tell the routees to execute the task.
As expected, the required number of routees is created on the cluster nodes, and the load is generated.
Now, I need to enable a ramp-up and ramp-down capability. I am looking to do that by periodically adjusting the pool size: every x seconds the pool size is increased during a ramp-up, and decreased during a ramp-down.
The cluster pool router is created using the following to create an initial pool of 1 routee:
ClusterRouterPoolSettings settings = new ClusterRouterPoolSettings(1, 1000, false, "worker");
var props = Props.Create<TestActor>(() => new TestActor())
.WithRouter(new ClusterRouterPool(new BroadcastPool(100), settings));
var testActorRouter = Context.ActorOf(props, actorName);
To ramp up pool size by 5, the coordinator uses the following:
testActorRouter.Ask<Router>(new AdjustPoolSize(5));
However, this results in the 5 routees being added on the coordinator's node, instead of on the other cluster nodes. This is in spite of the pool settings specifying the role as "worker" (which the coordinator node doesn't have) and specifying local routees = false.
Is there a different way to adjust the pool size so it increases the cluster pool routees?
Thanks!

Cannot set thread_pool setting in Elastic Cloud user settings

Using ES v6.4.3
I'm getting a bunch of TransportService errors when writing a high volume of transactions. The exact error is:
StatusCodeError: 429 - {"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[instance-0000000002][10.44.0.71:19428][indices:data/write/bulk[s][p]]"}],"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7#35110df8 on EsThreadPoolExecutor[name = instance-0000000002/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#3fd60b4f[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 1705133]]"},"status":429}
The general consensus seems to be bumping the queue_size so requests don't get dropped. As you can see in the error, my queue_size is the default 200, and it filled up. (I know simply bumping the queue_size is not a magic solution, but it is exactly what I need in this case.)
So following this doc on how to change the elasticsearch.yml setting I try to add the queue_size bump here:
thread_pool.write.queue_size: 2000
And when I save I get this error:
'thread_pool.write.queue_size': is not allowed
I understand the user override settings blacklist certain settings, so if my problem is truly that thread_pool.write.queue_size is blacklisted, how can I access my elasticsearch.yml file to change it?
Thank you!

elasticsearch es_rejected_execution_exception

I'm trying to index a 12mb log file which has 50,000 logs.
After Indexing around 30,000 logs, I'm getting the following error
[2018-04-17T05:52:48,254][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7#560f63a9 on EsThreadPoolExecutor[name = EC2AMAZ-1763048/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#7d6ae98b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 3834]]"})
I've gone through the documentation and the Elasticsearch forum, which suggested increasing the Elasticsearch bulk queue size. I tried using curl, but I was not able to do it.
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"threadpool.bulk.queue_size" : 100}}'
Is increasing the queue size a good option? I can't add more hardware because I have so little data.
Is the error I'm facing due to the queue size, or something else? If it's the queue size, how do I update it in elasticsearch.yml, and do I need to restart ES after updating elasticsearch.yml?
Please let me know. Thanks for your time
Once Elasticsearch can't keep up with incoming indexing requests, it enqueues them in the bulk thread pool queue and starts rejecting them once the number of queued requests exceeds threadpool.bulk.queue_size.
It's a good idea to consider throttling your indexing. The thread pool size defaults are generally good; while you can increase them, you may not have enough resources (memory, CPU) available.
This blog post from elastic.co explains the problem really well.
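Note that thread pool settings are static node settings, so they cannot be changed through the cluster settings API (which is why the curl call above fails); they belong in elasticsearch.yml, and yes, the node must be restarted afterwards. Depending on your version the pool is named bulk or write; a sketch with an illustrative value:

```yaml
# elasticsearch.yml — static setting; restart the node after changing it
thread_pool.bulk.queue_size: 500
```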
In my case, reducing the batch size resolved the problem:
POST _reindex
{
  "source": {
    "index": "sourceIndex",
    "size": 100
  },
  "dest": {
    "index": "destIndex"
  }
}

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
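For context, EsThreadPoolExecutor rejects a task roughly when every pool thread is busy and the backing queue is full, which matches the numbers in the error above (pool size = 16, active threads = 16, queue capacity = 50). A simplified model of that condition:

```python
def would_reject(active_threads, pool_size, queued, queue_capacity):
    """A new task is rejected when no thread is free and the queue is full.

    Simplified model of EsThreadPoolExecutor's abort policy; the real
    executor has more moving parts, but this captures the rejection case.
    """
    return active_threads >= pool_size and queued >= queue_capacity


# The figures from the error message above: all 16 threads busy, queue full.
would_reject(active_threads=16, pool_size=16, queued=50, queue_capacity=50)
```

So the question is really why bulk requests are arriving faster than 16 threads plus a 50-slot queue can absorb them.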
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 i2.4xlarge (16 cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replica.
Those two nodes are being fed by our aggregation server which has 20 separate workers. Each worker makes bulk indexing request to our ES cluster with 50 items to index. Each item is between 1000-4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now, the issue is that this error only started occurring when we introduced the second node. Before, with one machine, we got a consistent indexing throughput of 20k requests per second. With two machines, once it hits the 10k mark (~20% CPU usage), we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator which produces random documents to be indexed. These documents are generally the same size but have random parameters. We use it for stress testing and stability checks. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. The interesting part: with the generator we are able to index around 40-45k items per second (~80% CPU usage) without getting this error. So it is really puzzling why we get the error with real traffic. Has anyone seen this error or know what could be causing it?
