Rejected Execution of org.elasticsearch.transport.TransportService Error - elasticsearch

I am running Elasticsearch and trying to load data with the following command:
curl -XPOST 'http://localhost:9200/_bulk?pretty' --data-binary @data_.json
But I get the following error:
"create" : {
"_index" : "appname-docm",
"_type" : "HYD",
"_id" : "AVVYfsk7M5xgvmX8VR_B",
"status" : 429,
"error" : {
"type" : "es_rejected_execution_exception",
"reason" : "rejected execution of org.elasticsearch.transport.TransportService$4@c8998f4 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@553aee29[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 0]]"
}
}
},
I tried increasing the queue size with:
threadpool.search.queue_size: 100000
But I still get the same error.

You are getting this error because the queue for bulk operations is full.
An Elasticsearch node has many thread pools: generic, search, index, suggest, bulk, and so on. The pools are configured separately, so raising threadpool.search.queue_size does not affect bulk requests.
Try adjusting the queue size of the bulk thread pool instead:
thread_pool.bulk.queue_size: 100
(Before Elasticsearch 5.x the setting is spelled threadpool.bulk.queue_size.)
Or reduce the number of bulk operations that you send at once.
For more details see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
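
As a rough illustration of "send fewer bulk operations at once", here is a minimal Python sketch. The function names chunk_bulk_lines and rejected_positions are made up for this example, the sample response is a hand-written stub that keeps only the status field, and the HTTP call itself is left out. It assumes every action line is followed by exactly one source line.

```python
from typing import Dict, Iterator, List


def chunk_bulk_lines(lines: List[str], actions_per_chunk: int) -> Iterator[str]:
    """Yield NDJSON bodies with at most actions_per_chunk action/source pairs.

    Assumes every action line is followed by one source line, which holds
    for index/create actions but not for delete.
    """
    step = actions_per_chunk * 2  # two lines per action
    for start in range(0, len(lines), step):
        # The bulk API expects a newline after the last line of the body.
        yield "\n".join(lines[start:start + step]) + "\n"


def rejected_positions(bulk_response: Dict) -> List[int]:
    """Return the positions of items that came back with status 429."""
    positions = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a one-key dict such as {"create": {...}}.
        result = next(iter(item.values()))
        if result.get("status") == 429:
            positions.append(position)
    return positions


# Hypothetical sample data: three index actions, split two per request.
lines = [
    '{"index": {"_index": "appname-docm", "_type": "HYD"}}', '{"field": 1}',
    '{"index": {"_index": "appname-docm", "_type": "HYD"}}', '{"field": 2}',
    '{"index": {"_index": "appname-docm", "_type": "HYD"}}', '{"field": 3}',
]
chunks = list(chunk_bulk_lines(lines, actions_per_chunk=2))

# Stub of a bulk response in which the second item was rejected.
response = {"errors": True, "items": [{"create": {"status": 201}},
                                      {"create": {"status": 429}}]}
print(len(chunks), rejected_positions(response))
```

Only the rejected positions would need to be sent again, ideally after a short pause so the bulk queue has time to drain.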

Try the following:
curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 500 } }'
Edit: to check the current settings:
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true"

Related

AWS Elasticsearch showing cluster health yellow, how should I fix it?

I am using AWS Elasticsearch. My cluster status has been yellow for the past 48 hours. Following the recommendation provided here:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html
I scaled the cluster to 15 data nodes and 3 master nodes.
Even though each node has around 60 GB of free space, the cluster is still yellow.
When I run GET /_cluster/allocation/explain, I get:
"index" : "***********************************",
"shard" : 4,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2020-10-09T16:19:41.803Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [f6hB7EYOSR-GiJLFXBn01w]: failed recovery, failure RecoveryFailedException[[******************************][4]: Recovery failed from {70c36ff18063566c3a6089f3d696440a}{*******************}{*************}{di}{di_number=39, zone=us-east-1d, distributed_snapshot_deletion_enabled=true} into {**********************}{****************}{*************}{*****}{*******}{di}{distributed_snapshot_deletion_enabled=true, zone=us-east-1d, di_number=39}]; nested: RemoteTransportException[[****************][*********][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1554462628/1.4gb], which is larger than the limit of [1513521152/1.4gb], real usage: [1554460888/1.4gb], new bytes reserved: [1740/1.6kb], usages [request=0/0b, fielddata=621718551/592.9mb, in_flight_requests=73378/71.6kb, accounting=35794764/34.1mb]]; ",
"last_allocation_status" : "no_attempt"
}
How can I resolve this?

Elastic Cloud Circuit_breaking_exception

We recently upgraded our Elastic Cloud deployment from 6.8.5 to 7.9.
Since the upgrade we see the following error from time to time:
{
"error" : {
"root_cause" : [
{
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be [416906520/397.5mb], which is larger than the limit of [408420352/389.5mb], real usage: [416906520/397.5mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=32399/31.6kb, in_flight_requests=0/0b, model_inference=0/0b, accounting=4714192/4.4mb]",
"bytes_wanted" : 416906520,
"bytes_limit" : 408420352,
"durability" : "PERMANENT"
}
],
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be [416906520/397.5mb], which is larger than the limit of [408420352/389.5mb], real usage: [416906520/397.5mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=32399/31.6kb, in_flight_requests=0/0b, model_inference=0/0b, accounting=4714192/4.4mb]",
"bytes_wanted" : 416906520,
"bytes_limit" : 408420352,
"durability" : "PERMANENT"
},
"status" : 429
}
This deployment consists of only one node with 1G memory. We would like to know the cause of this error. Is it due to the upgrade?
Thank you.

First, the circuit breaker is a safeguard that stops a request from pushing your cluster past what it can handle: it kills a single request rather than (potentially) the entire cluster. Also note that this HTTP request alone isn't too large. It trips the parent circuit breaker, which means this request on top of everything else would be too much.
The initial circuit breaker was already added in 6.2.0, but was tightened further in 7.0.0. I assume that's why you are seeing this (more frequently) now.
You could change indices.breaker.total.limit, but this isn't a magic switch to get more out of your cluster. 1 GB of memory might just not be enough for what you are trying to do.
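
To make "on top of everything else" concrete, the reason string from the error can be taken apart. This is a minimal Python sketch: the function name parse_breaker_reason is hypothetical, and the regular expressions assume the 7.x message layout shown in the question.

```python
import re


def parse_breaker_reason(reason: str) -> dict:
    """Pull the byte counts out of a '[parent] Data too large' reason string."""
    wanted = int(re.search(r"would be \[(\d+)/", reason).group(1))
    limit = int(re.search(r"limit of \[(\d+)/", reason).group(1))
    # Everything after "usages [" is a list of name=bytes/human-readable pairs.
    usages_text = reason.split("usages [", 1)[1]
    usages = {name: int(count)
              for name, count in re.findall(r"(\w+)=(\d+)/", usages_text)}
    return {"wanted": wanted, "limit": limit, "usages": usages}


# Reason string copied from the error in the question.
reason = (
    "[parent] Data too large, data for [<http_request>] would be "
    "[416906520/397.5mb], which is larger than the limit of "
    "[408420352/389.5mb], real usage: [416906520/397.5mb], "
    "new bytes reserved: [0/0b], usages [request=0/0b, "
    "fielddata=32399/31.6kb, in_flight_requests=0/0b, "
    "model_inference=0/0b, accounting=4714192/4.4mb]"
)
info = parse_breaker_reason(reason)
tracked = sum(info["usages"].values())
print("bytes over the limit:", info["wanted"] - info["limit"])
print("bytes tracked by the child breakers:", tracked)
```

The child breakers account for only about 4.5 MB, while real usage is 397.5 MB, so nearly all of the heap is held by things other than this request. Assuming the 7.x default of 95% of the heap for indices.breaker.total.limit, the 389.5 MB limit corresponds to a heap of roughly 410 MB.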

Elasticsearch indexing is very slow

I have a Titan database with a Cassandra storage backend, and I am trying to create a mixed index on two property keys.
I am able to register the index using the following commands:
graph=TitanFactory.open(config);
graph.tx().rollback()
m = graph.openManagement();
m.buildIndex("titleBodyMixed", Vertex.class).addKey(m.getPropertyKey("title")).addKey(m.getPropertyKey("body")).buildMixedIndex("search");
m.commit();
m.awaitGraphIndexStatus(graph, 'titleBodyMixed').status(SchemaStatus.REGISTERED).timeout(3, java.time.temporal.ChronoUnit.MINUTES).call();
When I check, the index is successfully registered after a few seconds. As the next step, I try to reindex the database using the following commands:
m = graph.openManagement();
m.updateIndex(m.getGraphIndex('titleBodyMixed'), SchemaAction.REINDEX).get();
However, the updateIndex command has not finished after 12 hours.
I have about 300k entries in the database, and each entry has one title and one body to index.
How can I speed up the indexing?
When I run top, I see that the CPU is not saturated by the indexing processes.
My Titan configuration is as follows:
config = new BaseConfiguration();
config.setProperty("storage.backend","cassandra");
config.setProperty("storage.hostname", "127.0.0.1");
config.setProperty("storage.cassandra.keyspace", "smartgraph");
config.setProperty("index.search.elasticsearch.interface", "NODE");
config.setProperty("index.search.backend", "elasticsearch");
The Elasticsearch service reports the following:
curl -X GET 'http://localhost:9200'
{
"status" : 200,
"name" : "Ms. Marvel",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.7.2",
"build_hash" : "e43676b1385b8125d647f593f7202acbd816e8ec",
"build_timestamp" : "2015-09-14T09:49:53Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}

The reindexing process will not start until all sessions are closed. You most probably have sessions open against the database, so the reindex job is never triggered.
With this Gremlin script, you could close all sessions. You should see indexing take place afterwards.
Will that help?

Courier Fetch: shards failed

Why do I get these warnings after adding more data to my Elasticsearch?
The warning is different every time I browse the dashboard, for example:
"Courier Fetch: 30 of 60 shards failed."
More details:
It's a single node running on CentOS 7.1.
/etc/elasticsearch/elasticsearch.yml
index.number_of_shards: 3
index.number_of_replicas: 1
bootstrap.mlockall: true
threadpool.bulk.queue_size: 1000
indices.fielddata.cache.size: 50%
threadpool.index.queue_size: 400
index.refresh_interval: 30s
index.number_of_shards: 5
index.number_of_replicas: 1
/usr/share/elasticsearch/bin/elasticsearch.in.sh
ES_HEAP_SIZE=3G
#I use this Garbage Collector instead of the default one.
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
cluster status
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 61,
"active_shards" : 61,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 61
}
cluster details
{
"cluster_name" : "my_cluster",
"nodes" : {
"some weird number" : {
"name" : "ES 1",
"transport_address" : "inet[localhost/127.0.0.1:9300]",
"host" : "some host",
"ip" : "150.244.58.112",
"version" : "1.4.4",
"build" : "c88f77f",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"process" : {
"refresh_interval_in_millis" : 1000,
"id" : 7854,
"max_file_descriptors" : 65535,
"mlockall" : false
}
}
}
}
I'm curious about "mlockall" : false, because in the yml I did set bootstrap.mlockall: true
logs
lots of lines like:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@a9a34f5

For me, tuning the search thread pool's queue_size solved the issue. I tried a number of other things, and this is the one that worked.
I added this to my elasticsearch.yml:
threadpool.search.queue_size: 10000
and then restarted Elasticsearch.
Reasoning... (from the docs)
A node holds several thread pools in order to improve how threads
memory consumption are managed within a node. Many of these pools also
have queues associated with them, which allow pending requests to be
held instead of discarded.
and for search in particular...
For count/search operations. Defaults to fixed with a size of int((#
of available_processors * 3) / 2) + 1, queue_size of 1000.
For more information you can refer to the elasticsearch docs here...
I had trouble finding this information so I hope this helps others!
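
The formula quoted from the docs can be worked through for a few core counts. This is a small Python sketch with made-up function names. It takes the formula and the default queue_size of 1000 from the quote above, and assumes that each search puts one task per shard onto the pool.

```python
def search_pool_size(available_processors: int) -> int:
    """Default search pool size per the quoted formula: int((n * 3) / 2) + 1."""
    return int((available_processors * 3) / 2) + 1


def node_capacity(available_processors: int, queue_size: int = 1000) -> int:
    """Tasks a node can hold (running plus queued) before it rejects more."""
    return search_pool_size(available_processors) + queue_size


def searches_to_overflow(shards_per_search: int, capacity: int) -> int:
    """Smallest number of simultaneous searches whose shard tasks exceed capacity."""
    return capacity // shards_per_search + 1


for cores in (2, 4, 8):
    print(cores, search_pool_size(cores), node_capacity(cores))

# Hypothetical case: a 4-core node and searches that touch all 61 shards.
print(searches_to_overflow(61, node_capacity(4)))
```

Under those assumptions, a 4-core node holds 1007 shard-level tasks, so 17 simultaneous searches across all 61 shards (1037 tasks) would already cause rejections. A dashboard with many panels can reach that quickly, which fits the changing "shards failed" counts in the question.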

I got this error when my query was missing a closing quote:
field:"value
In my Elasticsearch logs I see these exceptions:
Caused by: org.elasticsearch.index.query.QueryShardException:
Failed to parse query [field:"value]
...
Caused by: org.apache.lucene.queryparser.classic.ParseException:
Cannot parse 'field:"value': Lexical error at line 1, column 13.
Encountered: <EOF> after : "\"value"

On Elasticsearch 5.4, thread_pool has an underscore in it:
thread_pool.search.queue_size: 10000
See the Elasticsearch Thread Pool module documentation.

This is likely an indication that there's a problem with your cluster's health. Without knowing more about your cluster, there's not much more that can be said.

I agree with @Philip's opinion, but a restart is not necessary, at least on Elasticsearch >= 1.5.2, because you can set threadpool.search.queue_size dynamically:
curl -XPUT http://your_es:9200/_cluster/settings -d '
{
"transient":{
"threadpool.search.queue_size":10000
}
}'

From Elasticsearch version 5 onwards, it's not possible to update thread_pool.search.queue_size through the _cluster/settings API. In my case, updating the node's elasticsearch.yml file is not an option either, because if a node fails, the auto-scaling code brings up a replacement node with the default yml settings.
I have a cluster with 3 nodes and 400 active primary shards, with 7 active threads and a queue size of 1000. Increasing the number of nodes to 5 with a similar config resolved the issue, as queries are now distributed horizontally across more nodes.

This will not work on Elasticsearch 5.6:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[colmbmiscxx.xx][172.29.xx.xx:9300][cluster:admin/settings/update]"
}
],
"type": "illegal_argument_exception",
"reason": "transient setting [threadpool.search.queue_size], not dynamically updateable"
},
"status": 400
}
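
The answers in this thread differ mainly because the setting changed between major versions. Here is a short Python sketch of what the thread reports. The helper name queue_size_setting is hypothetical, and the boundary at version 5 is taken from the answers above rather than checked against every release.

```python
from typing import Tuple


def queue_size_setting(pool: str, major_version: int) -> Tuple[str, bool]:
    """Return (setting key, can it be changed via _cluster/settings?)."""
    if major_version >= 5:
        # Reported above: underscore in the name, and the transient update
        # is rejected as "not dynamically updateable".
        return f"thread_pool.{pool}.queue_size", False
    # Reported above for 1.x/2.x: no underscore, dynamic update accepted.
    return f"threadpool.{pool}.queue_size", True


for version in (2, 5):
    key, dynamic = queue_size_setting("search", version)
    where = "_cluster/settings" if dynamic else "elasticsearch.yml plus a node restart"
    print(version, key, where)
```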

ElasticSearch UNASSIGNED indices fix without data loss

For whatever reason, a bunch of indices became UNASSIGNED. I'm looking for a way to assign them to a cluster node without losing any data.
I tried using the following API call, but it results in data loss, unfortunately (due to allow_primary):
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
"commands" : [ {
"allocate" : {
"index" : "index-name",
"shard" : "0",
"allow_primary" : true,
"node" : "node-name"
}
}
]
}'
I also keep getting the following entries in elasticsearch.log:
[2015-03-16 11:51:12,181][DEBUG][action.search.type ] [cluster node] All shards failed for phase: [query_fetch]
[2015-03-16 11:51:12,450][DEBUG][action.search.type ] [cluster node] All shards failed for phase: [query_fetch]
[2015-03-16 11:51:19,349][DEBUG][action.bulk ] [cluster node] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-03-16 11:51:20,057][DEBUG][action.bulk ] [cluster node] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
Any help would be appreciated.
