Increase queue capacity in ElasticSearch - elasticsearch

Elastic version 7.8
I'm getting an error when running this code for thousands of records:
var bulkIndexResponse = await _client.BulkAsync(i => i
.Index(indexName)
.IndexMany(bases));
if (!bulkIndexResponse.IsValid)
{
throw bulkIndexResponse.OriginalException;
}
It eventually crashes with the following error:
Invalid NEST response built from a successful (200) low level call on POST: /indexname/_bulk
# Invalid Bulk items:
operation[1159]: index returned 429 _index: indexname _type: _doc _id: _version: 0 error: Type:
es_rejected_execution_exception Reason: "Could not perform enrichment, enrich coordination queue at
capacity [1024/1024]"
I would like to know how this enrich coordination queue capacity can be increased to accommodate continuous calls of BulkAsync with around a thousand records on each call.

you can check what thread_pool is getting full by /_cat/thread_pool?v and increase the queue (as ninja said) in elasticsearch.yml for each node.
but increasing queue size affect heap consumption and subsequently maybe it would affect performance.
when you get this error it may have two reason. first you are sending large bulk request. try to decrease the bulk request under 500 or lower. second you have some performance issue. try to find and solve the issue. maybe you should add more node to your cluster.

Not sure what version you are, but this enrich coordination queue seems to be the bulk queue and you can increase the queue size(these are node specific) by changing the elasticsearch.yml of that node.
Refer threadpools in ES for more info.

Related

circuit_breaking_exception in kibanaa

{
statusCode: 429,
error: "Too Many Requests",
message: "[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [2047736072/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2047736072/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=854525953/814.9mb, in_flight_requests=0/0b, accounting=79344850/75.6mb], with { bytes_wanted=2047736072 & bytes_limit=2040109465 & durability="PERMANENT" }"
}
circuit breakers are used to prevent the elasticsearch process to die and there are various types of circuit breakers and by looking at your logs its clear it's breaking the parent circuit breaker and to solve this, either increase the Elasticsearch JVM heap size(recommended) or increase the circuit limit.
As Elasticsearch Ninja alluded to, this error is generally produced from Elasticsearch, despite Kibana being the one displaying the error. Adjusting the heap size for Elasticsearch should generally resolve this error.
This should be done with the Xms and Xmx options of the jvm.options file for Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html#heap-size-settings

Elasticsearch 7.x circuit breaker - data too large - troubleshoot

The problem:
Since the upgrading from ES-5.4 to ES-7.2 I started getting "data too large" errors, when trying to write concurrent bulk request (or/and search requests) from my multi-threaded Java application (using elasticsearch-rest-high-level-client-7.2.0.jar java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size = 20000
thread_pool.write.queue_size = 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
"error": {
"root_cause": [
{
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
"bytes_wanted": 3144831050,
"bytes_limit": 3060164198,
"durability": "PERMANENT"
}
],
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
"bytes_wanted": 3144831050,
"bytes_limit": 3060164198,
"durability": "PERMANENT"
},
"status": 429
}
Thoughts:
I'm having hard time to pin point the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb vm), the problem become very visible, so, one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
I would like to understand what scenarios could have led to this error?
and what action can I take in order to handle it properly?
(change circuit-breaker values, change es.yml configuration, change/limit my ES requests)
The reason is that the heap of the node is pretty full and being caught by the circuit breaker is nice because it prevents the nodes from running into OOMs, going stale and crash...
Elasticsearch 6.2.0 introduced the circuit breaker and improved it in 7.0.0. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.
I see 3 solutions so far:
Increase heap size if possible
Reduce the size of your bulk requests if feasible
Scale-out your cluster as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes will help the cluster to distribute the shards and requests among more nodes, what leads to a lower AVG heap usage on all nodes.
As an UGLY workaround (not solving the issue) one could increase the limit after reading and understanding the implications:
So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we are suddenly getting those errors?
the circuit breaker mechanism exists since the very first versions.
we started experience issues around it when moving from version 5.4 to 7.2
in version 7.2 ES introduced a new way for calculating circuit-break: Circuit-break based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767)
In our internal upgrade of ES to version 7.2, we changed the jdk from 8 to 11.
also as part of our internal upgrade we changed the jvm.options default configuration, switching the official recommended CMS GC with the G1GC GC which have a fairly new support by elasticsearch.
considering all the above, I found this bug that was fixed in version 7.4 regarding the use of circuit-breaker together with the G1GC GC: https://github.com/elastic/elasticsearch/pull/46169
How to fix:
change the configuration back to CMS GC.
or, take the fix. the fix for the bug is just a configuration change that can be easily changed and tested in your deployment.

elasticsearch es_rejected_execution_exception

I'm trying to index a 12mb log file which has 50,000 logs.
After Indexing around 30,000 logs, I'm getting the following error
[2018-04-17T05:52:48,254][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7#560f63a9 on EsThreadPoolExecutor[name = EC2AMAZ-1763048/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#7d6ae98b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 3834]]"})
However, I've gone through the documentation and elasticsearch forum which suggested me to increase the elasticsearch bulk queue size. I tried using curl but I'm not able to do that.
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"threadpool.bulk.queue_size" : 100}}'
is increasing the queue size good option? I can't increase the hardware because I have fewer data.
The error I'm facing is due to the problem with the queue size or something else? If with queue size How to update the queue size in elasticsearch.yml and do I need to restart es after updating in elasticsearch.yml?
Please let me know. Thanks for your time
Once your indexing cant keep up with indexing requests - elasticsearch enqueues them in threadpool.bulk.queue and starts rejecting if the # of requests in queue exceeds threadpool.bulk.queue_size
Its good idea to consider throttling your indexing . Threadpool size defaults are generally good ; While you can increase them , you may not have enough resources ( memory, CPU ) available .
This blogpost from elastic.co explains the problem really well .
by reducing the batch size it resolved my problem.
POST _reindex
{
"source":{
"index":"sourceIndex",
"size": 100
},
"dest":{
"index":"destIndex"}
}

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 I2.4XL (16 Cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replication.
Those two nodes are being fed by our aggregation server which has 20 separate workers. Each worker makes bulk indexing request to our ES cluster with 50 items to index. Each item is between 1000-4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now the issue is this error only started occurring when we introduced the second node. Before when we had one machine, we got consistent indexing throughput of 20k request per second. Now with two machine, once it hits the 10k mark (~20% CPU usage)
we start getting some of the errors outlined above.
But here is the interesting thing which I have noticed. We have a mock item generator which generates a random document to be indexed. Generally these documents are of the same size, but have random parameters. We use this to do the stress test and check the stability. This mock item generator sends requests to aggregation server which in turn passes them to Elasticsearch. The interesting thing is, we are able to index around 40-45k (# ~80% CPU usage) items per second without getting this error. So it seems really interesting as to why we get this error. Has anyone seen this error or know what could be causing it?

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producers get some data from and external source and writes em inside Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The consumer is based on Spark Streaming and every 5 seconds fetches new messages from Kafka and writes them to ElasticSearch. The problem is I'm not able to write to ES because I get a lot of errors like the one below :
ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]:
Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all
entries [3/26560] (maybe ES was overloaded?). Bailing out... at
org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57)
~[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[na:na] at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[na:na] at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.scheduler.Task.run(Task.scala:58)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
[spark-core_2.10-1.2.1.jar:1.2.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_65] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages, I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code :
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
I found, after many retries, a way to write to ElasticSearch without getting any error. Basically passing the parameter "es.batch.size.entries" -> "1" to the saveToES method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error considering that I would expect an error message if I'm trying to write more stuff than the allowed max batch size, not less.
Moreover I've noticed that actually I was writing to ES but not all my messages, I was losing between 1 and 3 messages per batch.
When I pushed dataframe to ES on Spark, I had the same error message. Even with "es.batch.size.entries" -> "1" configuration,I had the same error.
Once I increased thread pool in ES, I could figure out this issue.
for example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
Like it was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When written to elastic as part of the same batch produces the error above.
Possible solutions:
Generate unique id for each record. Depending on your use case it can be done in a few different ways. As example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as id when writing to elastic.
Perform de-duplication of records based on id - select only one record with particular id and discard other duplicates. Depending on your use case, this could be the most current record (based on time stamp field), most complete (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id based on your de-duplication logic. This result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit the performance by writing one record at a time.
One of the possibility is the cluster/shard status being RED. Please address this issue which may be due to unassigned replicas. Once status turned GREEN the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
Result is Elasticsearch receiving multiple updates for a single Document at once - from multiple sources / through multiple nodes / containing different data
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
if you are using a custom routing field (not _id), then according to
the documentation the uniqueness of the documents is not guaranteed.
This might cause issues while updating from spark.
If you are using the standard _id, the uniqueness will be preserved, however you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing

Resources