Performance Issue: rejected execution of org.elasticsearch.ingest.PipelineExecutionService - elasticsearch

I've struggled to transfer 500 million documents, shipped from Windows IIS logs, from Kafka to Elasticsearch. At the beginning of the shipping process everything was fine: from the Kafka-manager dashboard I could see an outbound rate of about 1 million documents per minute.
After one week the rate had dropped to about 200K per minute, so I suspected a problem. When I opened the Elasticsearch log file, I found numerous ERRORs like the one below:
[ERROR][o.e.a.b.TransportBulkAction] [***-node-2] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$..... on EsThreadPoolExecutor
At first I thought it was a thread pool deficiency, but tuning the write thread pool is discouraged on the Elasticsearch forum.
Next I suspected ingest-geoip, because the error statement says "ingest.PipelineExecution...", so I simplified the geoip filter in my Logstash configuration, i.e. turned geoip off.
I also tried reducing the number of pipeline workers and the batch size in the Logstash config.
Everything failed... I see no way of overcoming this error.
Help Genius!

From the log you pasted it looks like the queue capacity is 200, but there are 203 queued tasks. I guess that either indexing is slow because the ingest pipelines take too long, or there is a burst of indexing data that puts pressure on the queue. Another possibility is that you are not rolling over the index: when an index gets too big, merges become bigger and longer and indexing performance decreases.
I would start by increasing the queue capacity to 2000, monitoring the queue size, and checking whether you get momentary or sustained bursts of incoming data.
Another thing to do is monitor the indexing latency and check whether the ingest pipelines are the bottleneck by looking at their timings. You can try disabling them for a short time (if that is acceptable) and see whether that relaxes the queue and the errors in the log.
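For reference, on versions where the write queue default is 200, the capacity suggested above can be raised statically in elasticsearch.yml (this requires a node restart, and 2000 is just a starting point to monitor, not a recommendation):

```yaml
# elasticsearch.yml -- raise the write thread pool queue from its default
thread_pool.write.queue_size: 2000
```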

Related

Why does Elasticsearch reject almost all queries when the thread pool and queue are full, instead of answering as many as possible and rejecting only the remainder?

We have a single-node Elasticsearch for an e-commerce website serving ~500 queries per second.
This image shows our Elasticsearch node metrics over a period of 3 hours:
In the image, when the queue count reaches 1000 (the queue size), the query rate decreases significantly. It seems Elasticsearch panics when both the thread pool and the queue are full and starts rejecting most queries. The intended behavior, I would think, is to respond to as many queries as possible and reject only those beyond the real capacity. Is this behavior normal, or should we change our configs?
From the Elasticsearch docs on thread pools and queues:
A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.
I think this is normal behavior. It seems that at some point you get resource-killer queries, your thread pool fills up, and after that the queue fills up too. When the thread pool is full, it means the system is busy processing existing queries and there is no room for new ones.
I recommend checking the tasks and queries:
curl -s [master-ip]:9200/_cat/tasks?v
Get the delayed search task_ID from the command above and use it in the command below:
curl -s [master-ip]:9200/_tasks/[task_ID]?pretty

Elasticsearch drops too many requests -- would a buffer improve things?

We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.
From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.
That's not a satisfactory long-term solution though, and I was looking for alternatives. I have noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty and I attributed that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.
That gave me the idea that I could write a buffer that receives requests from the workers, batches them in groups of 200-300 requests and then sends the bulk request to Elasticsearch for one group only.
Does such a thing already exist, and does it sound like a good idea?
First of all, to troubleshoot the issue or find the root cause, it's important to understand what happens behind the scenes when you send an index request to Elasticsearch.
Elasticsearch has several thread pools, but for indexing requests (single and bulk) the write thread pool is used. Please check this against your Elasticsearch version, as Elastic keeps changing the thread pools (earlier there were separate thread pools for single and bulk requests, with different queue capacities).
In the latest ES version (7.10), the write thread pool's queue capacity was increased significantly, from 200 (in earlier releases) to 10000. There are a few likely reasons for this:
Elasticsearch now prefers to buffer more indexing requests instead of rejecting them.
Although a bigger queue means more latency, it's a trade-off, and it reduces data loss if the client doesn't have a retry mechanism.
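If the client lacks a retry mechanism, rejected bulk requests are simply lost. A minimal client-side sketch of retrying with exponential backoff (Python; `send` stands in for whatever function performs the bulk request, and the delays are illustrative):

```python
import random
import time

def retry_with_backoff(send, max_retries=5, base_delay=0.5):
    """Call `send()` and retry on failure with exponential backoff.

    `send` is a zero-argument callable that performs one bulk request
    and raises on a rejection (e.g. HTTP 429).
    """
    for attempt in range(max_retries):
        try:
            return send()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # 0.5s, 1s, 2s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In a real client you would catch only the rejection/429 case and let other errors propagate immediately.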
I am sure you have not yet moved to ES 7.9, where the capacity was increased, but you can increase the size of this queue gradually and allocate more processors (if you have spare capacity) through the config change shown in this official example. This is a hotly debated topic, and many people consider it a band-aid rather than a proper fix, but now that Elastic themselves increased the queue size you can try it too, and if your increased traffic comes in short bursts it makes even more sense.
Another critical thing is to find the root cause of why your ES nodes queue up so many requests. It can be legitimate, e.g. indexing traffic grew and the infrastructure reached its limit; if it isn't, have a look at my short tips for improving one-time and overall indexing performance. Implementing them will give you a better indexing rate, which will reduce the pressure on the write thread pool queue.
Edit: As mentioned by @Val in the comments, if you are indexing docs one by one, moving to the bulk index API will give you the biggest boost.
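A rough sketch of the batching idea from the question, using only the Python standard library: group documents into fixed-size chunks and build the newline-delimited body the _bulk API expects. The HTTP call itself is omitted since it needs a live cluster, and the index name is a placeholder:

```python
import json
from itertools import islice

def chunks(docs, size):
    """Yield lists of at most `size` documents."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch

def bulk_body(batch, index):
    """Build the newline-delimited JSON body expected by the _bulk API:
    an action line followed by a source line for each document."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline

# Each body would then be POSTed to http://<node>:9200/_bulk with
# Content-Type: application/x-ndjson (HTTP call omitted here).
```

With batches of 200-300 documents, as proposed in the question, a handful of bulk requests per second replaces hundreds of single-document requests, which is exactly what relaxes the write thread pool queue.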

Is there a way to find out if load on Elastic stack is growing?

I have just started learning the Elastic stack and I already have to diagnose a production issue. Our setup occasionally has problems pulling messages from ActiveMQ into Elasticsearch via Logstash; the lag can reach 1-3 hours.
One suspicion is that load went up after the latest release of our application.
Is there a way to find the total size of stored messages, grouped by month? Not just their number but their total size: maybe the documents' size went up rather than their count.
Start with setting up a production monitoring instance to provide detailed statistics on your cluster: https://www.elastic.co/guide/en/elastic-stack-overview/7.1/monitoring-production.html
This will allow you to get at those metrics like messages/month, average document size, index performance, buffer load, etc. A bit more detail on internal performance is available with https://visualvm.github.io/
While putting that piece together, you can also tweak Logstash performance e.g.
Tune Logstash worker settings:
Begin by scaling up the number of pipeline workers by using the -w flag. This will increase the number of threads available for filters and outputs. It is safe to scale this up to a multiple of CPU cores, if need be, as the threads can become idle on I/O.
You may also tune the output batch size. For many outputs, such as the Elasticsearch output, this setting will correspond to the size of I/O operations. In the case of the Elasticsearch output, this setting corresponds to the batch size.
From https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html
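For illustration, the two knobs from the quote above map roughly to logstash.yml in newer Logstash versions (the values here are examples to tune against your own CPU count, not recommendations):

```yaml
# logstash.yml
pipeline.workers: 8       # roughly the CPU core count, or a multiple of it
pipeline.batch.size: 250  # events per worker batch sent to outputs
```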

ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) setup in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, that log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day to day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of very high traffic during events: traffic increases by about 2000%. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we subsequently scaled down too quickly, which meant we lost shards and corrupted our indexes.
I've been thinking of setting auto_expand_replicas to "1-all" so that each node has a copy of all the data and we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently keep only about 2 weeks of log data, which works out to around 50GB in all.
I've also seen people mention using a separate autoscaling group of non-data nodes to handle increases in search traffic while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I mentioned?
My Advice
Your best bet is using Redis as a broker in between Logstash and Elasticsearch:
This is described in some old Logstash docs but is still pretty relevant.
Yes, you will see a small delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through a Redis backlog pretty quickly.
This kind of setup is also more robust: even if Logstash goes down, you are still accepting events into Redis.
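A minimal sketch of what the broker wiring could look like in Logstash config, using the standard redis input and output plugins (the host name and list key are placeholders):

```
# shipper pipeline: push events into Redis instead of Elasticsearch
output {
  redis {
    host      => "redis.internal"
    data_type => "list"
    key       => "logstash"
  }
}

# indexer pipeline: drain Redis at its own pace and write to Elasticsearch
input {
  redis {
    host      => "redis.internal"
    data_type => "list"
    key       => "logstash"
  }
}
```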
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000
So you have an even amount of search and index threads available. Just before your peak time, you can change the setting (on the run) to the following:
threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500
Now you have many more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work as expected, though; experimentation and tweaking are recommended.
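On the older Elasticsearch versions this answer targets, thread pool settings could be changed at runtime through the cluster settings API. A sketch of building that transient-settings request body in Python (the helper name is mine, and the endpoint is not actually called here):

```python
import json

def threadpool_settings(index_size, index_queue, search_size, search_queue):
    """Build a transient cluster-settings body for the legacy dynamic
    threadpool.* settings (static in modern Elasticsearch versions)."""
    return {
        "transient": {
            "threadpool.index.size": index_size,
            "threadpool.index.queue_size": index_queue,
            "threadpool.search.size": search_size,
            "threadpool.search.queue_size": search_queue,
        }
    }

# PUT json.dumps(threadpool_settings(50, 2000, 10, 500)) to
# http://<node>:9200/_cluster/settings just before the peak window.
```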

Find ES bottleneck in bulk (bigdesk screenshots attached)

Updated: Beware long post
Before I go big and move to a bigger server I want to understand what's wrong with this one.
This is an MVP of an Elasticsearch server in AWS (EC2): two t1.micros with just 600MB of RAM each.
I want to understand what's wrong with this configuration. As you can see, there is bottlenecking in the bulk command. The OS memory is quite full, heap memory is still low, and although the process CPU runs at maximum, the OS CPU is low.
I reduced the complexity of each document in the bulk feed and set unwanted fields not to be indexed. The screenshots below are from my last attempt.
Is it an I/O bottleneck? I store the data in an S3 bucket.
Server Info:
2 nodes (one on each server), 3 indexes, each running with 2 shards and 1 replica. So there is a primary node with a running backup. Strangely, the "Iron man" node never took over a shard.
I ran the feeder again with the above cluster state, and the bottleneck seems to be on both nodes:
Here is the beginning of the feeder:
Primary:
Secondary (secondary has the bottleneck):
After 5 minutes of feeding:
Primary (now primary has the bottleneck):
Secondary (secondary now is better):
I'm using py-elasticsearch, so requests are auto-throttled in the streamer. However, after the big bottleneck below it threw this error:
elasticsearch.exceptions.ConnectionError:
ConnectionError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10)) caused by:
ReadTimeoutError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10))
And below is a very interesting screenshot of the same bulk feed. The queue reached 20, Python threw the exception above, and the refresh command has been running up to the moment I'm writing this.
My objective is to understand which resource (CPU, RAM, disk, network...) is inadequate, or better, how to use the existing resources more efficiently.
So Nate's script was (among other things) reducing the refresh interval. Let me add some other findings as well:
The refresh rate was stressing the cluster, but I kept digging and found more "errors". One gotcha was that I was using the deprecated S3 gateway. S3 is persistent but slower than the EC2 volume.
Not only did I have S3 as data storage, it was in a different region (EC2 in Virginia -> S3 in Oregon), so documents were being sent across regions over the network. I had ended up there because some old tutorials list S3 as a cloud data storage option.
After fixing that, the "Documents deleted" figure below was better. When I was using S3 it was around 30%. This is from the ElasticHQ plugin.
Now that I/O is optimized, let's see what else we can do.
I found out that CPU is an issue. Bigdesk says the workload was minimal, but t1.micros are not meant for sustained CPU usage: although on the charts the CPU is not fully used, that's because Amazon throttles it in intervals, and in reality it is fully used.
If you feed it bigger, more complex documents, it will stress the server.
Happy dev-oping.
Can you run the IndexPerfES.sh script against the index you are bulk indexing into? We can then see whether performance improves. I think the refresh rate is degrading performance and is perhaps causing stress on the cluster, leading to these problems. Let me know and we can work this out.
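If the refresh interval does turn out to be the culprit, it can be relaxed during bulk loads and restored afterwards. A sketch of the index-settings request (the index name and the value are illustrative; "-1" disables refresh entirely during the load):

```
PUT /my-index/_settings
{
  "index": { "refresh_interval": "30s" }
}
```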
