ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, which log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day to day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of very high traffic during events: our traffic increases by about 2000% during these events. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we subsequently scaled down too quickly, meaning we lost shards and corrupted our indexes.
I've been thinking of setting the auto_expand_replicas setting to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data, which works out to around 50 GB in all.
I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the event I previously mentioned?

My Advice
Your best bet is to use Redis as a broker between Logstash and Elasticsearch.
This is described in some old Logstash docs but is still pretty relevant.
Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This also gives you a more robust setup: even if Logstash goes down, you're still accepting the events through Redis.
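To get a feel for how a broker absorbs a spike, here is a minimal simulation sketch. It does not talk to Redis; it only models a FIFO queue in memory. The traffic numbers (100 events per tick normally, a roughly 20x burst) and the drain rate of 500 events per tick are assumptions picked for illustration, not measurements.

```python
from collections import deque


def simulate_broker(arrivals_per_tick, drain_per_tick):
    """Return the backlog size after each tick for a simple FIFO broker.

    arrivals_per_tick: number of events pushed by the shippers in each tick.
    drain_per_tick: maximum number of events the indexer pulls per tick
                    (assumed constant here).
    """
    queue = deque()
    backlog = []
    for arrivals in arrivals_per_tick:
        queue.extend(range(arrivals))  # producers push new events
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()  # consumer pops the oldest events first
        backlog.append(len(queue))
    return backlog


# Normal load, a three-tick burst at roughly 20x, then normal load again.
traffic = [100, 100, 2000, 2000, 2000] + [100] * 20
history = simulate_broker(traffic, drain_per_tick=500)
print("peak backlog:", max(history), "final backlog:", history[-1])
```

The backlog grows during the burst and drains afterwards, so the indexing side only ever sees its steady drain rate. The cost is the indexing delay while the backlog is non-empty.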
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000
So you have an equal number of search and index threads available. Just before your peak time, you can change the settings (on the fly) to the following:
threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500
Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the indexing queue_size to allow for more of a backlog to build up. This might not work as expected, though, so experimentation and tweaking are recommended.
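As a sketch of how the peak-time values could be packaged for a runtime change: the helper below only builds a JSON body of the shape accepted by the cluster update settings API (PUT /_cluster/settings). It assumes an older Elasticsearch release in which the threadpool.* settings were dynamically updatable; newer releases treat thread pool settings as static node settings, so check the documentation for your version. The function name is hypothetical.

```python
import json


def threadpool_settings(index_size, index_queue, search_size, search_queue):
    """Build a transient cluster-settings body for the index and search pools.

    Assumption: the cluster still accepts dynamic threadpool.* updates.
    """
    return {
        "transient": {
            "threadpool.index.size": index_size,
            "threadpool.index.queue_size": index_queue,
            "threadpool.search.size": search_size,
            "threadpool.search.queue_size": search_queue,
        }
    }


# Peak-time values from the example above: favour indexing over search.
peak = threadpool_settings(50, 2000, 10, 500)
# Values to restore once the event is over.
normal = threadpool_settings(30, 1000, 30, 1000)
print(json.dumps(peak, indent=2))
```

Keeping both bodies next to each other makes it harder to forget to switch back after the event.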


Elasticsearch application latency investigation

We have an Elasticsearch setup with [data, master, client] nodes. The client nodes receive only query traffic: they pass the query to the data nodes, get the response, and send it back to the caller (based on my general understanding).
We are seeing that the 'took' latency in the query response is around ~16ms, but the latency our application measures when calling the client nodes is around ~90ms. Here are some numbers on our setup:
ES Setup: 3 client nodes (60GB/3 cpu/30GB heap each), 3 data nodes (80GB/16 cpu/30GB heap each), 3 master nodes. It's a k8s-based Helm chart setup.
Client and data nodes have enough cpu/mem (based on k8s pod-level cpu/mem usage)
QPS - 20 req/sec
Shard size ~ 24GB, 0 replicas. Each shard is on a separate data-node. Indices are using mmapfs/preload "*"
Query types: bool query w/ 3 match clauses and 3 should for boosting on few fields. We have "_source=true".
Our documents are quite large, with (mean, p90, p99) sizes of (200 KB, 400 KB, 800 KB)
Our response size is of the order of (mean, p99) (164 KB, 840 KB). We also observed that latencies for bigger response sizes are much higher than the baseline.
Can someone comment on the following questions:
How can we find out more exactly where this extra latency is introduced? When reading about "took" here, it includes the querying and response forming stages. But something happens after that, so that the latency measured by our application jumps to ~90ms. Where else can I look to dig into this increase? I have access to the Prometheus ES dashboard and k8s pod usage, but all of them look normal, with no spikes.
Are there some ES settings we can play with to optimize this latency? We feel it's mostly due to the bigger response sizes. Can some compression be introduced in ES to help with this?
Your question is very broad, with very little information on your deployment, like the types of search queries, index mapping, cluster/node/index specifications, and QPS. It's generally very difficult to suggest anything without looking at the system that has the performance issues.
Regarding client nodes: yes, they receive the traffic, but they also calculate the global result from the local result sets received from each shard involved in the search request. So they do the heavy processing and should have enough capacity, otherwise they become the bottleneck: even though your data nodes calculate the local results fast, processing at the client node would take more time and the overall took time would increase.
You can also check whether any of the suggestions I wrote in this blog post leave you room for improvement.
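On the compression question, one cheap experiment is to gzip a representative response body offline and see how much it shrinks before changing anything in the cluster. The sketch below uses a synthetic, highly repetitive payload, so it will overstate the gain; substitute a real captured response. To my knowledge Elasticsearch only compresses HTTP responses when http.compression is enabled and the client sends an Accept-Encoding: gzip header, so treat that as something to verify for your version and client.

```python
import gzip
import json


def gzip_ratio(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size for a payload."""
    return len(gzip.compress(payload, compresslevel=level)) / len(payload)


# Hypothetical search response with large _source documents.
hits = [
    {
        "_id": str(i),
        "_source": {
            "title": f"document {i}",
            "body": "lorem ipsum dolor sit amet " * 200,
        },
    }
    for i in range(10)
]
body = json.dumps({"took": 16, "hits": {"hits": hits}}).encode("utf-8")

ratio = gzip_ratio(body)
print(f"{len(body)} bytes shrink to {ratio:.1%} of the original size")
```

If a real response compresses well, the remaining gap between 'took' and the measured latency is more likely to come from serialization and transfer of the large _source payloads than from the query itself. Returning fewer fields through source filtering is another way to test that theory.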
Hope this helps.

Elasticsearch drops too many requests -- would a buffer improve things?

We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.
From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.
That's not a satisfactory long-term solution though, and I was looking for alternatives. I have noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty and I attributed that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.
That gave me the idea that I could write a buffer that receives requests from the workers, batches them in groups of 200-300 requests, and then sends a single bulk request to Elasticsearch for each group.
Does such a thing already exist, and does it sound like a good idea?
First of all, to troubleshoot the issue or find the root cause, it's important to understand what happens behind the scenes when you send an index request to Elasticsearch.
Elasticsearch has several thread pools, but indexing requests (single or bulk) use the write thread pool. Please check this against your Elasticsearch version, as Elastic keeps changing the thread pools (earlier there were separate thread pools for single and bulk requests, with different queue capacities).
In recent ES versions (7.10 is the latest), the write thread pool's queue capacity is 10000, increased significantly from 200 in earlier releases. The likely reasons for the change are:
Elasticsearch now prefers to buffer more indexing requests instead of rejecting them.
A larger queue capacity means more latency, but it's a trade-off: it reduces data loss if the client doesn't have a retry mechanism.
I assume you have not yet moved to ES 7.9, the version where the capacity was increased, but you can increase the size of this queue gradually and allocate more processors (if you have spare capacity) easily through the config change mentioned in this official example. This is a very debatable topic, and a lot of people consider it a band-aid rather than a proper fix, but now that Elastic themselves have increased the queue size, you can try it too, and if your increased traffic only lasts a short time then it makes even more sense.
Another critical thing is to find the root cause of why your ES nodes are queuing up more requests. It can be legitimate, such as indexing traffic increasing until the infrastructure reaches its limit. If it's not legitimate, you can have a look at my short tips to improve one-time indexing performance and overall indexing performance; implementing these tips will give you a better indexing rate, which will reduce the pressure on the write thread pool queue.
Edit: As mentioned by @Val in the comments, if you are indexing docs one by one, then moving to the bulk index API will give you the biggest boost.
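The buffering idea from the question can be sketched as follows. This is a minimal in-memory batcher, not a production component: the class name, the send callback and the batch size of 250 are all hypothetical. It only builds the newline-delimited body that the bulk API expects (an action line followed by the document, each on its own line, with a trailing newline) and hands it to whatever transport you supply.

```python
import json


class BulkBuffer:
    """Collect index actions and flush them as one bulk body per batch."""

    def __init__(self, send, max_actions=250):
        self._send = send        # callable that ships one bulk body
        self._max = max_actions  # flush threshold (assumed value)
        self._lines = []
        self._count = 0

    def add(self, index, doc):
        # Each action is two lines: the action metadata, then the document.
        self._lines.append(json.dumps({"index": {"_index": index}}))
        self._lines.append(json.dumps(doc))
        self._count += 1
        if self._count >= self._max:
            self.flush()

    def flush(self):
        if not self._lines:
            return
        # The bulk body must end with a newline.
        self._send("\n".join(self._lines) + "\n")
        self._lines = []
        self._count = 0


# Stand-in transport: collect the bodies instead of POSTing them to /_bulk.
sent = []
buf = BulkBuffer(sent.append, max_actions=250)
for i in range(600):
    buf.add("events", {"task": i})
buf.flush()  # ship the final partial batch
print(len(sent), "bulk requests instead of 600 single requests")
```

A real buffer would also need a time-based flush so that a slow trickle of documents does not sit in memory indefinitely, and it would have to inspect the per-item results in each bulk response and retry the rejected ones. The official client libraries ship bulk helpers that cover much of this, so check those before writing your own.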

Is there a way to find out if load on Elastic stack is growing?

I have just started learning the Elastic stack and I already have to diagnose a production issue. From time to time our setup has problems pulling messages from ActiveMQ into Elasticsearch using Logstash. There is a lag which can be 1-3 hours.
One suspicion is that the load went up after the latest release of our application.
Is there a way to find out the total size of the stored messages, grouped by month? Not only their number but their total size. Maybe the size of the documents went up rather than the number of documents.
Start with setting up a production monitoring instance to provide detailed statistics on your cluster: https://www.elastic.co/guide/en/elastic-stack-overview/7.1/monitoring-production.html
This will allow you to get at those metrics like messages/month, average document size, index performance, buffer load, etc. A bit more detail on internal performance is available with https://visualvm.github.io/
While putting that piece together, you can also tweak Logstash performance e.g.
Tune Logstash worker settings:
Begin by scaling up the number of pipeline workers by using the -w flag. This will increase the number of threads available for filters and outputs. It is safe to scale this up to a multiple of CPU cores, if need be, as the threads can become idle on I/O.
You may also tune the output batch size. For many outputs, such as the Elasticsearch output, this setting will correspond to the size of I/O operations. In the case of the Elasticsearch output, this setting corresponds to the batch size.
From https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html
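For the size-per-month question specifically, a rough answer can be computed from per-index store sizes, assuming Logstash writes to daily indices named like logstash-YYYY.MM.dd (its default naming). The sketch below takes a mapping of index name to size in bytes, which you could collect from the _cat/indices output; the sample figures are made up.

```python
import re
from collections import defaultdict

# Matches daily indices such as "logstash-2019.05.17".
DAILY_INDEX = re.compile(r"^.+-(?P<year>\d{4})\.(?P<month>\d{2})\.\d{2}$")


def size_by_month(index_sizes):
    """Sum store sizes (in bytes) of daily indices per calendar month."""
    totals = defaultdict(int)
    for name, size in index_sizes.items():
        match = DAILY_INDEX.match(name)
        if match:  # skip indices that are not date-suffixed, e.g. .kibana
            totals[f"{match['year']}-{match['month']}"] += size
    return dict(totals)


# Made-up sample data: index name -> store size in bytes.
sample = {
    "logstash-2019.04.29": 1_200_000_000,
    "logstash-2019.04.30": 1_300_000_000,
    "logstash-2019.05.01": 2_900_000_000,
    ".kibana": 50_000,
}
monthly = size_by_month(sample)
for month in sorted(monthly):
    print(month, monthly[month])
```

Dividing each index's size by its document count gives an approximate average document size, which helps to tell "more documents" apart from "bigger documents". Store size includes index overhead and replicas, so treat the result as a trend indicator rather than an exact message size.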

optimize elasticsearch / JVM

I have a website for classifieds. For this I'm using Elasticsearch, Postgres and Rails on the same Ubuntu 14.04 dedicated server, with 256GB of RAM and 20 cores (40 threads).
I have 10 indexes on Elasticsearch, each with the default number of shards (5). They hold between 1,000 and 400,000 classifieds, depending on the index.
There are approximately 5000 requests per minute, 2/3 of which make an Elasticsearch request.
According to htop, the JVM is using around 500% of CPU.
I tried different options: I reduced the number of shards per index, and I also tried changing JAVA_OPTS as follows
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
but it doesn't seem to change anything.
So, two questions:
When you change any setting on Elasticsearch and then restart, should the improvement (if any) be visible immediately, or can it arrive a bit later because of caching or anything else?
Can anyone help me find a good configuration for the JVM / Elasticsearch so that it does not take that many resources?
First, it's a horrible idea to run your web server, database and Elasticsearch server all on the same box. Each of these should be given its own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load-balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on the same node. This causes two performance problems. First, you get the hit that sharding always adds. For every query, Elasticsearch must query each shard individually (each is a separate Lucene index). Then, it must combine and process the results from all the shards to produce the final result. That's a not insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked. The shards have to be queried one at a time instead of all at once. Optimally, you should have one shard per node. However, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be hundreds of gigabytes of data or more outweighs a small performance bottleneck. However, if you've got 5 shards per index running on one node, that's probably a large part of your performance problems right there.
Finally, and again, with Elasticsearch in particular, swapping is a huge no-no. Most of what makes Elasticsearch efficient is its cache, which all resides in RAM. If swaps occur, it messes with the cache in sometimes unpredictable ways. As a result, it's recommended to turn off swapping completely on the box your node(s) run on, and to set Elasticsearch/JVM to have a min and max memory consumption of roughly half the available RAM of the box. That's virtually impossible to achieve if you have other things running on it, like a web server or database. Databases in particular aggressively consume RAM in order to increase throughput, which is why those should likewise reside on their own servers.
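As a small illustration of the heap rule, the sketch below derives matching -Xms/-Xmx flags from the RAM of a dedicated Elasticsearch box. The cap of 31 GB is my own addition and not part of the advice above: it reflects the commonly cited guidance to keep the heap below the JVM's compressed object pointer threshold (around 32 GB). The function name is hypothetical, and the result assumes the box runs nothing but Elasticsearch.

```python
def heap_flags(total_ram_gb, cap_gb=31):
    """Return JVM heap flags: half of RAM, capped, with min equal to max.

    total_ram_gb: RAM of a box dedicated to Elasticsearch.
    cap_gb: upper bound for the heap (assumed value, see note above).
    """
    heap_gb = max(1, min(total_ram_gb // 2, cap_gb))
    # Setting min and max to the same value avoids heap resizing at runtime.
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


print(heap_flags(16))
print(heap_flags(256))
```

The memory left outside the heap is not wasted: the operating system uses it for the filesystem cache, which Lucene relies on heavily.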

Write heavy elasticsearch

I am writing a real-time analytics tool using Kafka, Storm and Elasticsearch, and I want an Elasticsearch setup that is write-optimized for about 50K inserts per second. For the purposes of a POC, I tried inserting documents in bulk into Elasticsearch and attained 10K inserts per second.
I am running ES on a large box of amazon ec2.
I have tweaked the properties as below:
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 30mb
indices.memory.min_index_buffer_size: 96mb
threadpool.bulk.type: fixed
threadpool.bulk.size: 100
threadpool.bulk.queue_size: 2000
bootstrap.mlockall: true
But I want write performance in the order of 50K per second, not 10K, to ensure the normal flow of my Storm topology. Can anyone suggest how to configure an ES cluster optimized for heavy writes?
The scripts located here may help you improve indexing performance. There are many options and configurations to try; I write about some here, though this isn't a comprehensive list. Reducing replicas and increasing shards improves indexing performance, but it reduces availability, and search performance suffers during indexing.
Perhaps sending HTTP bulk requests to several nodes rather than just the master node could help you get the figures you desire.
Hope this helps somewhat. 10K inserts per second is good, and better than what most people have achieved; however, whether they get to use a large Amazon instance, I don't know.
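Spreading bulk requests over several nodes can be as simple as rotating through a list of node addresses on the client side. The sketch below only plans which node receives which batch; the node URLs and the batch placeholders are hypothetical, and the actual HTTP call is left out.

```python
from itertools import cycle


def assign_batches(batches, nodes):
    """Pair each bulk batch with a node, rotating through the nodes in order."""
    node_cycle = cycle(nodes)
    return [(next(node_cycle), batch) for batch in batches]


# Hypothetical node addresses and placeholder bulk bodies.
nodes = [
    "http://es-node-1:9200",
    "http://es-node-2:9200",
    "http://es-node-3:9200",
]
batches = [f"bulk-body-{i}" for i in range(7)]

plan = assign_batches(batches, nodes)
for node, batch in plan:
    print(node, batch)
```

Many client libraries can do this rotation for you when they are configured with more than one host, so a hand-rolled version is mainly useful for ad hoc load scripts.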
