Elastic 6.1 replication speed capped? - elasticsearch

I'm playing with Elastic 6.1.1 and testing the limits of the software.
If I take an index of ~300 GB with 0 replicas and 10 data nodes, and then decide to add a replica, all Elastic instances use the network heavily (but not the CPU). This is normal behaviour :)
But network usage appears to be "capped", judging by the network graphs, at 160 Mbps (about 20 MB/sec). This limit is strange because it was the default throttle limit in previous versions of Elastic (indices.store.throttle.max_bytes_per_sec), but that setting was removed starting with Elastic 2.X.
I wonder what this cap is, and how I could remove it.
I tried raising index.merge.scheduler.max_thread_count, with no effect.
Do you see any other tuning that can be done to that end?
Any feedback welcome !

The recovery settings (https://www.elastic.co/guide/en/elasticsearch/reference/6.1/recovery.html), in particular indices.recovery.max_bytes_per_sec, limit the transfer rate of anything related to copying shards from node to node. You can start by increasing the limit gradually and watching the impact on cluster performance.
Also, the shard allocation settings (https://www.elastic.co/guide/en/elasticsearch/reference/6.1/shards-allocation.html#_shard_allocation_settings) affect the traffic between nodes when copying shards.
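As a rough illustration of the two pages linked above, here is a minimal sketch that builds the JSON body for a `PUT /_cluster/settings` request. The setting names come from the linked 6.1 documentation; the function name and the example values (`100mb`, `4`) are assumptions chosen for illustration, not recommendations.

```python
import json


def recovery_settings_body(max_bytes_per_sec="100mb", concurrent_recoveries=4):
    """Build a transient cluster-settings body that loosens recovery throttling.

    The values are placeholders: raise them gradually and watch the cluster.
    """
    return {
        "transient": {
            # Per-node cap on recovery traffic (the throttle being hit).
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            # How many shard recoveries a node may run at the same time.
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
        }
    }


if __name__ == "__main__":
    # Send this body with PUT /_cluster/settings using the client of your choice.
    print(json.dumps(recovery_settings_body(), indent=2))
```

Using `transient` means the change is lost on a full cluster restart, which is convenient while experimenting.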

Related

optimization on old indexes collecting logs from my apps

I have an Elastic cluster with 3 nodes (each with 6 CPUs, 31 GB heap, 64 GB RAM) collecting 25 GB of logs per day, but after 3 months I realized my dashboards had become very slow when checking stats for past weeks. Please advise whether there is an option to improve the indexes' read performance so that my dashboard stats are calculated faster.
Thanks!
I would suggest you try increasing the number of shards.
When you have more shards, Elasticsearch splits your data across them, so a search is sent as multiple parallel requests that each cover a smaller set of data.
For the number of shards, you could size it based on your heap memory.
No matter what actual JVM heap size you have, the upper bound on the shard count should be 20 shards per 1 GB of heap configured on the server.
ElasticSearch - Optimal number of Shards per node
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
https://opster.com/elasticsearch-glossary/elasticsearch-choose-number-of-shards/
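To make the rule of thumb above concrete, here is a minimal sketch of the arithmetic for the cluster described in the question (3 nodes with 31 GB of heap each). The function name is hypothetical, and the result is an upper bound to stay under, not a target to aim for.

```python
def max_shards_for_heap(heap_gb, shards_per_gb=20):
    """Upper bound on shards for one node, using the 20-shards-per-GB rule of thumb."""
    return int(heap_gb * shards_per_gb)


# Values taken from the question: 3 nodes, 31 GB heap each.
per_node = max_shards_for_heap(31)
cluster_wide = per_node * 3

print(per_node)      # 620
print(cluster_wide)  # 1860
```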
It seems that the amount of data that you accumulated and use for your dashboard is causing performance problems.
A straightforward option is to increase your cluster's resources but then you're bound to hit the same problem again. So you should rather rethink your data retention policy.
Chances are that you are really only interested in most recent data. You need to answer the question what "recent" means in your use case and simply discard anything older than that.
Elasticsearch has tools to automate this, look into Index Lifecycle Management.
What you probably need is to create an index template and apply a lifecycle policy to it. Elasticsearch will then handle automatic rollover of indices, eviction of old data, even migration through data tiers in hot-warm-cold architecture if you really want very long retention periods.
All this will lead to a more predictable performance of your cluster.
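A minimal sketch of what such a lifecycle policy could look like, built as the JSON body for `PUT _ilm/policy/<policy-name>`. The rollover and retention values (`1d`, `50gb`, `30d`) and the function name are assumptions for illustration; pick values that match what "recent" means for your dashboards.

```python
import json


def ilm_policy(rollover_max_age="1d", rollover_max_size="50gb", delete_after="30d"):
    """Build an ILM policy body: roll over in the hot phase, delete old indices later."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index once either condition is met.
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    # Age is counted from rollover; older indices are removed.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(ilm_policy(), indent=2))
```

The policy is then referenced from an index template so that every new log index picks it up automatically.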

Figure out the problematic index in ES cluster?

I have an Elasticsearch cluster that hosts more than 15 indices, and a Datadog integration that shows me the view below of my cluster.
We have an alert integration with DD (Datadog) that alerts us if overall CPU usage goes beyond 60%. Our application also starts raising alerts when the Elasticsearch cluster is under stress, because in that case our response time increases beyond a configured threshold.
Now my problem is how to know which indices are consuming the most ES cluster resources, so that we can either throttle the requests to those indices or optimize them.
Some things which we did:
Looked at the slow query log: this doesn't reveal the culprit because, under heavy load or high CPU usage, we get slow query logs from almost all the big indices.
In the DD dashboard there is a spike in the bulk queue, but this is cluster-wide and not specific to a particular ES index.
So my problem is very simple: all I want is some metric from DD or Elastic that can easily tell me which indices are consuming the most resources on an Elasticsearch cluster.
Unfortunately I cannot propose an exact solution or workaround, but you might have a look at the following documentation/APIs:
Indices Stats API
Cluster Stats API
Nodes Stats API
The cpu usage is not included in the exported fields but maybe you can derive a high cpu usage behaviour from the other fields.
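One way to derive such a signal is to rank indices by the time they spend searching and indexing, as reported by the Indices Stats API (`GET /_stats`). The sketch below assumes the response has already been fetched and parsed into a dict; the function name and the sample data are hypothetical, and time spent is only a proxy for CPU usage.

```python
def rank_indices_by_time(stats, top=5):
    """Rank indices by cumulative search + indexing time from a /_stats response."""
    rows = []
    for name, data in stats.get("indices", {}).items():
        total = data.get("total", {})
        search_ms = total.get("search", {}).get("query_time_in_millis", 0)
        index_ms = total.get("indexing", {}).get("index_time_in_millis", 0)
        rows.append((name, search_ms + index_ms, search_ms, index_ms))
    # Busiest index first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


# Made-up sample in the shape of a /_stats response.
sample = {
    "indices": {
        "logs": {"total": {"search": {"query_time_in_millis": 100},
                           "indexing": {"index_time_in_millis": 900}}},
        "users": {"total": {"search": {"query_time_in_millis": 5000},
                            "indexing": {"index_time_in_millis": 10}}},
    }
}

for row in rank_indices_by_time(sample):
    print(row)
```

These counters are cumulative since the node started, so to see what is busy right now, take two samples a few minutes apart and compare the differences.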
I hope I could help you in some way.

optimize elasticsearch / JVM

I have a classifieds website. For this I'm using Elasticsearch, Postgres and Rails on the same Ubuntu 14.04 dedicated server, with 256 GB of RAM and 20 cores (40 threads).
I have 10 indexes on Elasticsearch, each with the default number of shards (5). They hold between 1,000 and 400,000 classifieds depending on the index.
The site gets approximately 5,000 requests per minute, 2/3 of which make an Elasticsearch request.
According to htop, the JVM is using around 500% of CPU.
I tried different options: I reduced the number of shards per index, and I also tried changing JAVA_OPTS as follows:
#JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
#JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
but it doesn't seem to change anything.
So, two questions:
When you change any setting on Elasticsearch and then restart, should the improvement (if any) be visible immediately, or can it arrive a bit later thanks to caching or anything else?
Can anyone help me find a good configuration for the JVM / Elasticsearch so that it does not take that many resources?
First, it's a horrible idea to run your web server, database and Elasticsearch server all on the same box. Each of these should be given its own box, at least. In the case of Elasticsearch, it's actually recommended to have at least 3 servers, or nodes. That way you end up with a load-balanced cluster that won't run into split-brain issues.
Further, sharding only makes sense in a cluster. If you only have one node, then all the shards reside on the same node. This causes two performance problems. First, you get the hit that sharding always adds. For every query, Elasticsearch must query each shard individually (each is a separate Lucene index). Then, it must combine and process the results from all the shards to produce the final result. That's a not insignificant amount of overhead. Second, because all the shards reside on the same node, you're I/O-locked. The shards have to be queried one at a time instead of all at once. Optimally, you should have one shard per node. However, since you can't create more shards without reindexing, it's common to have a few extra hanging around for future horizontal scaling. In that scenario, the cost of reindexing what could be hundreds of gigabytes of data or more outweighs a little bit of performance bottleneck. However, if you've got 5 shards running on one node, that's probably a large part of your performance problems right there.
Finally, and again, with Elasticsearch in particular, swapping is a huge no-no. Most of what makes Elasticsearch efficient is its cache, which all resides in RAM. If swaps occur, they disrupt the cache in sometimes unpredictable ways. As a result, it's recommended to turn off swapping completely on the box your node(s) run on, and set Elasticsearch/JVM to have a min and max memory consumption of roughly half the available RAM of the box. That's virtually impossible to achieve if you have other things running on it like a web server or database. Databases in particular aggressively consume RAM in order to increase throughput, which is why those should likewise reside on their own servers.
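The heap sizing advice can be turned into quick arithmetic. This is a minimal sketch, assuming a box dedicated to Elasticsearch. The 31 GB cap is not from the answer above; it reflects the commonly cited guidance to keep the heap below roughly 32 GB so the JVM can keep using compressed object pointers. The function names are hypothetical.

```python
def suggested_heap_gb(total_ram_gb, cap_gb=31):
    """Half of the machine's RAM, capped (the cap is an assumption, see above)."""
    return min(total_ram_gb // 2, cap_gb)


def jvm_heap_flags(total_ram_gb):
    """Return matching -Xms/-Xmx flags so the heap never has to resize."""
    heap = suggested_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


# The 256 GB server from the question, if it were dedicated to Elasticsearch.
print(jvm_heap_flags(256))  # ['-Xms31g', '-Xmx31g']
# A smaller 16 GB machine gets half of its RAM.
print(jvm_heap_flags(16))   # ['-Xms8g', '-Xmx8g']
```

The remaining RAM is not wasted: the operating system uses it for the file system cache, which Lucene relies on heavily.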

ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, that log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day to day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of high traffic during events: our traffic increases by about 2000% during these events. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However we have had issues where we have subsequently scaled down too quickly, meaning we have lost shards and corrupted our indexes.
I've been thinking of setting the auto_expand_replicas setting to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data, which works out to around 50 GB in all.
I've also seen people mention using a separate auto scaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the event I previously mentioned?
My Advice
Your best bet is using Redis as a broker in between Logstash and Elasticsearch.
This is described in some old Logstash docs but is still pretty relevant.
Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This also gives you a more robust setup, where even if Logstash goes down, you're still accepting the events through Redis.
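To see why a broker helps with a spike, here is a toy model of the backlog: events arrive at a varying rate, and the indexing side drains a fixed number per tick. It does not talk to Redis at all; the function name and every number in it are made up for illustration.

```python
def simulate_backlog(arrivals_per_tick, drain_per_tick):
    """Return the queue length after each tick for a fixed-rate consumer."""
    backlog = 0
    history = []
    for arrived in arrivals_per_tick:
        backlog += arrived
        # The consumer removes at most drain_per_tick events per tick.
        backlog -= min(backlog, drain_per_tick)
        history.append(backlog)
    return history


# Quiet traffic, then a two-tick spike, then quiet again.
arrivals = [100, 100, 2000, 2000, 100, 100, 100]
print(simulate_backlog(arrivals, drain_per_tick=500))
# [0, 0, 1500, 3000, 2600, 2200, 1800]
```

The backlog grows during the spike and shrinks afterwards, so the downstream cluster only ever sees the steady drain rate. The cost is the delay mentioned above while the queue empties.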
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    search:
        type: fixed
        size: 30
        queue_size: 1000
So you have an equal number of search and index threads available. Just before your peak time, you can change the settings (on the fly) to the following:
threadpool:
    index:
        type: fixed
        size: 50
        queue_size: 2000
    search:
        type: fixed
        size: 10
        queue_size: 500
Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work as expected, though, and experimentation and tweaking are recommended.
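Changing these values on the fly could be done through the cluster settings API. The sketch below only builds the request body for `PUT /_cluster/settings`, using the peak-time numbers from this answer. It assumes an older Elasticsearch release where the `threadpool.*` settings are dynamic; as far as I know, later versions renamed them to `thread_pool.*` and made them static node settings, so check your version. The helper names are hypothetical.

```python
import json


def flatten(prefix, settings):
    """Turn a nested dict into dotted setting names, e.g. threadpool.index.size."""
    flat = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten(dotted, value))
        else:
            flat[dotted] = value
    return flat


def peak_threadpool_body():
    """Transient settings that favour indexing over search during a peak."""
    peak = {
        "index": {"size": 50, "queue_size": 2000},
        "search": {"size": 10, "queue_size": 500},
    }
    return {"transient": flatten("threadpool", peak)}


if __name__ == "__main__":
    print(json.dumps(peak_threadpool_body(), indent=2))
```

After the event, sending the normal values the same way puts search and indexing back on an equal footing.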

Write heavy elasticsearch

I am writing a real-time analytics tool using Kafka, Storm and Elasticsearch, and want an Elasticsearch setup that is write-optimized for about 50K inserts/sec. As a proof of concept I tried bulk-inserting documents into Elasticsearch and reached 10K inserts per second.
I am running ES on a large Amazon EC2 instance.
I have tweaked the properties as below:
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 30mb
indices.memory.min_index_buffer_size: 96mb
threadpool.bulk.type: fixed
threadpool.bulk.size: 100
threadpool.bulk.queue_size: 2000
bootstrap.mlockall: true
But I want write performance on the order of 50K inserts/sec, not 10K, to keep my Storm topology flowing normally. Can anyone suggest how to configure an ES cluster optimized for heavy writes?
The scripts located here may help you improve indexing performance. There are many options and configurations to try; I write about some here, although this isn't a comprehensive list. Reducing replicas and increasing shards increases indexing performance, but reduces availability and search performance during indexing.
Perhaps sending HTTP bulk requests to several nodes rather than just the master node could help you reach the figures you want.
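A minimal sketch of that idea: build newline-delimited bodies for the `_bulk` endpoint and hand the batches out to the nodes in round-robin order. The node addresses, index name and function names are hypothetical, and no request is actually sent. The `_type` field was expected by older releases such as the one in the question and has been removed in newer ones, so it is optional here.

```python
import itertools
import json


def bulk_body(index, docs, doc_type=None):
    """Build a _bulk request body: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        action = {"_index": index}
        if doc_type is not None:
            action["_type"] = doc_type  # only for older Elasticsearch versions
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(doc))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


def assign_batches(batches, nodes):
    """Pair each batch with a node, cycling through the nodes in order."""
    node_cycle = itertools.cycle(nodes)
    return [(next(node_cycle), batch) for batch in batches]


# Hypothetical node addresses and documents.
nodes = ["http://es-node-1:9200", "http://es-node-2:9200"]
batches = [[{"n": 1}, {"n": 2}], [{"n": 3}], [{"n": 4}]]

for node, batch in assign_batches(batches, nodes):
    body = bulk_body("events", batch)
    print(node, len(body.splitlines()) // 2, "docs")
```

Each body would then be sent with `POST <node>/_bulk`, ideally from several worker threads, so that no single node has to coordinate every request.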
Hope this helps somewhat. 10K inserts/sec is good, better than what most people have achieved, though I don't know whether they were using a large Amazon instance.

Resources