ElasticSearch stats API taking long time to response - performance

We have an ES cluster at AWS running with the following setup:
(I know, i need minimum 3 master nodes)
1 Coordinator
2 Data nodes
1 Master Node
Data nodes spec:
CPU: 8 Cores
Ram: 20GB
Disk: 1TB ssd 4000 IOPS
Problem:
ES endpoints for Search, Delete, Backup, Cluster Heatlh, Insert are working fine.
Since yesterday some endpoints like /_cat/indices, /_nodes/_local/stats and etc, started to take too long to respond(more than 4 minutes) :( and consequently our Kibana is in red state(Timeout after 30000ms)
Useful info:
All Shards are OK (3500 in total)
The cluster is in green state
X-pack disabled
Average of 1gb/shard
500k document count.
Requests made by localhost at AWS
CPU, DISK, RAM, IOPS all fine
Any ideas?
Thanks in advance :)
EDIT/SOLUTION 1:
After a few days i found out what was the problem, but first a little bit context...
We use Elasticsearch for storing user audit messages, and mobile error messages, at the first moment (obiviously in a rush to deliver new microservices and remove load from our MongoDB cluster) we designed elasticsearch indices by day, so every day a new indice was created and at the end of the day that indice had arround 6 ~ 9gb of data.
Six months later, almost 180 indices bigger, and 720 primary shards open we bumped into this problem.
Then i did read this again(the basics!) :
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
After talking to the team responsible for this microservice we redesigned our indices to a monthly index, and guess what? problem solved!
Now our cluster is much faster than before and this simple command saved me some sweet nights of sleep.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
Thanks!

Related

Creating high throughput Elasticsearch cluster

We are in process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-Node cluster ( each node with 16 VCores and 60 GB RAM and 6 * 375GB SSDs) with all the nodes acting as master, data and co-ordination node. As it was a POC indexing speeds were not a consideration we were just trying to see if it will work or not.
Note : We did try to index 20 million documents on our POC cluster and it took about 23-24 hours to do that which is pushing us to take time and design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 million to 20 million documents per index ( we have 1 index for each client and there will be only one cluster). This bulk index is a weekly process i.e. we'll index all data once and will query it for whole week before refreshing it.We are aiming for a 0.5 million document per second indexing throughput.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has nested document structure and lot of queries on nested documents which according to me are CPU, Memory and IO intensive. We are aiming for sub second query times for 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings :
Have dedicated master nodes (always odd number to avoid split-brain). These machines can be medium sized ( 16 vCores and 60 Gigs ram) .
Give 50% of RAM to ES Heap with an exception of not exceeding heap size above 31 GB to avoid 32 bit pointers. We are planning to set it to 28GB on each node.
Data nodes are the workhorses of the cluster hence have to be high on CPUs, RAM and IO. We are planning to have (64 VCores, 240 Gb RAM and 6 * 375 GB SSDs).
Have co-ordination nodes as well to take bulk index and search requests.
Now we are planning to begin with following configuration:
3 Masters - 16Vcores, 60GB RAM and 1 X 375 GB SSD
3 Cordinators - 64Vcores, 60GB RAM and 1 X 375 GB SSD (Compute Intensive machines)
6 Data Nodes - 64 VCores, 240 Gb RAM and 6 * 375 GB SSDs
We have a plan to adding 1 Data Node for each incoming client.
Now since hardware is out of windows, lets focus on indexing strategy.
Few best practices that I've collated are as follows :
Lower number of shards per node is good of most number of scenarios, but have good data distribution across all the nodes for a load balanced situation. Since we are planning to have 6 data nodes to start with, I'm inclined to have 6 shards for the first client to utilize the cluster fully.
Have 1 replication to survive loss of nodes.
Next is bulk indexing process. We have a full fledged spark installation and are going to use elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set the refresh_interval to 1m to make sure that there are less frequent refreshes.
We are using 100 parallel Spark tasks which each task sending 2MB data for bulk request. So at a time there is 2 * 100 = 200 MB of bulk requests which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about setting cache percentage, thread pool size and queue size settings, but we are planning to keep them to smart defaults for beginning.
We are open to use both Concurrent CMS or G1GC algorithms for GC but would need advice on this. I've read pros and cons for using both and in dilemma in which one to use.
Now to my actual questions :
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
We will be sending query requests via coordinator nodes. Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
Say a search request come up to coordinator node it blocks 1 thread there and send request to data nodes which in turn blocks threads on data nodes as per where query data is lying. Is this assumption correct ?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly little — like less than 250 docs/s. I think my 8GB RAM laptop can insert 13 million docs in 2h. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: If you have a current Elasticsearch and Java version, both are an option, otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark be sure to have a long enough timeframe to properly hit GC phases and run as close to production queries in parallel as possible.
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6th of the documents would need to be forwarded to other data nodes anyway (if you have 6 data nodes). And you can offload the bulk processing and coordination handling to non data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
While there are different queues for different query operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the node.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): You might want to disable the refresh rate entirely and let Elasticsearch create segments as needed, but do a force refresh and change the setting once done. Also you could try running with 0 replicas while indexing, change replicas to 1 once done, and then wait for it to finish — though I'd benchmark if this is helping overall and if it's worth the added complexity.

What factors to consider for planning capacity for ELK?

I'm very new to ELK stack and trying to learn about it.
I want to understand if how to plan the capacity for 100GB/day data to be kept for 30 days , how to plan for it.
The data is mainly access logs, application logs, events.
I will need to run few regular expression to populate data fields ( unless there is any default delimiter support for it ) .
I'm planning to keep 3 replicas of it .
Can please someone guide me if how many servers and in what type of configuration would be required for this ?
Thanks very much in advance .
We use the "hot-warm architecture" (docs) of clustered elasticsearch instances (you can run multiple on one server!). Each with 31g memory.
Our "hot" instances configured with an retention time of 3 days and an ssd raid. After this three days data is moved to an "warm" HD Raid with an retention time of (30 - 90 days). Its nice because latest data to search in kibana is quite fast processed because of the ssds.
Sometimes we started 1 Elastic Master (4g memory) and 2 Elastic Clients (31g memory) on a single server.
In my opinion such a machine should have at least about 128g memory, 1tb ssd capacity, 4-6tb hd capacity.

Elasticsearch Indexing Speed Degrade with the Time

I am doing some performance tuning in elastic search for my project and I need some help in improving the elastic search indexing speed. I am using ES 5.1.1 and I have 2 nodes setup with 8 shards for the index. I have the servers for 2 nodes with 16GB RAM and 12CPUs allocated for each server with 2.2GHz clock speed. I need to index around 25,000,000 documents within 1.5 hours, which I am currently doing in around 4 hours. I have done the following config changes to improve the indexing time.
Setting ‘indices.store.throttle.type’ to ‘none’
Setting ‘refresh_interval’ to ‘-1’
Increasing ‘translog.flush_threshold_size’ to 1GB
Setting ‘number_of_replicas’ to ‘0’
Using 8 shards for the index
Setting VM Options -Xms8g -Xmx8g (Half of the RAM size)
I am using the bulk processor to generate the documents in my java application and I’m using the following configurations to setup the bulk processor.
Bulk Actions Count : 10000
Bulk Size in MB : 100
Concurrent Requests : 100
Flush Interval : 30
Initially I can index around 356167 in the first minute. But with the time, It decreases and after around 1 hour its around 121280 docs per minute.
How can I keep the indexing rate steady over the time? Is there any other ways to improve the performance?
I highly encourage not to change configuration parameters like the translog flush size, the throttling, unless you know what you are doing (and this does not mean reading some blog post on the internet :-)
Try a single shard per server and especially reduce the bulk size to something like 10MB. 100MB * 100 concurrent requests means you need 10GB of heap to deal with those (without doing anything else). I suppose not all of the documents get indexed because of your rejected tasks in your threadpools.
Start small and get bigger instead of starting big but not have any insight in your indexing.

Elasticsearch dies on primary node at random times. What should I look for in troubleshooting?

I have 2 Elasticsearch VM's running (4 GB RAM VM's) and they are configured in 1 cluster with 2 nodes. It's been running fine for months but all of a sudden in the last week the primary node has been just ending with no reason (that I can find).
So when I try to restart the node it takes a while to re-sync but does so eventually.
I have things set to only use 1/2 the RAM on the server and it seems to be doing that but looking at my free memory and HTOP I'm seeing more than that consumed.
My browser-based plugins (Bigdesk, Marvel, elasticsearch-head) also respond very very slowly now. I have noticed that my marvel index files are massive - over 1 GB a day. Can I remove these to improve performance? The marvel data is way more than my actual index/searchable data.
What else can I do to tweak and optimize this system?
Thanks.
Found a solution and hopefully this will help others. Did some other searching as I noticed there were a ton of Marvel indices and they were quite large. I found Curator and ran a delete on indices older than 30 days.
curator --host <IP ADDRESS> delete indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'
That command purged all the indices over 30 days old and the cluster went green immediately and all plug-ins are working like normal. Granted, I know I'm losing historical data over 30 days but that is ok.
HTH someone else that might be in the same boat.

Logstash/Elasticsearch/Kibana resource planning

How to plan resources (I suspect, elasticsearch instances) according to load:
With load I mean ≈500K events/min, each containing 8-10 fields.
What are the configuration knobs I should turn?
I'm new to this stack.
500,000 events per minute is 8,333 events per second, which should be pretty easy for a small cluster (3-5 machines) to handle.
The problem will come with keeping 720M daily documents open for 60 days (43B documents). If each of the 10 fields is 32 bytes, that's 13.8TB of disk space (nearly 28TB with a single replica).
For comparison, I have 5 nodes at the max (64GB of RAM, 31GB heap), with 1.2B documents consuming 1.2TB of disk space (double with a replica). This cluster could not handle the load with only 32GB of RAM per machine, but it's happy now with 64GB. This is 10 days of data for us.
Roughly, you're expecting to have 40x the number of documents consuming 10x the disk space than my cluster.
I don't have the exact numbers in front of me, but our pilot project for using doc_values is giving us something like a 90% heap savings.
If all of that math holds, and doc_values is that good, you could be OK with a similar cluster as far as actual bytes indexed were concerned. I would solicit additional information on the overhead of having so many individual documents.
We've done some amount of elasticsearch tuning, but there's probably more than could be done as well.
I would advise you to start with a handful of 64GB machines. You can add more as needed. Toss in a couple of (smaller) client nodes as the front-end for index and search requests.

Resources