Sizing Elastic Search Cluster / hot-warm architecture - elasticsearch

I am trying to implement a hot-warm architecture in order to configure a 6 month data retention (36 TB, 200 GB per day) policy in which data has to be deleted from an index 180 days after it has been indexed.
A hot index older than 40 days will enter the warm phase.
I would like to know How many nodes per phase should my cluster have? How many shard/ replicas should I create? What is the optimal average size of a shard for the best searching performance? What is the Capacity of RAM / Storage / CPU per node ?
Any help is appreciated.

Related

Elasticsearch Shard distribution size differs enormously

I need to load 1.2 billion documents in the elasticsearch. As of today we have 6 nodes in the cluster. To equally distribute the shards among the 6 nodes I have mentioned the number of shards to be 42. I use spark and it takes me almost 3 days load the index. The shards distribution looks so off.
The node6 only has two shards in it while node 2 has almost 10 shards. The size distribution is also not even. Some shards are 114.6gb while some are just 870mb within the same node.
I have tried to figure out the solution too. I can include the
index.routing.allocation.total_shards_per_node: 7
while creating the index and make it evenly distribute. Will forcing the designated amount of shards in the node, crash the node if there is not enough resource available?
I want to size the shards evenly. My index size is 900 gb apprx. I want each shards to be atleast 20 gb. Could I use the following setting while creating the index?
max_primary_shard_size: 25gb
Is setting up max shard size only possible through ilm policy and will I require roll over policy for that ? I am not too familiar with the ilm. Sorry if this does not make sense.
The main reason I am trying to optimize the index is because I am getting timeout error on my application when I am querying the elastic search. I know I can increase my timeout time in my application and do some query optimization, but first I want to optimize my index and make my application as fast as possible.
I load the index only one time and do not write any documents to it after onetime load. For additional data, which i load every 15 days, I create a different index and use an alias name on the both the indexes to query. Other than sharding if there is any suggestion to optimize my indexes I will really appreciate it. It takes me 3 days just to load the data so it is quite difficult to experiment.
are you using custom routing values in your indexing approach? that might explain the shard size differences.
and if you aren't already, disable replicas and refreshes when doing your bulk index, as that will speed things up
finally your shard size of 20gig is probably a little low, I would suggest doubling that size, aiming for <50gig

Elasticsearch Memory Usage increases over time and is at 100%

I see that indexing performance degraded over a period of time in Elasticsearch. I see that the mem usage has slowly increased over a period of time until it became 100%. At this state I cannot index any more data. I have default shard settings - 5 primary and 1 replica. My index is time based with index created every hour to store coral service logs of various teams. An index size corresponds to about 3GB with 5 shards and with replica it is about 6GB. With a single shard and 0 replicas it comes to about 1.7 GB.
I am using ec2's i2.2x large hosts which offer 1.6TB space and 61GB RAM and 8 cores.
I have set the heap size to 30GB.
Following is node statistics:
https://jpst.it/1eznd
Could you please help in fixing this? My whole cluster came down that I had to delete all the indices.

Optimizing Bulk Indexing in elasticsearch

We have an elastic search cluster of 3 nodes of the following configurations
#Cpu Cores Memory(GB) Disk(GB) IO Performance
36 244.0 48000 very high
The machines are in 3 different zones namely eu-west-1c,eu-west-1a,eu-west-1b.
Each elastic search instance is being allocated 30GB of heap space.
we are using the above cluster for running aggregations only. The cluster has replication factor of 1 and all the string fields are not analyzed , doc_values is true for all the fields.
We are pumping data into this cluster running 6 instances of logstash in parallel ( having a batch size of 1000)
When more instances of logstash are started one by one the nodes of the ElasticSearch cluster starts throwing out of memory error.
What could be the possible optimizations to speed up bulk indexing rate on the cluster?= Will presence of nodes of cluster in the same zone increase bulk indexing? Will adding more nodes in the cluster help ?
Couple of steps taken so far
Increase the bulk queue size from 50 to 1000
Increase refresh interval from 1 seconds to 2 minutes
Changed segments merge throttling to none (
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing- performance.html)
We cannot set the replication factor to 0 due to inconsistency involved if one of the nodes goes down.

Why is there unequal CPU usage as elasticsearch cluster scales?

I have a 15 node elasticsearch cluster and am indexing a lot of documents. The documents are of the form { "message": "some sentences" }. When I had a 9 node cluster, I could get CPU utilization upto 80% on all of them, when I turned it into a 15 node cluster, i get 90% CPU usage on 4 nodes and only ~50% on the rest.
The specification of the cluster is:
15 Nodes c4.2xlarge EC2 insatnces
15 shards, no replicas
There is load balancer in-front of all the instances and the instances are accessed through the load balancer.
Marvel is running and is used to monitor the cluster
Refresh interval 1s
I could index 50k docs/sec on 9 nodes and only 70k docs/sec on 15 nodes. Shouldn't I be able to do more?
I'm not yet an expert on scalability and load balancing in ES but some things to consider :
load balancing should be native in ES thus having a load balancer in-front can actually mitigate the in-house load balancing results. It's kind of like having a speed limitation on your car but manually using the brakes, it doesn't make that much sense since your speed limitator should already do the job and will be prevented from doing it right when you input "manual regulation". Have you tried not using your load balancer and just using the native load balancing to see how it fares ?
while having more CPU / computation power across different servers / shards, it also forces you to go through multiple shards every time you write/read a document, thus if 1 shard can do N computations, M shards won't actually be able to do M*N computations
having 15 shards is probably overkill in a lot of cases
having 15 shards but no replication is weird/bad since if any of your 15 servers falls, you won't be able to access your whole index
you can actually hold multiple nodes on a single server
What is your index size in terms of storage ?

How to improve percolator performance in ElasticSearch?

Summary
We need to increase percolator performance (throughput).
Most likely approach is scaling out to multiple servers.
Questions
How to do scaling out right?
1) Would increasing number of shards in underlying index allow running more percolate requests in parallel?
2) How much memory does ElasticSearch server need if it does percolation only?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
3) Would having SSD meaningfully help percolator's performance, or it is better to increase RAM and/or number of nodes?
Our current situation
We have 200,000 queries (job search alerts) in our job index.
We are able to run 4 parallel queues that call percolator.
Every query is able to percolate batch of 50 jobs in about 35 seconds, so we can percolate about:
4 queues * 50 jobs per batch / 35 seconds * 60 seconds in minute = 343
jobs per minute
We need more.
Our jobs index have 4 shards and we are using .percolator sitting on top of that jobs index.
Hardware: 2 processors server with 32 cores total. 32GB RAM.
We allocated 8GB RAM to ElasticSearch.
When percolator is working, 4 percolation queues I mentioned above consume about 50% of CPU.
When we tried to increase number of parallel percolation queues from 4 to 6, CPU utilization jumped to 75%+.
What is worse, percolator started to fail with NoShardAvailableActionException:
[2015-03-04 09:46:22,221][DEBUG][action.percolate ] [Cletus
Kasady] [jobs][3] Shard multi percolate failure
org.elasticsearch.action.NoShardAvailableActionException: [jobs][3]
null
That error seems to suggest that we should increase number of shards and eventually add dedicated ElasticSearch server (+ later increase number of nodes).
Related:
How to Optimize elasticsearch percolator index Memory Performance
Answers
How to do scaling out right?
Q: 1) Would increasing number of shards in underlying index allow running more percolate requests in parallel?
A: No. Sharding is only really useful when creating a cluster. Additional shards on a single instance may in fact worsen performance. In general the number of shards should equal the number of nodes for optimal performance.
Q: 2) How much memory does ElasticSearch server need if it does percolation only?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
A: Percolator Indices reside entirely in memory so the answer is A LOT. It is entirely dependent on the size of your index. In my experience 200 000 searches would require a 50MB index. In memory this index would occupy around 500MB of heap memory. Therefore 4 GB RAM should be enough if this is all you're running. I would suggest more nodes in your case. However as the size of your index grows, you will need to add RAM.
Q: 3) Would having SSD meaningfully help percolator's performance, or it is better to increase RAM and/or number of nodes?
A: I doubt it. As I said before percolators reside in memory so disk performance isn't much of a bottleneck.
EDIT: Don't take my word on those memory estimates. Check out the site plugins on the main ES site. I found Big Desk particularly helpful for watching performance counters for scaling and planning purposes. This should give you more valuable info on estimating your specific requirements.
EDIT in response to comment from #DennisGorelik below:
I got those numbers purely from observation but on reflection they make sense.
200K Queries to 50MB on disk: This ratio means the average query occupies 250 bytes when serialized to disk.
50MB index to 500MB on heap: Rather than serialized objects on disk we are dealing with in memory Java objects. Think about deserializing XML (or any data format really) you generally get 10x larger in-memory objects.

Resources