Elasticsearch max shard size

I'm trying to configure my elasticsearch cluster but I need more information.
I understand the rules for choosing the number of shards to use.
What I don't know is the maximum size of a shard. Is it linked to the JVM max heap size or to another parameter?
Thanks

The rule of thumb is not to have shards larger than 30-50GB, but this number depends on the use case, your acceptable query response times, your hardware, etc.
You need to test this and establish the number yourself; there is no hard rule for how large a shard can be. Take one node from the cluster, create the index with one primary shard and no replicas, and add as many documents to it as you can. Test your queries. If you are happy with the response times, add more documents. Test again. And so on. When the response times are no longer satisfactory, you have found your hardware's limit for the shard size.
Don't forget that this number might be lower if the use case requires it (many shards on a single node, a huge number of concurrent requests per second, configuration mistakes, etc.).
In conclusion: test as indicated above and you will get an "ideal" shard size as a starting point.
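A minimal sketch of that test loop, assuming a local test node at http://localhost:9200, a hypothetical index name, and the Python `requests` library; it only shows how to read the shard's on-disk size and time a query between bulk loads, not the loading itself:

```python
import time
import requests

ES = "http://localhost:9200"   # assumed local test node
INDEX = "shard-size-test"      # hypothetical index name

# One primary shard, no replicas, as described above.
requests.put(f"{ES}/{INDEX}", json={
    "settings": {"number_of_shards": 1, "number_of_replicas": 0}
})

def shard_size_gb():
    # _cat/shards reports the on-disk size of each shard copy.
    shards = requests.get(f"{ES}/_cat/shards/{INDEX}?format=json&bytes=b").json()
    return int(shards[0]["store"] or 0) / 1024**3

def query_latency_ms(body):
    start = time.perf_counter()
    requests.post(f"{ES}/{INDEX}/_search", json=body)
    return (time.perf_counter() - start) * 1000

# After each bulk load of documents, re-check size and latency, and stop
# when the response time is no longer acceptable for your use case.
print(f"shard size: {shard_size_gb():.2f} GB, "
      f"latency: {query_latency_ms({'query': {'match_all': {}}}):.1f} ms")
```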

Related

Elasticsearch total shard count impact on an index search speed

I know that shard count and size have a significant impact on search performance (speed) and cluster recovery.
Does the total shard count impact search speed? Let me simplify: assume I have 5 indices with 5 primary shards each and I am searching in indice1 only, and assume it returns the response in 500ms. Will this stay the same (500ms) if I add 5 more indices? I know the recovery time would increase, but I am not sure about the search performance of a specific index.
Any help would be highly appreciated.
Common sense would imply that searching on more data takes longer; however, it's impossible to answer without also knowing:
the number of nodes (more nodes can parallelize searches on several shards)
their hardware specs (RAM and CPU play a role in how many concurrent searches can happen on a single node)
if any write operations also happen at the same time (taking resources away from search threads)
etc...
The best you can do is to actually create a test case (using e.g. Rally) and test this on your own infrastructure.
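For a quick sanity check before setting up a full Rally benchmark, a rough sketch of the comparison described in the question: time the same query against indice1 (the name from the question) before and after the extra indices exist and receive traffic. It assumes a test cluster at http://localhost:9200 and a stand-in query:

```python
import statistics
import time
import requests

ES = "http://localhost:9200"            # assumed test cluster
QUERY = {"query": {"match_all": {}}}    # stand-in for a realistic query

def median_latency_ms(index, runs=20):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.post(f"{ES}/{index}/_search", json=QUERY)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Measure indice1 alone, then again after the additional indices are
# created and loaded, and compare the two numbers on your own hardware.
print("indice1 search latency:", round(median_latency_ms("indice1"), 1), "ms")
```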

Elasticsearch Latency

I am using Elasticsearch's MultiSearch API to make multiple search requests at once for one of my endpoints. My understanding is that these requests are done in parallel, but my endpoint's latency increases with the number of search requests I make through the API (<50). I have two questions:
Why is this latency increase happening/how does multisearch work behind the scenes? I am new to Elasticsearch, apologies for my lack of knowledge here.
What are some ways I can improve latency while keeping multisearch?
To provide a more comprehensive answer, it would be good to know your cluster setup.
These requests are indeed done in parallel, but your cluster still has its limits.
What I believe might be happening is that you don't have enough search threads to process that many searches in parallel, and your search thread pool starts queueing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
So, for instance, if you issue a MultiSearch request of, let's say, 10 search queries where each query would hit 15 shards, the whole request will need 150 search threads in total. If other searches are running and the cluster doesn't have available search threads, they will start queueing and might eventually be rejected if the queue grows too big.
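To check whether this is what is happening, a small sketch that issues an _msearch request and then reads the search thread pool stats (the _cat/thread_pool API behind the link above). The endpoint and index names are assumptions for illustration:

```python
import json
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

# Build an _msearch body: one header line plus one query line per search (NDJSON).
searches = []
for index in ["index-a", "index-b", "index-c"]:      # hypothetical indices
    searches.append({"index": index})
    searches.append({"query": {"match_all": {}}, "size": 10})
ndjson = "\n".join(json.dumps(line) for line in searches) + "\n"

requests.post(f"{ES}/_msearch", data=ndjson,
              headers={"Content-Type": "application/x-ndjson"})

# While load is running, watch the search thread pool: a growing "queue"
# or non-zero "rejected" counts are the symptoms described above.
pool = requests.get(
    f"{ES}/_cat/thread_pool/search?format=json&h=node_name,active,queue,rejected"
).json()
for node in pool:
    print(node["node_name"], "active:", node["active"],
          "queue:", node["queue"], "rejected:", node["rejected"])
```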
What can you do about it?
Carefully review your index setups, their number_of_shards setting, and index sizes. Reducing number_of_shards will require fewer search threads. Find a balance between number_of_shards, index sizes, and their doc counts. If there are fewer than 5M documents, keep everything in a single shard; otherwise, try to have shards of 3M-5M documents, e.g. an index with 23M documents could use 5 or 6 shards (see the sketch after this list).
Scale your cluster horizontally by adding new nodes; this will add new search threads.
Tweak the default thread pool settings (this is mostly the last thing you'd do).
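The sizing guideline from the first point above, written out as a tiny helper; the 4M-documents-per-shard target is just a midpoint of the 3M-5M range mentioned there:

```python
import math

def suggested_shard_count(doc_count, docs_per_shard=4_000_000):
    """Rough helper for the guideline above: one shard under ~5M docs,
    otherwise aim for roughly 3M-5M documents per shard."""
    if doc_count < 5_000_000:
        return 1
    return math.ceil(doc_count / docs_per_shard)

print(suggested_shard_count(23_000_000))   # -> 6, matching the example above
```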

Elasticsearch - Sharding and Performance

I think I've finally got a grasp of the fundamentals of how to allocate shards for Elasticsearch. Please correct me if I'm wrong, this is what I've pieced together:
Ideally, there should only exist one shard per index, per node.
The only reason why we would ever want to configure more than one shard IS to over-allocate for future growth (i.e. adding more nodes to physically support the data).
Now, assuming what I have above is correct, I then wonder if there are any performance issues or differences if I only had one node with 1 shard versus one node with 5 shards. Can anyone enlighten me on this subject?
"The only reason why we would ever want to configure more than one shard IS to over-allocate for future growth (i.e. adding more nodes to physically support the data)."
Not necessarily so. Having more shards helps parallelise your queries and helps them finish faster, but beyond a point it can be counterproductive, as too many shards mean overhead in merging the individual shard responses and time spent queuing, and so on.
"one node with 1 shard versus one node with 5 shards"
It depends on what your use case is but you should see some performance gain for bigger queries, with 5 shards.
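To compare the two setups yourself, a sketch that creates two test indices on a single node, identical except for the primary shard count; the index names and the local endpoint are assumptions:

```python
import requests

ES = "http://localhost:9200"   # assumed single-node test setup

# Two hypothetical indices that differ only in their number of primary shards.
for name, shards in [("test-1-shard", 1), ("test-5-shards", 5)]:
    requests.put(f"{ES}/{name}", json={
        "settings": {"number_of_shards": shards, "number_of_replicas": 0}
    })

# Load the same data set into both, then run your bigger queries against each
# and compare response times to see whether the extra parallelism helps.
```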
I believe it depends on the size of the shards. For instance, the Elastic website says the following:
"Querying lots of small shards will make the processing per shard
faster, but as many more tasks need to be queued up and processed in
sequence, it is not necessarily going to be faster than querying a
smaller number of larger shards. Having lots of small shards can also
reduce the query throughput if there are multiple concurrent queries."
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
In practice I have found that some exploratory testing with realistic queries helps me determine more definitively how I should move forward with my architecture. It really depends on the use case. As was stated previously, however, there comes a point where you can sort of "over-optimize", and it ends up cancelling out any noticeable gains you may have otherwise obtained by doing the opposite.
To be succinct, one shard per index, per node is a fine practice. But if you find yourself needing more, then just assess your use case first and determine if additional shards are truly necessary.

Is there a limit on the number of indexes that can be created on Elastic Search?

I'm using AWS-provided Elasticsearch.
I have a signup page on my website, and on each signup a new index gets created for the new user (to be used later by their work-group), which means that the number of indexes is continuously growing (it has now reached around 4-5k).
My question is: is there a performance limit on the number of indexes? Is it safe (performance-wise) to keep creating new indexes dynamically with each new user?
Note: I haven't used AWS Elasticsearch, so this answer may vary, because they have started using Open Distro for Elasticsearch, a fork of the main project. But a lot of the principles should be the same. Also, this question doesn't have a definitive answer; it depends on various factors, but I hope this answer helps the thought process.
One of the factors is the number of shards and replicas per index, as that contributes to the total number of shards per node. Each shard consumes some memory, so you will have to keep the number of shards per node limited so that they stay within the maximum recommended 30GB of heap space. As per this comment, 600 to 1000 shards per node should be reasonable, and you can scale your cluster accordingly.
Also, you have to monitor the number of file descriptors and make sure that doesn't create any bottleneck for nodes to operate.
HTH!
If I'm not mistaken, the only limit is the disk space of your server, but if your index is growing too fast you should think about adding more replica servers. I recommend reading this page: Indexing Performance Tips
Indexes themselves have no limit, but shards do: the recommended number of shards per GB of JVM heap is 20 (you can check the heap on the Kibana Stack Monitoring tab). This means that if you have 5GB of JVM heap, the recommended maximum is 100 shards.
Remember that 1 index can have from 1 to x shards (1 primary and x replicas). Normally people have 1 primary and 1 replica; if this is your case, then you would be able to create 50 indexes with those 5GB of heap.
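A sketch of how you might check your own cluster against that guideline, using the cluster stats API (which reports total heap and total shard count); the endpoint is an assumption, and note that the reported heap covers all nodes, not only data nodes:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

stats = requests.get(f"{ES}/_cluster/stats").json()
heap_gb = stats["nodes"]["jvm"]["mem"]["heap_max_in_bytes"] / 1024**3
shards = stats["indices"]["shards"]["total"]

# The ~20-shards-per-GB-of-heap guideline mentioned above.
print(f"{shards} shards on ~{heap_gb:.1f} GB of total heap "
      f"(guideline: at most ~{20 * heap_gb:.0f})")
```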

Resource usage with rolling indices in Elasticsearch

My question is mostly based on the following article:
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
The article advises against having multiple shards per node for two reasons:
Each shard is essentially a Lucene index; it consumes file handles, memory, and CPU resources
Each search request will touch a copy of every shard in the index. Contention arises and performance decreases when the shards are competing for the same hardware resources
The article advocates the use of rolling indices for indices that see many writes and fewer reads.
Questions:
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
How does searching many small indices compare to searching one large one?
I should mention that in our particular case, there is only one ES node though of course generally applicable answers will be more useful to SO readers.
It's very difficult to spit out general best practices and guidelines when it comes to cluster sizing as it depends on so many factors. If you ask five ES experts, you'll get ten different answers.
After several years of tinkering and fiddling around with ES, I've found that what works best for me is always to start small (one node, however many indices your app needs, and one shard per index), load a representative data set (ideally your full data set) and load test to death. Your load testing scenarios should represent the real maximum load you're experiencing (or expecting) in your production environment during peak hours.
Increase the capacity of your cluster (add shards, add nodes, tune knobs, etc.) until your load test passes, and make sure to increase your capacity by a few more percent in order to allow for future growth. You don't want your production to be fine now, you want it to be fine a year from now. Of course, it will depend on how fast your data will grow, and it's very unlikely that you can predict with 100% certainty what will happen a year from now. For that reason, as soon as my load test passes, if I expect large exponential data growth, I usually increase the capacity by another 50%, knowing that I will have to revisit my cluster topology within a few months or a year.
So to answer your questions:
Yes, if old indices are left open, they will consume resources.
Yes, the more indices you search, the more resources you will need in order to go through every shard of every index. Be careful with aliases spanning many, many rolling indices (especially on a single node)
This is too broad to answer, as it again depends on the amount of data we're talking about and on what kind of query you're sending, whether it uses aggregations, sorting and/or scripting, etc.
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Yes.
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
Yes.
How does searching many small indices compare to searching one large one?
When ES searches an index, it will pick one copy of each shard (be it replica or primary) and ask that copy to run the query on its own set of data. Searching a shard will use one thread from the node's search thread pool (the thread pool is per node). One thread basically means one CPU core. If your node has 8 cores, then at any given time the node can concurrently search 8 shards.
Imagine you have 100 shards on that node and your query wants to search all of them. ES will initiate the search and all 100 shards will compete for the 8 cores, so some shards will have to wait some amount of time (microseconds, milliseconds, etc.) to get their share of those 8 cores. Having many shards means fewer documents on each and thus, potentially, a faster response time from each. But then the node that initiated the request needs to gather all the shards' responses and aggregate the final result. So, the response will be ready only when the slowest shard finally responds with its set of results.
On the other hand, if you have a big index with very few shards, there is not so much contention for those CPU cores. But since each shard has a lot of work to do individually, it can take more time to return its individual result.
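A sketch of how you might inspect this fan-out yourself: the _search_shards API shows how many shards a search over a given target would hit, and the _cat/thread_pool API can report the configured search thread pool size per node. The endpoint and the index pattern are assumptions for illustration:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint
TARGET = "logs-*"              # hypothetical index pattern or alias

# How many shards would a single search against this target fan out to?
shard_groups = requests.get(f"{ES}/{TARGET}/_search_shards").json()["shards"]
print("shards hit by one search:", len(shard_groups))

# Compare with the search thread pool size per node (roughly tied to CPU cores).
pool = requests.get(
    f"{ES}/_cat/thread_pool/search?format=json&h=node_name,size"
).json()
for node in pool:
    print(node["node_name"], "search threads:", node["size"])
```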
When choosing the number of shards, many aspects need to be considered. But as a rough guideline, yes, 30GB per shard is a good limit. It won't work for everyone and for every use case, though, and the article fails to mention that. If, for example, your index uses parent/child relationships, 30GB per shard might be too much and the response time of a single shard can be too slow.
You took this out of context: "The article advises against having multiple shards per node". No, the article advises you to think beforehand about how to structure your indices' shards. One important step here is testing. Please test your data before deciding how many shards you need.
You mentioned "rolling indices" in the post, and I assume you mean time-based indices. In this case, one question is about the retention period (how long you need to keep the data). Based on the answer to that question you can determine how many indices you'll have, and knowing how many indices you'll have gives you the total number of shards.
Also, with rolling indices, you need to take care of deleting the expired indices. Have a look at Curator for this.
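Curator (or ILM in newer versions) is the usual way to do this; purely as an illustration of the idea, a minimal Python sketch that deletes time-based indices older than a retention period. The endpoint, the logs-YYYY.MM.DD naming pattern, and the retention value are all hypothetical:

```python
from datetime import datetime, timedelta
import requests

ES = "http://localhost:9200"    # assumed cluster endpoint
PREFIX = "logs-"                # hypothetical rolling-index prefix
RETENTION_DAYS = 30             # hypothetical retention period

cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

indices = requests.get(f"{ES}/_cat/indices/{PREFIX}*?format=json&h=index").json()
for entry in indices:
    name = entry["index"]
    try:
        # Expect names like logs-2024.01.31 (date appended after the prefix).
        day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d")
    except ValueError:
        continue
    if day < cutoff:
        requests.delete(f"{ES}/{name}")   # drop the expired index
        print("deleted", name)
```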
