Does the document count matter when consider shard size? - elasticsearch

I am reading this doc https://www.elastic.co/guide/en/elasticsearch/reference/7.17/size-your-shards.html to decide how many shards I need.
It mentioned some factors like data size per shard, node heap memory size etc. For example, it says generally try to keep one shard size between 10G to 50G, but it doesn't mention document count.
I have some data which is very small individually but has a large number. It takes 5 GB storage for 10 million documents. In this case, do I use 1 shard?
The query executes on a single thread per shard, to search 10 million documents in one thread is probably not a good idea.
How should I size the shard for a large count of small document in this case?

there is a per shard limit at the lucene level of (2^32)-1 as per Elasticsearch and Lucene document limit
and while the recommended shard size is <50 gig, you can have smaller indices if that's all the data you have. the big, and unsaid thing in that suggestion is that you shouldn't have a tonne of really small indices. eg thousands of indices with 1 shard and a small amount of documents. you are better off merging them (if you can), as ultimately the bulk of the resource use is based on the number of shards, not documents
for what you want, use 1 primary shard

Related

Elasticsearch Shard distribution size differs enormously

I need to load 1.2 billion documents in the elasticsearch. As of today we have 6 nodes in the cluster. To equally distribute the shards among the 6 nodes I have mentioned the number of shards to be 42. I use spark and it takes me almost 3 days load the index. The shards distribution looks so off.
The node6 only has two shards in it while node 2 has almost 10 shards. The size distribution is also not even. Some shards are 114.6gb while some are just 870mb within the same node.
I have tried to figure out the solution too. I can include the
index.routing.allocation.total_shards_per_node: 7
while creating the index and make it evenly distribute. Will forcing the designated amount of shards in the node, crash the node if there is not enough resource available?
I want to size the shards evenly. My index size is 900 gb apprx. I want each shards to be atleast 20 gb. Could I use the following setting while creating the index?
max_primary_shard_size: 25gb
Is setting up max shard size only possible through ilm policy and will I require roll over policy for that ? I am not too familiar with the ilm. Sorry if this does not make sense.
The main reason I am trying to optimize the index is because I am getting timeout error on my application when I am querying the elastic search. I know I can increase my timeout time in my application and do some query optimization, but first I want to optimize my index and make my application as fast as possible.
I load the index only one time and do not write any documents to it after onetime load. For additional data, which i load every 15 days, I create a different index and use an alias name on the both the indexes to query. Other than sharding if there is any suggestion to optimize my indexes I will really appreciate it. It takes me 3 days just to load the data so it is quite difficult to experiment.
are you using custom routing values in your indexing approach? that might explain the shard size differences.
and if you aren't already, disable replicas and refreshes when doing your bulk index, as that will speed things up
finally your shard size of 20gig is probably a little low, I would suggest doubling that size, aiming for <50gig

ElasticSearch - How does sharding affect indexing performance?

I'm doing some benchmarks on a single-node cluster of ElasticSearch.
I faced to the situation that more shards will reduce the
indexing performance -at least in a single node- (both in latency and throughput)
These are some of my numbers:
Index with 1 shard it indexed +6K documents per minute
Index with 5 shards it indexed +3K documents per minute
Index with 20 shards it indexed +1K documents per minute
I had the same results with bulk API. So I'm wondering what's the relation and why this happens?
Note: I don't have the resource problem! Resources are free (CPU & Memory)
Just to have you on the same page:
Your data is organized in indices, each made of shards and distributed across multiple nodes. If a new document needs to be indexed, a new id is being generated and the destination shard is being calculated based on this id. After that, the write is delegated to the node, which is holding the calculated destination shard. This will distribute your documents pretty well across all of your shards.
Finding documents by id is now easy, as the shard, containing the wanted document, can be calulated just based on the id. There is no need for searching all shards. BTW, that's the reason why you can't change the number of shards afterwards. The changed shard number will result in a different document distribution across your shards.
Now, just to make it clear, each shard is a separate lucene index, made of segment files located on your disk. When writing, new segments will be created. If a particular number of segment files will be reached, the segments will be merged.
So just introducing more shards without distributing them to other nodes will just introduce a higher I/O and memory consumption for your single node.
While searching, the query will be executed against each shard. Afterwards the results of all shards needs to be merged into one result - more shards, more cpu work to do...
Coming back to your question:
For your write heavy indexing case, with just one node, the optimal number of indices and shards is 1!
But for the search case (not accessing by id), the optimal number of shards per node is the number of CPUs available. In such a way, searching can be done in multiple threads, resulting in better search performance. Correction: Searching and indexing are multithreaded, a single shard can fully utilize all CPU cores from a node.
But what are the benefits of sharding?
Availability: By replicating the shards to other nodes you can still serve if some of your nodes can´t be reached anymore!
Performance: Distibuting the primary shards to different nodes, will distribute the workload too.
So if your scenario is write heavy, keep the number of shards per index low. If you need better search performance, increase the number of shards, but keep the "physics" in mind. If you need reliability, take the number of nodes/replicas into account.
Further readings:
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
https://www.elastic.co/de/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
I faced to the situation that more shards will reduce the indexing
performance -at least in a single node- (both in latency and
throughput)
For reference: Elasticsearch is a distributed database. Data is stored in an "index", the index is split into "shards". Each "shard" is allocated on a node (a different node if possible).
Having more shards allows to use more machines. This is very much how the "distributed" in "distributed database" actually work. Elasticsearch will automatically allocate and move shards in the background, to balance disk usage across all machines.
With 1 shards, the data is split onto one node, this gives you a baseline of N reads and M writes per second.
With 3 shards, the data is split onto three nodes, this gives you 3 times the throughput.
Of course this assumes that there are 3 machines available. If there is a single machine, then the machine is doing all the processing either way and having more shards has no effect.
There is a bit of overhead with sharding, gotta distribute queries and merge back results, hence doubling the amount of shards will not exactly double performance (expect in the order of +90%).
Your cluster has a single machine. You lose performance when you increase the amount of shards, because it's just increasing the overhead.
P.S. Shards have a replica by default, the replica will take over if the primary is gone (machine failed), this is how resiliency works. An index with 5 shards and 5 replicas can fully utilize 10 nodes. Meaning it takes few shards to use many many nodes.
P.P.S In my experience a configuration of shard=5 is a maximum. You should never set more than that, unless working with large clusters (10+ machines) or terabytes indexes.

Why would I need to shrink index in elasticsearch

I tried searching for an answer to my question but I couldn't find any, this is my first time dealing with big data and Elasticsearch, I'm trying to learn how Elasticsearch works by going through there online tutorial, while reading I came across the topic for shrinking indices and how that can be done, OK now I know how to do it but unfortunately I don't know why I need to do it?
Why do I need to shrink my index and decrease my shards? is it space related change or what?
Every Elasticsearch index consists of multiple shards (default 5), which are each a Lucene index. Each one of these has an overhead (in terms of memory, file handles,...) but allow more parallelization. In case you don't need that much parallelization any more at some point — think of a daily index for logs and after a few days there won't be more writes any more and only few reads — you might want to reduce the number of shards to cut down on their overhead.
The number of shards is tied to query performance in the following way:
How does shard size affect performance?
In Elasticsearch, each query is executed in a single thread per shard.
Multiple shards can however be processed in parallel, as can multiple
queries and aggregations against the same shard.
This means that the minimum query latency, when no caching is
involved, will depend on the data, the type of query, as well as the
size of the shard. Querying lots of small shards will make the
processing per shard faster, but as many more tasks need to be queued
up and processed in sequence, it is not necessarily going to be faster
than querying a smaller number of larger shards. Having lots of small
shards can also reduce the query throughput if there are multiple
concurrent queries.
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

Elastisearch 2.3.2 limits on resources

Is there a list of hard limits for Elasticsearch?
I am particularly interested if there is a theoretical limit on the number of indices one can create, and the number of records an index can have.
There are limits to the number of docs per shard of 2 billion, which is a hard Lucene limit.
The actual value is Integer.MAX_VALUE - 128 which is 2147483519 documents per shard.
While asking about the number of documents can be a realistic question, looking at the maximum number of indices is the wrong question. Probably there is a JVM limit, assuming some kind of array or ArrayList that holds these indices (or indices mappings - the cluster state) then the limit would be the size of the array or the size of the ArrayList/HashSet/Map etc.
Way before reaching that theoretical limit your cluster is probably dead or not even able to start. Each shard is using resources, the cluster state would be huge, even with one primary shard only. The correct question would be to ask about performance issues with your cluster and not thinking about the maximum number of indices in a cluster.

ElasticSearch - Optimal number of Shards per node

I would appreciate if someone could suggest the optimal number of shards per ES node for optimal performance or provide any recommended way to arrive at the number of shards one should use, given the number of cores and memory foot print.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why would you want to have more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). That's kind of a big deal to recreate all that data, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, that means that you can scale out to up to 6 nodes at a later point easily without resharding.
There are three condition you consider before sharding..
Situation 1) You want to use elasticsearch with failover and high availability. Then you go for sharding.
In this case, you need to select number of shards according to number of nodes[ES instance] you want to use in production.
Consider you wanna give 3 nodes in production. Then you need to choose 1 primary shard and 2 replicas for every index. If you choose more shards than you need.
Situation 2) Your current server will hold the current data. But due to dynamic data increase future you may end up with no space on disk or your server cannot handle much data means, then you need to configure more no of shards like 2 or 3 shards (its up to your requirements) for each index. But there shouldn't any replica.
Situation 3) In this situation you the combined situation of situation 1 & 2. then you need to combine both configuration. Consider your data increased dynamically and also you need high availability and failover. Then you configure a index with 2 shards and 1 replica. Then you can share data among nodes and get an optimal performance..!
Note: Then query will be processed in each shard and perform mapreduce on results from all shards and return the result to us. So the map reduce process is expensive process. Minimum shards gives us optimal performance
If you are using only one node in production then, only one primary shards is optimal no of shards for each index.
Hope it helps..!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: The definitive guide to elasticsearch
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory dedicated to the heap means less available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Less shards is better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Less servers is better (the ideal is 1 server (with 1 shard)]):
The load on an index can only be split across nodes by sharding (A shard is enough to use all resources on a node). More shards allow to use more servers but more servers bring more overhead for data aggregation... There is no free lunch.
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards.
Note: Most people tend to overestimate their growth, try to be realistic. For instance, this 111GB example is already a BIG index. For comparison the stackoverflow index is 430 GB (2016) and it's a top 50 site worldwide, made entirely of written texts by millions of people.
Usage: Index by time
When there're too much data for a single index or it's getting too annoying to manage, the next thing is to split the index by time period.
The most extreme example is logging applications (logstach and graylog) which are using a new index every day.
The ideal configuration of 1-single-shard-per-index makes perfect sense in scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: Let's imagine a popular internet forum with monthly indices. 99% of requests are hitting the last index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: It's probably unnecessary optimization. A 99% hitrate is unlikely in the real world and the shard replica could distribute part of the read-only load anyway).
Usage: Going Exascale (just for the record)
ElasticSearch is magic. It's the easiest database to setup in cluster and it's one of the very few able to scale to many nodes (excluding Spanner ).
It's possible to go exascale with hundreds of elasticsearch nodes. There must be many indices and shards to spread the load on that many machines and that takes an appropriate sharding configuration (eventually adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might be also a good idea to have more than one primary shard per node, depends on use case. I have found out that bulk indexing was pretty slow, only one CPU core was used - so we had idle CPU power and very low IO, definitely hardware was not a bottleneck. Thread pool stats shown, that during indexing only one bulk thread was active. We have a lot of analyzers and complex tokenizer (decomposed analysis of German words). Increasing number of shards per node has resulted in more bulk threads being active (one per shard on node) and it has dramatically improved speed of indexing.
Number of primary shards and replicas depend upon following parameters:
No of Data Nodes: The replica shards for the given primary shard meant to be present on different data nodes, which means if there are 3 data Nodes: DN1, DN2, DN3 then if primary shard is in DN1 then the replica shard should be present in DN2 and/or DN3. Hence no of replicas should be less than total no of Data Nodes.
Capacity of each of the Data Nodes: Size of the shard cannot be more than the size of the data nodes hard disk and hence depending upon the expected size for the given index, no of primary shards should be defined.
Recovering mechanism in case of failure: If the data on the given index has quick recovering mechanism then 1 replica should be enough.
Performance requirement from the given index: As sharding helps in directing the client node to appropriate shard to improve the performance and hence depending upon the query parameter and size of the data belonging to that query parameter should be considered in defining the no of primary shards.
These are the ideal and basic guidelines to be followed, it should be optimized depending upon the actual use cases.
I have not tested this yet, but aws has a good articale about ES best practises. Look at Choosing Instance Types and Testing part.
Elastic.co recommends to:
[…] keep the number of shards per node below 20 per GB heap it has configured

Resources