I am new to Elasticsearch. Suppose we have a two-node cluster and a config of 2 primary shards and 1 replica for our single index, so node 1 holds P0 and R1 and node 2 holds P1 and R0. Now suppose I later reduce the number of replicas to 0. Will the shards P0 and P1 automatically be able to grow into the disk space vacated by the replicas, giving me more disk space for indexing than I had when the replicas existed?
A replica shard takes more or less the same space as its primary, since both contain the same documents. So, say you have indexed 1 million documents into your index: each primary shard then contains roughly half of them, i.e. 500K documents, and each replica contains the same number of documents as well.
If each document weighs 1KB, then:
The primary shard P0 holds 500K documents weighing 500MB
The replica shard R0 holds 500K documents weighing 500MB
The primary shard P1 holds 500K documents weighing 500MB
The replica shard R1 holds 500K documents weighing 500MB
Which means that your index occupies 2GB of disk space across your two nodes (1GB per node). If you later reduce the number of replicas to 0, that will indeed free up 1GB of space (500MB per node) that your primary shards will be able to grow into.
However, note that by doing so you certainly gain disk space, but you lose all redundancy: if one node goes down, half of your index becomes unavailable, and you also give up the extra search capacity that replica copies provide, which is the main idea behind replicas to begin with.
The other thing is that the size of a shard is bounded by a practical limit it cannot cross. That limit depends on many factors, among which the amount of heap and the total physical memory you have. If you have 2GB of heap and 50GB of disk space, you cannot expect to index 50GB of data into your index; that won't work, or it will be very slow and unstable.
=> So disk space alone should not be the main driver for sizing your shards. Having enough disk space is a necessary condition but not a sufficient one; you also need to look at the RAM and the heap allocated to your ES node.
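If you do decide to drop the replicas, the operation is a single dynamic settings update and requires no reindexing. Below is a minimal sketch using Python with the requests library against the standard REST API; the endpoint URL and the index name my-index are placeholders for your own setup.

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "my-index"             # placeholder index name

# number_of_replicas is a dynamic index setting, so it can be
# changed on a live index without reindexing.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 0}},
)
resp.raise_for_status()

# Check disk usage and shard count per node; the replica copies are
# deleted in the background, so the freed space shows up shortly after.
print(requests.get(f"{ES}/_cat/allocation?v").text)
```

Setting the value back to 1 later rebuilds the replicas, provided the nodes still have enough free disk space to hold them.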
Related
I have a cluster consisting of 4 data nodes and 1 master node.
Each node has:
CPU: 4 cores
RAM: 16 GB
and our largest index is 215 gigabytes (1 primary and 1 replica shard)
on peak days, this index is very heavily loaded (we use aggregation queries, since the index is used to store and send notifications to users), which negatively affects the operation of the entire cluster
the developers propose allocating a separate cluster for this index, while I propose adding 3 more machines to the existing cluster, using shard allocation awareness only for this index, and splitting it into 3 primary shards with 1 replica
do you think this is the right approach? or what is the best way to do it, in your opinion?
215 gigabytes in a single primary shard is clearly not optimal and most probably the cause of your issues.
The official recommendation is to not store more than 10GB to 50GB per shard, so you should split your index into at least 4 or 5 shards (215GB / 4 ≈ 54GB per shard, 215GB / 5 ≈ 43GB).
Then, if your index is loaded because of writes, you should make sure to allocate one primary shard to each of your data nodes, so that all writes can happen in parallel on all four of them. If the index is loaded because of searches, it doesn't really matter where the primaries and replicas reside, but with big indices it's always best to spread the load as evenly as possible across all the computing power you have.
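Since the number of primary shards cannot be changed in place, one way to carry this out is to create a new index with more primaries and reindex into it. The sketch below uses Python with the requests library against the standard REST API; the endpoint and the index names notifications / notifications-v2 are placeholders, and the shard counts are just the figures discussed above.

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

# 1. Create the target index with more primary shards.
requests.put(
    f"{ES}/notifications-v2",
    json={"settings": {"number_of_shards": 4, "number_of_replicas": 1}},
).raise_for_status()

# 2. Copy the documents over. For a 215GB index this runs for a long time,
#    so submit it as a background task and poll it via the _tasks API.
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={"source": {"index": "notifications"}, "dest": {"index": "notifications-v2"}},
)
resp.raise_for_status()
print(resp.json())   # contains the task id to poll
```

The _split API can also grow the shard count of an existing index when its constraints are met (the source must be made read-only first), but reindexing into a new index is the most general route.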
I have an 8-node cluster (3 master + 3 data + 2 coordinating). Each data node has a 10GB heap and 441GB of disk space, for a total of about 1.2TB.
Each day I ingest 32.73GB of data and 26 shards are created across 11 indices. So let's suppose the retention period is 30 days: on the 30th day the cluster would hold 982GB of data and 780 shards in total, i.e. 260 shards per data node, with an average shard size of roughly 1.26GB (982GB / 780 shards). I read in this documentation that a node with a 30GB heap can handle 600 shards. So the question is: can a 10GB heap handle 260 shards?
The article you read can be considered a good general recommendation, but there are various factors which can affect it: the size of the indices, the size of the shards, the size of the documents, the type of disk, the current load on the system, and so on. In the same document you will notice the recommended shard size is between 10 and 50GB, while your shards are far smaller than that (around 1.26GB on average), so based on this I can say a 10GB heap can easily handle 260 shards in your case. You should still benchmark your cluster and read more about how ES internally stores and searches the data, so that it's easy for you to fine-tune it.
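To verify those estimates against the live cluster rather than back-of-the-envelope math, the _cat APIs give per-node shard counts, disk usage and heap pressure. A small sketch with Python and the requests library (the endpoint is a placeholder):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

# Shards and disk used per node -- a sanity check that the per-node
# shard count and data volume match the calculation above.
print(requests.get(f"{ES}/_cat/allocation?v").text)

# Heap usage per node; sustained values well above ~75% suggest the
# shard count / data volume is too much for the configured heap.
print(requests.get(f"{ES}/_cat/nodes?v&h=name,node.role,heap.percent,heap.max").text)
```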
I'm doing some benchmarks on a single-node Elasticsearch cluster.
I ran into the situation that more shards reduce indexing performance, at least on a single node (both in latency and throughput).
These are some of my numbers:
With 1 shard, the index handled over 6K documents per minute
With 5 shards, it handled over 3K documents per minute
With 20 shards, it handled over 1K documents per minute
I had the same results with the bulk API, so I'm wondering what the relation is and why this happens.
Note: I don't have a resource problem; resources (CPU and memory) are not the bottleneck.
Just to get everyone on the same page:
Your data is organized in indices, each made of shards and distributed across multiple nodes. When a new document needs to be indexed, an id is generated and the destination shard is calculated based on this id. The write is then delegated to the node holding the calculated destination shard. This distributes your documents pretty evenly across all of your shards.
Finding a document by id is now easy, as the shard containing the wanted document can be calculated from the id alone; there is no need to search all shards. By the way, that's the reason why you can't change the number of shards afterwards: a changed shard count results in a different document distribution across the shards.
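Here is a minimal model of that routing decision, just to make the point concrete; it is an illustration only, since the real implementation hashes the routing value with murmur3 and takes routing shards into account rather than using Python's hash.

```python
def route_to_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Simplified Elasticsearch routing:
    shard = hash(_routing) % number_of_primary_shards,
    where _routing defaults to the document id."""
    return hash(doc_id) % number_of_primary_shards

# The same id maps to a different shard once the shard count changes,
# which is why the primary shard count is fixed at index creation time.
for n in (5, 6):
    print(n, route_to_shard("my-doc-42", n))
```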
Now, just to make it clear: each shard is a separate Lucene index, made of segment files located on your disk. When writing, new segments are created, and once a certain number of segment files is reached, they are merged.
So introducing more shards without distributing them to other nodes just means higher I/O and memory consumption for your single node.
While searching, the query is executed against each shard, and afterwards the results of all shards need to be merged into one result: more shards, more CPU work to do...
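You can see this per-shard overhead on your own node through the _cat APIs, which list every Lucene segment of every shard copy. A quick sketch with Python and the requests library; the endpoint and index name are placeholders.

```python
import requests
from collections import Counter

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "my-index"             # placeholder index name

# Each row describes one Lucene segment belonging to one shard copy.
segments = requests.get(f"{ES}/_cat/segments/{INDEX}?format=json").json()

# Count segments per shard copy: more shards on a single node means more
# independent sets of segments to write, merge and keep in memory.
per_shard = Counter((row["shard"], row["prirep"]) for row in segments)
for (shard, prirep), count in sorted(per_shard.items()):
    print(f"shard {shard} ({prirep}): {count} segments")
```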
Coming back to your question:
For your write-heavy indexing case with just one node, the optimal number of indices and shards is 1!
But for the search case (when not accessing by id), the optimal number of shards per node is the number of available CPUs; that way, searching can be done in multiple threads, resulting in better search performance. Correction: searching and indexing are multithreaded; a single shard can fully utilize all CPU cores of a node.
But what are the benefits of sharding?
Availability: By replicating the shards to other nodes, you can still serve requests if some of your nodes can't be reached anymore!
Performance: Distributing the primary shards to different nodes distributes the workload too.
So if your scenario is write heavy, keep the number of shards per index low. If you need better search performance, increase the number of shards, but keep the "physics" in mind. If you need reliability, take the number of nodes/replicas into account.
Further reading:
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
https://www.elastic.co/de/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
I ran into the situation that more shards reduce indexing performance, at least on a single node (both in latency and throughput)
For reference: Elasticsearch is a distributed database. Data is stored in an "index", and the index is split into "shards". Each shard is allocated to a node (a different node if possible).
Having more shards allows you to use more machines. This is very much how the "distributed" in "distributed database" actually works. Elasticsearch automatically allocates and moves shards in the background to balance disk usage across all machines.
With 1 shard, the data lives on one node; this gives you a baseline of N reads and M writes per second.
With 3 shards, the data is split across three nodes, which gives you roughly 3 times the throughput.
Of course this assumes that there are 3 machines available. With a single machine, that machine does all the processing either way, and having more shards has no effect.
There is a bit of overhead with sharding: queries have to be distributed and the results merged back, so doubling the number of shards will not exactly double performance (expect something in the order of +90%).
Your cluster is a single machine, so you lose performance when you increase the number of shards: you are only adding overhead.
P.S. Shards have one replica by default; the replica takes over if the primary is gone (machine failure), which is how resiliency works. An index with 5 primary shards and 1 replica (5 primary + 5 replica copies) can fully utilize 10 nodes, meaning it takes few shards to use many, many nodes.
P.P.S. In my experience, a configuration of 5 shards is a maximum; you should never set more than that unless you are working with large clusters (10+ machines) or terabyte-scale indices.
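The scaling arithmetic from the last two paragraphs, written out as a small sketch (plain Python, nothing Elasticsearch-specific):

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies an index places on the cluster:
    each primary plus `replicas` additional copies of it."""
    return primaries * (1 + replicas)

# 5 primaries with 1 replica -> 10 shard copies, so up to 10 nodes
# can each hold one copy and share the work.
print(shard_copies(5, 1))    # 10

# 20 primaries with no replicas on a single machine -> 20 copies,
# but still only one machine doing the work, plus coordination and
# merge overhead, which is why the benchmark above got slower.
print(shard_copies(20, 0))   # 20
```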
I see that indexing performance degraded over a period of time in Elasticsearch: memory usage slowly increased until it reached 100%, and at that point I cannot index any more data. I have the default shard settings: 5 primaries and 1 replica. My index is time-based, with a new index created every hour to store coral service logs from various teams. An index is about 3GB with 5 shards (about 6GB with the replica), and about 1.7GB with a single shard and 0 replicas.
I am using EC2 i2.2xlarge hosts, which offer 1.6TB of disk, 61GB of RAM and 8 cores.
I have set the heap size to 30GB.
Following is node statistics:
https://jpst.it/1eznd
Could you please help in fixing this? My whole cluster went down, and I had to delete all the indices.
My ES nodes are using the default settings: 5 primary shards, 1 replica.
Does changing the settings from 5 shards to 3 shards with 1 replica have any effect on the disk space used, or is disk usage solely determined by the indices and documents?
My nodes keep running out of space, and I'm wondering if changing the number of shards and replicas will affect disk space.
Reducing the number of shards won't help: you will still have the same amount of data, it just won't be sharded as much. You can decrease disk usage by reducing the number of replicas, disabling the _all and _source fields, and using a leaner mapping (e.g. use the keyword analyzer for all strings, throw away field norms, term vectors, etc.). This saves some space, but the price is reduced "searchability", so careful testing is required to make sure your search requirements are still met.
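For illustration, here is a sketch of what such a slimmed-down index could look like, using Python with the requests library against the standard REST API. The endpoint, index name and field names are placeholders; note that _all only exists on pre-6.x clusters, and that disabling _source breaks reindexing, partial updates and highlighting, so whether each saving is acceptable depends on your search requirements.

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

slim_index = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 0,        # fewer replica copies is the biggest disk saving
    },
    "mappings": {
        "_source": {"enabled": False},  # saves space, but no reindex/update/highlighting
        "properties": {
            # keyword field: exact matches only, no analysis chain
            "status": {"type": "keyword"},
            # text field with norms disabled (term vectors are off by default)
            "message": {"type": "text", "norms": False},
        },
    },
}

requests.put(f"{ES}/logs-slim", json=slim_index).raise_for_status()
```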