ElasticSearch replica and nodes - elasticsearch

I am testing clustering with Elasticsearch and have a question about replicas across nodes.
As you can see in the screenshot from Head, I have 2 indexes:
movies has 5 shards and 2 replicas
students has 5 shards and 1 replica
Which one is better, and which one is faster, with 3 active nodes, and why?

The costs of having more replicas are:
more storage space required (obviously)
lower indexing performance
while the advantages are:
better search performance
better resiliency
Note that even with 2 replicas, your cluster cannot necessarily endure 2 nodes going down: all indexing requests would fail if only one out of the 3 copies of a shard is available (because of the indexing quorum).
For a detailed explanation, please refer to this official document.
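The indexing quorum mentioned above can be sketched in a few lines. This is only an illustration of the legacy rule, int((primary + number_of_replicas) / 2) + 1, as I understand it from the older Elasticsearch documentation. The function names are hypothetical, and the special case for 0 or 1 replica reflects my reading that the quorum only comes into play with more than 1 replica.

```python
def write_quorum(number_of_replicas: int) -> int:
    """Shard copies that must be active before a write proceeds (legacy quorum rule)."""
    if number_of_replicas <= 1:
        # Assumption: with 0 or 1 replica only the primary is required.
        return 1
    copies = 1 + number_of_replicas  # one primary plus its replicas
    return copies // 2 + 1


def can_index(number_of_replicas: int, active_copies: int) -> bool:
    """True if enough copies of the shard are active to accept an indexing request."""
    return active_copies >= write_quorum(number_of_replicas)


# The two indexes from the question: movies (2 replicas), students (1 replica).
for name, replicas in (("movies", 2), ("students", 1)):
    for active in range(1, replicas + 2):
        print(f"{name}: {active} active copies -> can index: {can_index(replicas, active)}")
```

For movies this gives a quorum of 2, so indexing stops once only one copy of a shard is left, which is the situation the note describes.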

"Better" is subjective.
With two replicas, you can handle two of the three machines in your cluster going down, though at the price of writing all the data to every machine. Read performance should also be higher as the cluster has more nodes from which to request the data.
With one replica, you can only survive the outage of one machine in your cluster, but you'll get a performance boost by writing 2 copies of the data across 3 servers (less IO on each server).
So it comes down to risk and performance. Hope that helps.

Related

Consistency and Partition-Tolerance in Elasticsearch

I am new to Elasticsearch (version 7.4), and even after a lot of reading it is still not clear to me how many shards and nodes are preferred for a particular index. As of now, I have configured 3 shards and 2 replicas on 3 nodes (each with 8GB RAM and a 500GB HDD), with 55GB of data in one index.
So I need your views/suggestions on the following points:
Is the above number of shards, nodes, and replicas sufficient?
For the CAP theorem I prefer CP, i.e. consistency and partition tolerance, for this 3-node cluster.
For consistency, I configured write_consistency=all.
For partition tolerance, I set the master-eligible nodes to (N/2) + 1, which in my case is 3.
I can hopefully give you some useful advice from my time running Elasticsearch clusters :)
1)
Shards: See this blog post for more information, but your average shard will be 55gb/3 = 18gb which is a good shard size (in my experience it's best to keep shards between 5gb-25gb, the ES docs recommend this as well).
Replicas: 2 replicas is my go-to for a good balance between failure tolerance and performance, so this is good.
Nodes: Those 3 nodes should be sufficient, and you won't need that much disk. With 2 replicas you'll have roughly 55gb * 3 = 165gb of data stored (could be more depending on your mapping) across 1500gb of hard drive, so perhaps you could save some money by using nodes with 100gb disks.
2)
For partition tolerance I might suggest setting write_consistency=quorum. That way, even if you lose a node and therefore 1 replica shard you'll still be able to write with 1 primary and 1 replica left. Otherwise, you'd need to reboot/recreate that node to start writing again. See https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html#index-consistency for more details.
Master-eligible: Yes I recommend a minimum of 3 master nodes, so you'll want to set all 3 of these nodes to be master and data nodes.
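The sizing arithmetic in this answer can be reproduced with a small sketch. The function name is made up for illustration, and it assumes data is spread evenly across shards and nodes, which a real cluster only approximates.

```python
def shard_plan(data_gb: float, primaries: int, replicas: int, nodes: int):
    """Return (average primary shard size, total stored data, data per node) in GB."""
    avg_shard_gb = data_gb / primaries   # size of one primary shard
    total_gb = data_gb * (1 + replicas)  # primaries plus all replica copies
    per_node_gb = total_gb / nodes       # assumes perfectly even balancing
    return avg_shard_gb, total_gb, per_node_gb


avg, total, per_node = shard_plan(data_gb=55, primaries=3, replicas=2, nodes=3)
print(f"average shard size: {avg:.1f} GB")
print(f"total stored:       {total:.0f} GB")
print(f"per node:           {per_node:.0f} GB")
```

With the numbers from the question this gives roughly 18 GB per shard, 165 GB in total and 55 GB per node, before any overhead from the mapping.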

RethinkDB Replica Sets and Shards for HA

I would like some guidance on configuring replicas and shards for a RethinkDB cluster.
Let's say my cluster consists of 4 instances, all in the same region but in different AZs. Should I set the number of replicas to 4? What are the tradeoffs between 3 and 4 replica sets in this configuration? How do I determine how many shards to create, and does this impact disk use or performance depending on the number of replica sets I have chosen?
For best performance you usually want one shard per server. (The only exception to this is if your servers have a lot of processors, in which case you might get better performance from having 2-3 shards per server.) The number of replicas should be at least 3 if you want auto-failover to work; making it higher will allow your cluster to survive a higher number of concurrent server failures.

How many shards should I use with Elasticsearch on a dev & CI environment?

By default Elasticsearch is configured to start with 5 shards.
Is there a reason to use 5 shards locally (on my development machine) and on the continuous integration server (for integration tests)? Is it better to use 1?
Obviously I don't care about scalability in those cases, I just want the simplest setup.
The simplest setup is 1 primary shard, 0 replicas.
If you only have one node and replica count is >0 it will always be yellow. Not a problem per se, but those will not be needed.
If you want to test search response time with that one shard, for example, it depends on some factors if 1 is enough or you need more. The simplest rule of thumb is to have shards no larger than 30-50GB, for example. But this number also depends on factors.
So, I'd say if you have one node, start with 1 primary, 0 replicas. If that primary is too "large", think about having more primaries (each shard will do part of the work and each will use one core CPU for searching).
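As a rough sketch, the "1 primary, 0 replicas" setup corresponds to an index settings body like the one built below. The helper name is hypothetical and nothing is sent to a cluster here. The body would be passed when creating the index, and number_of_shards and number_of_replicas are the standard index settings.

```python
import json


def index_settings(primaries: int = 1, replicas: int = 0) -> dict:
    """Build the settings body for creating an index with the given shard layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,   # fixed once the index is created
                "number_of_replicas": replicas,  # can be changed later
            }
        }
    }


# Simplest single-node setup for a dev machine or CI server.
body = index_settings()
print(json.dumps(body, indent=2))
```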
Once you've pushed some data with a specific shard configuration, you cannot set a different number of shards without re-indexing your data. So my guess is that the default configuration of Elasticsearch is made so that you can scale your cluster to 5 nodes (then each node gets one shard) without headaches.
From the Elasticsearch documentation:
A new index in Elasticsearch is allotted five primary shards by default. That means that we can spread that index out over a maximum of five nodes, with one shard on each node. That’s a lot of capacity, and it happens without you having to think about it at all!

ElasticSearch - Optimal number of Shards per node

I would appreciate it if someone could suggest the optimal number of shards per ES node for best performance, or a recommended way to arrive at the number of shards one should use given the number of cores and the memory footprint.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why would you want to have more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). That's kind of a big deal to recreate all that data, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, that means that you can scale out to up to 6 nodes at a later point easily without resharding.
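To illustrate the planned-scalability point, here is a minimal sketch of how a fixed number of shards spreads over a growing number of nodes. It assumes perfectly even balancing and ignores replicas, and the function name is made up for the example.

```python
from typing import List


def shards_per_node(total_shards: int, nodes: int) -> List[int]:
    """Spread a fixed number of shards as evenly as possible over the nodes."""
    base, extra = divmod(total_shards, nodes)
    # The first `extra` nodes carry one shard more than the rest.
    return [base + (1 if i < extra else 0) for i in range(nodes)]


for nodes in (3, 6, 7):
    print(nodes, "nodes:", shards_per_node(6, nodes))
```

With 6 shards, 3 nodes hold 2 shards each and 6 nodes hold 1 each. A 7th node would receive nothing from this index, which is why the shard count caps how far the index can scale out.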
There are three situations to consider before sharding.
Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding.
In this case, select the number of shards according to the number of nodes (ES instances) you want to use in production.
Say you want to run 3 nodes in production. Then you need to choose 1 primary shard and 2 replicas for every index. If you choose more shards than you need.
Situation 2) Your current server can hold the current data, but because the data keeps growing you may eventually run out of disk space, or the server may not be able to handle that much data. Then you need to configure more shards, such as 2 or 3 (it's up to your requirements), for each index. But there shouldn't be any replicas.
Situation 3) This is the combination of situations 1 and 2, so you need to combine both configurations. Say your data keeps growing and you also need high availability and failover. Then you configure an index with 2 shards and 1 replica, so you can share data among nodes and get optimal performance.
Note: A query is processed in each shard, and then a map-reduce step is performed on the results from all shards before the result is returned to us. This map-reduce step is expensive, so the minimum number of shards gives optimal performance.
If you are using only one node in production, then one primary shard is the optimal number of shards for each index.
Hope it helps!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: The definitive guide to elasticsearch
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory dedicated to the heap means less available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Fewer shards is better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers: https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Fewer servers is better (the ideal is 1 server with 1 shard):
The load on an index can only be split across nodes by sharding (a single shard is enough to use all the resources on a node). More shards make it possible to use more servers, but more servers bring more overhead for data aggregation... There is no free lunch.
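The two heap rules above (at most 32 GB, at most 50% of server memory) combine into a simple calculation. This is a sketch with a hypothetical function name. In practice the compressed-pointer cutoff sits slightly below 32 GB, so treat the cap as an upper bound rather than a target.

```python
def recommended_heap_gb(server_ram_gb: float, max_heap_gb: float = 32.0) -> float:
    """Heap size following the two rules: at most half of RAM, and at most ~32 GB."""
    half_of_ram = server_ram_gb / 2  # leave the other half to the filesystem cache
    return min(half_of_ram, max_heap_gb)


for ram in (16, 50, 64, 128):
    print(f"{ram} GB server -> {recommended_heap_gb(ram):.0f} GB heap")
```

A 64 GB server lands exactly on the 32 GB cap, which matches the "common sweet spot" remark. Beyond that, extra memory goes to the filesystem cache rather than the heap.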
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards.
Note: Most people tend to overestimate their growth, try to be realistic. For instance, this 111GB example is already a BIG index. For comparison the stackoverflow index is 430 GB (2016) and it's a top 50 site worldwide, made entirely of written texts by millions of people.
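The answer does not spell out how it arrives at 5 shards. One reading that reproduces the number, and it is only an assumption on my part, is to keep each shard no larger than the heap of one node and round up.

```python
import math


def shards_for_index(index_gb: float, heap_gb: float) -> int:
    """Number of shards so that no shard exceeds one node's heap (assumed rule)."""
    return math.ceil(index_gb / heap_gb)


# 111 GB index on servers with a 25 GB heap, as in the example above.
print(shards_for_index(index_gb=111, heap_gb=25))
```

111 / 25 is about 4.4, so rounding up gives the 5 shards mentioned in the text.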
Usage: Index by time
When there is too much data for a single index, or it's getting too annoying to manage, the next step is to split the index by time period.
The most extreme example is logging applications (logstash and graylog), which use a new index every day.
The ideal configuration of 1 single shard per index makes perfect sense in this scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: Let's imagine a popular internet forum with monthly indices. 99% of requests are hitting the last index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: It's probably unnecessary optimization. A 99% hitrate is unlikely in the real world and the shard replica could distribute part of the read-only load anyway).
Usage: Going Exascale (just for the record)
ElasticSearch is magic. It's the easiest database to set up in a cluster and it's one of the very few able to scale to many nodes (excluding Spanner).
It's possible to go exascale with hundreds of elasticsearch nodes. There must be many indices and shards to spread the load on that many machines and that takes an appropriate sharding configuration (eventually adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might also be a good idea to have more than one primary shard per node, depending on the use case. I found that bulk indexing was pretty slow and only one CPU core was used, so we had idle CPU power and very low IO; hardware was definitely not the bottleneck. Thread pool stats showed that during indexing only one bulk thread was active. We have a lot of analyzers and a complex tokenizer (decomposed analysis of German words). Increasing the number of shards per node resulted in more bulk threads being active (one per shard on the node), and it dramatically improved indexing speed.
The number of primary shards and replicas depends on the following parameters:
Number of data nodes: The replica shards for a given primary shard are meant to live on different data nodes. With 3 data nodes (DN1, DN2, DN3), if the primary shard is on DN1, the replica shards should be on DN2 and/or DN3. Hence the number of replicas should be less than the total number of data nodes.
Capacity of each data node: A shard cannot be larger than the data node's hard disk, so the number of primary shards should be defined based on the expected size of the index.
Recovery mechanism in case of failure: If the data in the index can be recovered quickly, then 1 replica should be enough.
Performance requirements of the index: Sharding helps direct the client node to the appropriate shard, which improves performance, so the query parameter and the size of the data belonging to that query parameter should be considered when defining the number of primary shards.
These are basic guidelines; they should be tuned to the actual use case.
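The first guideline (fewer replicas than data nodes) follows from the fact that two copies of the same shard are not placed on the same node. A small sketch, with a hypothetical function name, of how many replica copies would stay unassigned otherwise:

```python
def unassigned_replica_copies(primaries: int, replicas: int, data_nodes: int) -> int:
    """Replica copies that cannot be placed because each copy needs its own node."""
    # One node holds the primary, so at most data_nodes - 1 replicas fit per shard.
    placeable = min(replicas, data_nodes - 1)
    return primaries * (replicas - placeable)


print(unassigned_replica_copies(primaries=5, replicas=2, data_nodes=3))  # all placed
print(unassigned_replica_copies(primaries=5, replicas=3, data_nodes=3))  # one per shard left over
print(unassigned_replica_copies(primaries=1, replicas=1, data_nodes=1))  # single node
```

Any non-zero result means some replicas have nowhere to go, which is the case where a cluster reports yellow status.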
I have not tested this yet, but AWS has a good article about ES best practices. Look at the Choosing Instance Types and Testing part.
Elastic.co recommends to:
[…] keep the number of shards per node below 20 per GB heap it has configured
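Read literally, the recommendation quoted above turns into a simple upper bound per node. The function name is illustrative, and the 25 GB heap is just the figure used in an earlier answer in this thread.

```python
def max_shards_per_node(heap_gb: float, shards_per_gb_heap: int = 20) -> int:
    """Upper bound on shards per node under the '20 shards per GB of heap' guideline."""
    return int(heap_gb * shards_per_gb_heap)


print(max_shards_per_node(25))
```

This is a ceiling rather than a target, and the other answers here argue for staying far below it.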

Shards / Replicas settings for high availability

We have a Java application with embedded Elasticsearch in a cluster of 14 nodes. All the data resides in a central database and is indexed in Elasticsearch for querying. A full reindex can be done at any time.
The system is very query-heavy and the number of writes is small. The number of documents will not be higher than, say, 300,000.
The size of each document varies greatly, from just a couple of ids to text extracted from, e.g., Word documents of several pages.
I want to make sure that in case of a total breakdown, it should be sufficient that one or two nodes are available for the system to work.
Write consistency should not be a problem since the master copy of the data is in the database, and it seems that ES is capable of resolving conflicting data by using the newest version (which should be all right in our case)
My first thought is to use 1 shard and 13 replicas. This will naturally ensure that all nodes have access to all data. The same could be accomplished with 2 shards / 13 replicas, which suggests that to ensure all data is available on every node, the number of replicas should be the number of nodes - 1, regardless of the number of shards (which could be anything).
If the requirement is relaxed to "2 nodes should be up at any time", then a shard / replica distribution of "x / number of nodes - 2" should be sufficient.
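The replica arithmetic above can be written down as a small sketch. It assumes the goal is that any surviving group of nodes of the given size still holds a copy of every shard, and the function name is hypothetical.

```python
def replicas_needed(nodes: int, min_nodes_up: int) -> int:
    """Replicas per shard so that any `min_nodes_up` surviving nodes hold all data."""
    # Every group of `min_nodes_up` nodes must contain at least one copy of each
    # shard, so each shard needs nodes - min_nodes_up + 1 copies; one is the primary.
    return nodes - min_nodes_up


print(replicas_needed(nodes=14, min_nodes_up=1))
print(replicas_needed(nodes=14, min_nodes_up=2))
```

For 14 nodes this gives 13 replicas when a single surviving node must be enough and 12 when two nodes are always up. The result does not depend on the number of primary shards.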
So, for the question:
Assuming the above setup and that my reasoning is correct, would a setup with 1 shard / 13 replicas make sense, or would there be anything to gain by adding more shards and running, e.g., a 4 shards / 13 replicas setup?
After a good bit of research and talking to ES gurus:
As long as the shard size is small enough, the most efficient way of setting up this cluster would indeed be 1 shard only, with 13 replicas. I have not been able to pinpoint the shard size at which this starts to perform worse.
If the index is big, you will need more than one shard (if you want performance). Do you really need 13 replicas? If you configure only 2 replicas, ES will keep it that way: if the node holding the primary fails, ES will create a new replica. Maybe you will need a balancer node too.

Resources