I have 4 nodes, 3 shards and 3 replicas in an Elasticsearch cluster. I understand that data is split and stored across the shards. When a query is requested, how is data fetched from all three shards?
Storage Structure :
Node 1: Shard1(Primary), Shard2(replica), Shard3(replica)
Node 2: Shard2(Primary), Shard1(replica)
Node 3: Shard3(Primary), Shard1(replica)
Node 4: Shard2(replica), Shard3(replica)
Thanks,
Kumar.
A search can be served by a primary shard or by any of its replica shards, since a replica holds a full copy of the primary's data. Elasticsearch never allocates a replica on the same data node as its primary, so replicas also improve high availability.
But since your data is divided into 3 primary shards, a search query has to hit at least 3 shards: one copy (primary or replica) of each.
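To make the "one copy of each shard" point concrete, here is a minimal Python sketch that uses the layout from the question. The random choice is only a stand-in assumption; the real copy selection in Elasticsearch takes load and other factors into account, and the function names are made up for illustration.

```python
import random

# Layout copied from the question: node -> list of (shard, role).
LAYOUT = {
    "Node 1": [("Shard1", "primary"), ("Shard2", "replica"), ("Shard3", "replica")],
    "Node 2": [("Shard2", "primary"), ("Shard1", "replica")],
    "Node 3": [("Shard3", "primary"), ("Shard1", "replica")],
    "Node 4": [("Shard2", "replica"), ("Shard3", "replica")],
}


def copies_by_shard(layout):
    """Group every copy (primary or replica) under its shard name."""
    copies = {}
    for node, shards in layout.items():
        for shard, role in shards:
            copies.setdefault(shard, []).append((node, role))
    return copies


def plan_search(layout, rng=random):
    """Pick exactly one copy of each shard to run the query on.

    Random choice is an assumption made for this sketch only.
    """
    return {
        shard: rng.choice(options)
        for shard, options in copies_by_shard(layout).items()
    }


plan = plan_search(LAYOUT)
for shard, (node, role) in sorted(plan.items()):
    print(f"{shard}: query the {role} on {node}")
```

Whatever copies get picked, the plan always contains three entries, one per shard, and the partial results from those three copies are then merged into the final response.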
Related
I'm trying to understand how search queries are divided between primary and replica shards.
I don't clearly understand why it is said that one benefit of replicas is increased search performance. As I understand it, a primary shard can serve search queries just as a replica does. If so, let's assume we have an ES cluster with 2 nodes and the queries are read-only. Will an index with 10 shards perform the same as an index with 5 shards and 1 replica?
First you need to understand what a primary shard and a replica shard are.
A primary shard is where a write request goes first; the write is then replicated to its replica shards. Based on the configured number of replicas, Elasticsearch creates n replica shards for each primary shard.
A document always belongs to exactly one primary shard, but copies of it are present in all of that shard's replicas.
When you search, the query can run on the primary shard (the one main copy) or on any one of its replica shards. These copies can live on different nodes in the cluster (Elasticsearch is a distributed system). To improve performance, Elasticsearch can query any shard that holds a copy, depending on load and various other factors. This is why having replicas increases search performance and why you can search from the replicas.
Do all shards (within an index) have the same content?
If yes, more shards = longer propagation (save) time?
If no, when one of the shards fails = data is incomplete when merging?
First, you need to understand what sharding is and why it's important in distributed systems like Elasticsearch. You can read some good resources on shards here, here and here.
Now coming to your question,
Do all shards (within an index) have the same content?
The answer is no (assuming you are referring to primary shards here; a replica shard is, of course, just a copy of its primary shard). Let's take an example.
Your index contains around 100 million docs and you have a cluster of 10 data nodes. You want to scale the index horizontally, so you start with a setting of 10 primary shards and 1 replica. In this case, Elasticsearch physically divides your data into 10 primary shards, and each primary shard will typically land on a different node of the cluster, as there are 10 data nodes. Similarly, every primary shard has a copy, called a replica, which is placed on a different node from its primary.
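Here is a small Python sketch of how documents can be spread over 10 primary shards so that each shard holds a different slice of the data. It assumes the commonly documented idea of hashing a routing value (by default the document ID) modulo the number of primary shards. `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses, and `pick_shard` is a made-up name.

```python
import zlib

NUM_PRIMARY_SHARDS = 10


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to one primary shard number.

    crc32 is a stand-in hash chosen for this sketch, not the real one.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Count how many of 100,000 sample documents land on each shard.
counts = [0] * NUM_PRIMARY_SHARDS
for i in range(100_000):
    counts[pick_shard(f"doc-{i}", NUM_PRIMARY_SHARDS)] += 1

for shard, count in enumerate(counts):
    print(f"shard {shard}: {count} docs")
```

The same ID always maps to the same shard, which is why a document lives in exactly one primary shard and why the number of primary shards cannot simply be changed on an existing index without reindexing or splitting it.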
Now coming to your follow-up question.
If yes, more shards = longer propagation (save) time? If no, when one of the shards fails = data is incomplete when merging?
As Elasticsearch doesn't store the same data in all the primary shards, the assumption that more shards means a longer propagation or save time doesn't hold. And when one of the shards fails, Elasticsearch recovers its data from its replica shard, which is physically present on a different data node.
Bonus tip: Shards are used to split your data and make your application horizontally scalable, while replicas make your application highly available, as they contain duplicated data, so the application can recover easily from the scenario you just asked about in your follow-up question.
Let me know if you need any clarification or more details.
Short answer:
Q-1: no
If no: if the index has no replica, a failed shard leaves the index as a whole incomplete, but it does not affect the other shards of the index.
Please read this document:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html
How many data nodes do we need in a cluster, as a function of the total number of indexes and shards per index, to ensure that if a node goes down, a primary and its replica shard aren't on the same node?
Note: in our cluster all the indexes contain similar data and all indexes have the same number of shards.
First of all, the primary and replica copies of the same shard are NEVER allocated on the same node. So, if you have 2 nodes and one of them goes down, all the shards on that node become unassigned, and the replicas on the other healthy node are promoted to primary for every shard whose primary was lost.
So, if your index has replica shards configured, then the minimum number of nodes required is 2. However, depending on your data size and usage, you will need to size your ES cluster with an appropriate number of nodes.
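The 2-node case can be sketched in Python as follows. The node names (`node-a`, `node-b`), shard names (`s0`, `s1`) and both helper functions are hypothetical and only model the behaviour described above; this is not Elasticsearch's allocation code.

```python
def has_colocated_copies(layout):
    """True if any node holds more than one copy of the same shard."""
    for shards in layout.values():
        names = [shard for shard, _role in shards]
        if len(names) != len(set(names)):
            return True
    return False


def fail_node(layout, failed):
    """Return the layout after `failed` goes down, promoting replicas
    of the primaries that were lost with it."""
    survivors = {n: list(s) for n, s in layout.items() if n != failed}
    lost_primaries = [shard for shard, role in layout[failed] if role == "primary"]
    for shard in lost_primaries:
        for shards in survivors.values():
            if (shard, "replica") in shards:
                # Promote the surviving replica in place.
                shards[shards.index((shard, "replica"))] = (shard, "primary")
                break
        else:
            raise RuntimeError(f"{shard} has no surviving copy")
    return survivors


# Hypothetical 2-node cluster: 2 primary shards, 1 replica each.
layout = {
    "node-a": [("s0", "primary"), ("s1", "replica")],
    "node-b": [("s1", "primary"), ("s0", "replica")],
}

print(has_colocated_copies(layout))
after = fail_node(layout, "node-a")
print(after)
```

After `node-a` fails, `node-b` holds a primary for both shards, so the index stays fully searchable. The lost replicas cannot be reassigned until a second node is available, because a replica may not sit next to its own primary.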
For example in a cluster of 6 nodes, with 3 primary shards and 1 replica ...
Is the load of the queries shared over all the nodes?
Yes. Any search request sent to ES is routed, for each shard of the index, to one copy of that shard (primary or replica), so the load is spread over the nodes that hold those copies.
You can see in detail how Elasticsearch runs a query after any one of the nodes gets the search request.
This will also give you an idea of how to choose the number of shards for a particular scenario.
And since you mentioned having 6 nodes: if all those nodes are master + data (which is the default config), it is a bad idea to have an even number of master-eligible nodes in ES. You can search for the split-brain issue in Elasticsearch.
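The arithmetic behind the "even number" advice can be sketched in a few lines of Python, assuming a simple majority quorum for master election (more than half of the master-eligible nodes). The function names are made up for this example.

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in range(1, 8):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failures")
```

Under this assumption, 6 master-eligible nodes tolerate the same number of failures as 5 (two), and a network partition that splits them 3/3 leaves neither side with the 4 votes needed to elect a master.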
Why does indexing data into one node of a cluster reflect in all nodes of that cluster in Elasticsearch?
Ex: I have 3 nodes, ex1, ex2 and ex3, and all these nodes share the same index, say example. Once I update one node, say ex1, why does that reflect on all three nodes?
This is because the index example may have multiple shards and each shard may in turn have one or more replicas. These shards (and their replicas) are distributed across data nodes in Elasticsearch to improve resiliency and increase availability. Read more about them here and here.