Why is original data not distributed to one data node? - Elasticsearch

I made an Elasticsearch cluster.
The cluster setup is as follows:
server1 - master node, data node 1
server2 - data node 2, data node 3
I want the original (primary) data to be distributed evenly across all data nodes.
But the primary data is stored on only two data nodes (data node 1, data node 2).
Data node 3 only stores a copy (replica). Why?
How do I spread the primary data evenly across all data nodes?
Here is my cluster:
[cluster screenshot]

I think that you're probably using the Elasticsearch defaults, which means 5 shards and 1 replica per index.
You can get more control over the distribution of your shards and replicas using shard allocation filtering and cluster-level shard allocation settings, especially by tuning the shard balancing heuristics.
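A minimal sketch of both approaches, assuming an index called myindex and a node named data-node-3 (both placeholders for your own setup); the balance values shown are just examples, not recommendations:

# Index-level shard allocation filtering: keep shards of "myindex"
# off a specific node
curl -XPUT 'localhost:9200/myindex/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.exclude._name": "data-node-3"
}'

# Cluster-wide shard balancing heuristics (dynamic settings)
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.balance.shard": 0.45,
    "cluster.routing.allocation.balance.index": 0.55
  }
}'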

Related

In a 3-node Elasticsearch cluster, is a search distributed across all nodes?

If I have 3 data nodes and perform a query with a lot of aggregations, is this search distributed across all the cluster's data nodes?
Or does Elasticsearch elect one node to query and aggregate the data, acting as a load balancer rather than as a "distributed map/reduce"?
If the index you're querying contains more than one shard (whether primary or replica), then those shards will be located on different nodes, hence the query will be distributed to each node that hosts a shard of the index you're querying.
One data node will receive your request and act as the coordinating node. It will check the cluster state to figure out where the shards are located, then it will forward the request to each node hosting a shard, gather the results and send them back to the client.
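If you want to see this for yourself, the _search_shards API lists the shard copies (and the nodes holding them) that a search against a given index would fan out to; myindex below is a placeholder:

# Show which nodes and shards a search on "myindex" would be routed to
curl -XGET 'localhost:9200/myindex/_search_shards?pretty'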

Duplicate data in nodes of a cluster in Elasticsearch

Why does indexing data into one node of a cluster get reflected on all nodes of that cluster in Elasticsearch?
For example: I have 3 nodes ex1, ex2, ex3 and all these nodes share the same index, say example. Once I update one node, say ex1, why is that reflected on all three nodes?
This is because the index example may have multiple shards and each shard may in turn have one or more replicas. These shards (and their replicas) are distributed across data nodes in Elasticsearch to improve resiliency and increase availability. Read more about them here and here.
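You can see that distribution with the _cat/shards API; in the output, the prirep column marks primaries (p) and replicas (r), and the node column shows where each copy lives:

# Show where every shard copy of the index "example" is allocated
curl -XGET 'localhost:9200/_cat/shards/example?v'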

Elasticsearch - What's the ideal shard config for a 4-node cluster?

I have 4 servers with ES installed and running. I am looking to set up 2 shards and corresponding replicas (1 replica per shard).
My challenge is, do I need to make 2 nodes masters and the other 2 nodes just data nodes?
The plan is
Node A acts as Master with 2 Primary Shards = Replica is Node B
Node C acts as Master with 2 Primary Shards = Replica is Node D
Is this an ideal configuration, or is there a better alternative? Also, since they are all clustered, when data is pushed to the cluster, would either of the master nodes take responsibility for distributing the shards between the 2 master nodes?
If I make all 4 nodes both master and data, which config settings will make node A hold the primary shards and node B the replicas, or which config will tell node A that its replica is node B? The same goes for nodes C & D.
Thanks
You have two separate problems here:
Cluster Topology
It is recommended to have three master-eligible nodes in an Elasticsearch cluster. You need this to increase resiliency against node failures and to avoid split-brain problems.
An Elasticsearch node can act both as a master and as a data node. Note that if a node is set to be a master node but not a data node, it cannot store any indexed data (read: shards). Hence, depending on how much data you want to index, you can set one, two, three or even all four nodes as data nodes.
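If you are unsure which roles your nodes currently have, something like the following lists each node along with its roles and which one is the elected master (the exact output columns vary a bit between Elasticsearch versions):

# One row per node; the role and master columns show whether a node
# is a data node and/or the elected master
curl -XGET 'localhost:9200/_cat/nodes?v'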
Data Topology
The number of primary and replica shards again depends on how much data you want to index and on the disk capacity of the data nodes. If unsure, you can start with the default settings of 5 primary shards and 1 replica per primary.
Shards will only be placed on data nodes (it doesn't matter whether those nodes are also master-eligible). Regarding balancing shards between the data nodes, you don't need to worry about it; the master node will take care of it.
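For the plan described in the question (2 primary shards, 1 replica each), a minimal sketch is to set the shard counts at index creation time; myindex is a placeholder, and the master decides which nodes the shards land on:

# Create an index with 2 primary shards and 1 replica per primary,
# i.e. 4 shard copies in total spread across the data nodes
curl -XPUT 'localhost:9200/myindex' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'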

Elasticsearch with two nodes and the default 5 shards?

I have set up a cluster with two nodes, but I have some confusion about shards and replicas.
What I intend is a setup where there is a master (node A) handling writes and a slave (node B) that helps with read and search operations. Ideally, if the master is not functional, I can recover the data from the slave.
I read that the default is 5 shards and 1 replica. Does that mean my primary data would then be automatically split between node A and node B? Would that mean that if one node is down, I would lose half the data?
Given the description of my need above, am I doing it right?
The only config I have changed at this point is the following
cluster:
  name: maincluster
node:
  name: masternode
  master: true
I am really new to Elasticsearch, so please kindly point out if I am missing anything.
5 shards and 1 replica means that your data will be split into 5 shards per index.
Each shard will have one replica (5 more backup shards) for a total of 10 shards spread across your set of nodes.
The replica shard will be placed onto a different node than the primary shard (so that if one node fails you have redundancy).
With 2 nodes and replication set to 1 or more, losing a node will still give you access to all of your data, since a primary shard and its replica will never be on the same node.
I would install the elasticsearch-head plugin; it provides a very graphical view of nodes and shards (primary and replica).
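If you prefer the command line over a plugin, the cluster health API gives a quick summary of shard allocation (this complements, rather than replaces, the head plugin's view):

# "green" status means every primary and every replica shard is assigned
curl -XGET 'localhost:9200/_cluster/health?pretty'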

What happens if an ElasticSearch node/index/shard gets corrupted

I'm new to ES. We've recently set up a 3-node Elasticsearch cluster for our prod app. I just want to understand what would happen if an Elasticsearch node, index, or shard gets corrupted.
Thanks!
What would happen actually depends on how you have set up your ES cluster.
With respect to DATA
If you have a single-node cluster, a corruption would render your ES setup useless. You would, pretty much, need to set up everything from scratch.
If you have multiple nodes in your cluster, there can be the following scenarios:
If you configure a single node as a data node and that node goes down, you would still have the cluster running, but queries would not return any results. You would then need to reconfigure another node to behave as a data node and restart the cluster.
If you have multiple nodes designated as data nodes, then a corruption/failure of one node will only affect that node. The rest of the nodes, and the cluster as a whole, will in essence perform as usual. The only effect is that the data stored on the corrupted node will obviously not be available. The shards on the corrupted node will become unassigned shards and have to be reassigned to some other data node.
If you have replicas enabled, then there will be no effect in terms of data loss. It would simply require the unassigned shards to be reassigned to some new data node (if and when one is added).
It's best to have a multi-node cluster with at least 2 data nodes and replicas enabled to mitigate shard/data node corruption.
This Stackoverflow post explains shards and replicas in an excellent way.
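If a node does fail, you can check which shards ended up unassigned and why; on recent versions something like the following works (the exact columns available depend on your Elasticsearch version):

# List each shard copy with its state and, where supported,
# the reason it is unassigned
curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'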
Edit 1:
This is in response to your comment.
The default settings dictate that each node is master-eligible and also stores data; hence, each of your nodes can become master and will also store data.
Let's consider the nodes as A, B & C.
Initially, one of them will be elected as the master node, e.g. node A.
Now if node A goes down, one of the remaining nodes (B or C) will become the master node. Queries will now only return results from data stored on nodes B & C.
Check out this page for more insights into how the cluster works.
One other way is to take incremental snapshots of your indices and restore from those snapshots when needed.
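A rough sketch of that approach with a shared-filesystem repository; the repository name, snapshot name, and mount path below are placeholders, and the path must be whitelisted via path.repo in elasticsearch.yml:

# Register a filesystem snapshot repository
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# Take a snapshot of all indices (snapshots are incremental)
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'

# Restore from that snapshot later if data is lost or corrupted
curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore'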
