I am new to ElasticSearch and I am confused about the meaning of index recovery.
What are the operations index recovery performs?
Does it mean recovering the data inside the index or allocating unassigned shards?
Index recovery means loading shards from disk and making them usable for your query operations. This can happen when you start a node, create new replicas, add a node to or remove a node from your cluster, or when a node has crashed and is restarting. Multiple operations can be involved in the process. If a shard is coming up, it will ask all other shards what data they have and try to do an integrity check. If a new node has been added and there is no shared disk, then data will have to be moved. If a new primary shard has to be selected, the primary should be the one holding the most data at that time, so nodes need to be in sync. To handle all these cases, many other tasks are carried out during the recovery process.
According to ElasticSearch Reference:
A recovery event occurs anytime an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local store recovery and is the normal way for shards to be loaded from disk when a node starts up.
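To see which kind of recovery each shard went through, the index recovery API (`GET /<index>/_recovery`) reports a type and a stage per shard. Below is a minimal sketch that tallies such a response. The sample data, the index name and the function name are made up for illustration, and the exact type names differ between Elasticsearch versions, so treat them as assumptions.

```python
from collections import Counter

# Hypothetical, trimmed-down sample shaped like a `GET /<index>/_recovery`
# response. Real responses carry many more fields per shard.
SAMPLE_RECOVERY = {
    "logs-2024": {
        "shards": [
            {"id": 0, "type": "EXISTING_STORE", "stage": "DONE", "primary": True},
            {"id": 0, "type": "PEER", "stage": "INDEX", "primary": False},
            {"id": 1, "type": "SNAPSHOT", "stage": "DONE", "primary": True},
        ]
    }
}


def summarize_recoveries(response):
    """Count shard recoveries per (type, stage) pair."""
    counts = Counter()
    for index_info in response.values():
        for shard in index_info.get("shards", []):
            counts[(shard["type"], shard["stage"])] += 1
    return counts


if __name__ == "__main__":
    for (rtype, stage), n in sorted(summarize_recoveries(SAMPLE_RECOVERY).items()):
        print(rtype, stage, n)
```

In this sample, the store-type entry would correspond to the "local store recovery" from the quote, and the peer-type entry to a copy being rebuilt from another node.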
I'm managing an ElasticSearch cluster and I need to add an analyzer to one of my indices. The particular index I want to update is a bit more than 3TB. Will there be an excessive amount of downtime associated with closing and reopening this large of an index to add the analyzer? The documentation doesn't seem to say anything about the processing required to close and open an index.
I have done many rolling restarts and the shard recovery is pretty quick, but I'm guessing that closing and opening an index cannot be done one node at a time with a rolling restart.
As per the official documentation of the open index API:
When opening or closing an index, the master is responsible for
restarting the index shards to reflect the new state of the index. The
shards will then go through the normal recovery process. The data of
opened/closed indices is automatically replicated by the cluster to
ensure that enough shard copies are safely kept around at all times.
This clearly explains that it's not a cheap operation, and if you have many shards in your cluster and your cluster state is big, propagating that update to all the nodes can cause significant overhead.
Apart from this, opening and closing an index also allocates its shards, as explained in the wait for active shards section of the same document:
Because opening or closing an index allocates its shards, the
wait_for_active_shards setting on index creation applies to the _open
and _close index actions as well.
And this one is a major overhead, as it involves moving data (i.e. shards) around the cluster, and yours is a very large index, so it can cause huge data movement in your cluster.
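For reference, adding an analyzer to an existing index usually comes down to three REST calls: close the index, update its analysis settings, and reopen it. The sketch below only builds those requests as data so the sequence is visible. The function name, index name and analyzer definition are hypothetical, and sending the requests with an HTTP client of your choice is left out.

```python
import json


def analyzer_update_requests(index, analyzer_name, analyzer_def):
    """Return the (method, path, body) triples for close -> update -> open."""
    settings_body = {"analysis": {"analyzer": {analyzer_name: analyzer_def}}}
    return [
        ("POST", f"/{index}/_close", None),             # shards stop serving requests
        ("PUT", f"/{index}/_settings", settings_body),  # analysis settings need a closed index
        ("POST", f"/{index}/_open", None),              # shards go through recovery again
    ]


if __name__ == "__main__":
    # Hypothetical analyzer, used for illustration only.
    example = analyzer_update_requests(
        "my-index", "content", {"type": "custom", "tokenizer": "whitespace"}
    )
    for method, path, body in example:
        print(method, path, json.dumps(body) if body is not None else "")
```

The index is unavailable between the first and the last call, so the downtime depends on how long the shards take to recover after `_open`.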
Hope this helps.
We operate an elasticsearch stack which uses 3 nodes to store log data. Our current config is to have indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided to not (yet) spend time for optimization)
After a node outage (let's assume a full disk), I have observed that elasticsearch automatically redistributes its shards to the remaining instances - as advertised.
However this increases disk usage on the remaining two instances, making it a candidate for cascading failure.
Durability of the log data is not paramount. I am therefore thinking about reconfiguring elasticsearch to not create a new replica after a node outage. Instead, it could just run on the primaries only. This means that after a single node outage, we would run without redundancy. But that seems better than a cascading failure. (This is a one time cost)
An alternative would be to just increase disk size. (This is an ongoing cost)
My question
(How) can I configure elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea and the canonical way is to just increase disk capacity?
Rebalancing is expensive
When a node leaves the cluster, some additional load is generated on the remaining nodes:
Promoting a replica shard to primary to replace any primaries that were on the node.
Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
Rebalancing shards evenly across the remaining nodes.
This can lead to quite some data being moved around.
Sometimes, a node is only missing for a short period of time. A full rebalance is not justified in such a case. To account for that, when a node goes down, Elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but then waits for one minute before creating new replicas, to avoid unnecessary copying.
Only rebalance when required
The duration of this delay is a tradeoff and can therefore be configured. Waiting longer means less chance of useless copying but also more chance for downtime due to reduced redundancy.
Increasing the delay to a few hours results in what I am looking for. It gives our engineers some time to react, before a cascading failure can be created from the additional rebalancing load.
I learned that from the official elasticsearch documentation.
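For anyone looking for the concrete setting: the delay is controlled per index by `index.unassigned.node_left.delayed_timeout` (default `1m`). Here is a small sketch that builds the settings update request. The function name and the `6h` value are only examples; pick a timeout that matches how quickly your team can react.

```python
import json


def delayed_allocation_request(timeout="6h", index_pattern="_all"):
    """Build the settings update that delays replica re-allocation after a node leaves."""
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return "PUT", f"/{index_pattern}/_settings", body


if __name__ == "__main__":
    method, path, body = delayed_allocation_request()
    print(method, path)
    print(json.dumps(body, indent=2))
```

Because this is an index-level setting, an update like this only affects indices that already exist. Newly created log indices would need the same setting from an index template.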
I have basic knowledge of Elasticsearch. I came across the following passage in https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html:
In the case that the primary itself fails, the node hosting the primary will send a message to the master about it. The indexing operation will wait (up to 1 minute, by default) for the master to promote one of the replicas to be a new primary.
The question: how does the node hosting the shard know about the failure of the shard? As I understand it, a shard is a Lucene instance that runs on a data node.
Most likely (with some improvements since Elasticsearch version 1.4), this would be detected via checksums: if any segment file within the shard has an incorrect checksum, the shard is marked corrupt.
This may happen on recovery (after the node starts up) or when any I/O operation is done on the segment (i.e. when it is read by a search or by the merge policy).
This page for 7.8 (select the version you use for accurate docs) describes how to discard corrupt data; if the data is important, the best way is to restore from a snapshot:
https://www.elastic.co/guide/en/elasticsearch/reference/7.8/shard-tool.html#_description_7
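To illustrate the general idea of checksum-based corruption detection, here is a toy sketch that appends a CRC32 footer to a blob and verifies it on read. This is not the actual Lucene segment file format. The 4-byte footer layout and both function names are assumptions made for the example.

```python
import struct
import zlib


def write_with_checksum(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload (toy footer layout)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def is_intact(blob: bytes) -> bool:
    """Recompute the CRC32 and compare it with the stored footer."""
    if len(blob) < 4:
        return False
    payload, footer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack(">I", footer)
    return zlib.crc32(payload) == stored


if __name__ == "__main__":
    blob = write_with_checksum(b"segment data")
    print(is_intact(blob))               # intact copy
    print(is_intact(b"X" + blob[1:]))    # first byte flipped
```

A reader that finds a mismatch like this cannot trust the file, which is the point at which a shard copy would be treated as corrupt.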
I guess you are getting confused by this statement:
How does the node hosting the shard know about the failure of the shard? As I
understand it, a shard is a Lucene instance that runs on a data node.
While it's true that every shard is a Lucene instance (index), it's not a 1:1 mapping: one Elasticsearch data node can host multiple shards, not just one, and the failure of a Lucene shard doesn't always mean the failure of the data node.
The node holding the primary shard knows whether it is connected to the network, whether it is able to index the data, and whether the shard is corrupted, as mentioned by #julian. It can then send that information to the master node, which promotes another replica to primary; this is recorded in the cluster state, which all nodes hold.
In the network failure case, all the primary shards hosted on that node will be replaced by other shards, and this is easy to detect because the master will not receive a heartbeat from that data node.
I hope the bold part of my answer is what you were looking for; otherwise feel free to comment and I will try to explain further.
It's confusing at first sight. But if you look deeper, it is still a valid scenario, and it is what the documentation describes at a high level.
Let's say a coordinating node receives a request to index the data. The master node maintains the list of in-sync shard copies. The request is then forwarded to the node which has the primary shard. As you mentioned, a shard is a Lucene instance. The node that received the request has to index the data in the primary shard. If that is not possible, for example because part of the shard is corrupted, it will inform the master so that another primary can be elected.
The master also monitors each shard and tells other nodes to prepare a primary shard if needed, or demotes a shard from primary if needed. The master does more in these cases.
Elasticsearch maintains a list of shard copies that should receive the operation. This list is called the in-sync copies and is maintained by the master node
Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group
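The promotion step can be pictured with a toy model of a replication group. This is only a sketch for intuition and not how Elasticsearch implements it: the class name, node names and the "promote the first in-sync copy" rule are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationGroup:
    primary: str                                           # node holding the primary copy
    in_sync_replicas: list = field(default_factory=list)   # nodes holding in-sync replicas

    def fail_primary(self):
        """Handle a reported primary failure by promoting an in-sync replica.

        Returns the node that held the failed primary.
        """
        if not self.in_sync_replicas:
            raise RuntimeError("no in-sync copy available to promote")
        failed = self.primary
        # Toy rule: promote the first in-sync copy in the list.
        self.primary = self.in_sync_replicas.pop(0)
        return failed


if __name__ == "__main__":
    group = ReplicationGroup("node-1", ["node-2", "node-3"])
    print("failed:", group.fail_primary())
    print("new primary:", group.primary)
```

The important part is that only copies in the in-sync list are candidates, which is why the master keeps that list up to date.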
I have read that when a new indexing request is sent to an ES cluster, ES determines which shard the document should be stored in, depending on routing. Then the node which hosts that primary shard (aka coordinating node) will broadcast the indexing request to each node containing a replica of that shard, and it will respond to the client that the document has been indexed successfully if the primary shard and its replicas stored/indexed that document.
Does that mean that ES supports high availability (node tolerance) for read requests and not for write requests, or is that just the default behavior, which can be changed?
The main purpose of replicas is failover: if the node holding a primary shard dies, a replica is promoted to the role of primary. Also, replica shards can serve read requests, thus improving search performance.
For write requests, though, indexing will be affected if one of your nodes (one that has the primary shard for a live index) suddenly runs out of disk space, because if a node's disk usage hits the configured watermark levels, ES throws a cluster block exception preventing any writes to the node. If ALL nodes are down or unreachable, indexing will stop. However, if only one or some nodes go down, indexing shouldn't completely stop, as replica shards on other nodes are promoted to primary if the node holding the original primary is offline. Ideally, some analysis and effort should go into right-sizing an ES cluster and putting monitoring in place to prevent such issues.
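As a rough illustration of the watermark levels, the sketch below classifies a node's disk usage against the default thresholds (low 85%, high 90%, flood stage 95%). The function name and return labels are made up, and treating the thresholds as inclusive is a simplification; the real values come from the `cluster.routing.allocation.disk.watermark.*` settings.

```python
# Default percentages; all three can be changed in the cluster settings.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0


def disk_state(used_percent, low=LOW, high=HIGH, flood=FLOOD):
    """Classify disk usage against the disk watermarks."""
    if used_percent >= flood:
        # Indices with a shard on this node get a read-only (allow delete) block.
        return "flood_stage"
    if used_percent >= high:
        # Elasticsearch tries to relocate shards away from this node.
        return "high"
    if used_percent >= low:
        # No new shards are allocated to this node.
        return "low"
    return "ok"


if __name__ == "__main__":
    for pct in (50, 86, 92, 97):
        print(pct, disk_state(pct))
```

Writes only start failing at the flood stage, so alerting on the low or high watermark leaves some time to free up space.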
I have manually allocated 3 primary shards to a particular node in ElasticSearch. The replicas of these shards reside in different nodes. Now, let's say, primary shard number 2 goes down (for example, due to overflow of data) without the node on which it is residing going down. Then is it possible to retrieve the data residing on that particular shard, after I manually re-allocate it to a different node? If yes, how?
Yes.
Once the node with primary shard number 2 goes down then the replica shard on the other node will be upgraded to a primary shard - allowing you to retrieve the data. See here:
Coping with failure (ES Definitive Guide)
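For the manual re-allocation mentioned in the question, the cluster reroute API (`POST /_cluster/reroute`) accepts a `move` command. Below is a minimal sketch that builds the request body. The function name, index name and node names are placeholders, and the request would still have to be sent with an HTTP client.

```python
import json


def move_shard_command(index, shard, from_node, to_node):
    """Build a cluster reroute body that moves one started shard copy between nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    body = move_shard_command("my-index", 2, "node-a", "node-b")
    print("POST /_cluster/reroute")
    print(json.dumps(body, indent=2))
```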