Elasticsearch - How node detects shard failure - elasticsearch

I had basic knowledge about elastic search.I come across the following phrase . From https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html
In the case that the primary itself fails, the node hosting the primary will send a message to the master about it. The indexing operation will wait (up to 1 minute, by default) for the master to promote one of the replicas to be a new primary.
The question, How node hosting the shard knows about the failure of the shard ? As I understand , shard is a lucene instance that runs on a data node.

Most likely (with some improvements since elasticsearch version 1.4), this would be detected via checksum if any segment file within the shard has incorrect checksum, then the shard is marked corrupt.
This may happen on recovery (after node starts up) or when any IO operation is done on the segment (ie when it is read by searching or via the merge policy)
Potentially, this page for 7.8 (select the version you use for accurate doc) mentions how to dismiss corrupt data or if data is important best way is to restore from snapshot :
https://www.elastic.co/guide/en/elasticsearch/reference/7.8/shard-tool.html#_description_7

I guess, you are getting confused in this statement
How node hosting the shard knows about the failure of the shard ? As I
understand , shard is a lucene instance that runs on a data node.
while its true that every shard is a Lucene instance(index) but its not a 1:1 mapping and 1 data node of elasticsearch can host multiple shards not just 1 shard and failure of Lucene shard doesn't always mean the failure of data node.
Node holding the primary shard knows if its connected to network, whether its able to index the data or not or shard is corrupted or not as mentioned by #julian and then it can send that information to master node, which then promote other replicas to primary which is contained in cluster state which all nodes holds.
In network failure case, all the primary shards hosted on the nodes will be replaced by other shards and it's easy to detect as master will not a heart beat from that data node.
Hope bold part of my answer is what you were looking for, otherwise feel free to comment and would try to explain further.

It's confusing at first sight. But if you look deeper, it is still a valid scenario and same mentioned in the document at high level.
Let's, say coordinator node receives a request to index the data. Master node maintains list of in-sync shards. Then master forwards the request to the node which has the primary shard. As you mentioned, shard is a Lucene core. The node which received has to index the data in the primary shard. Incase if it is not possible due to the portion of shard corrupted or so, then it will inform the master to elect another primary.
And master also monitors each shards and informs the other node to prepare a primary shard if needed. Demotes a shard from primary if needed. Master does more in this cases.
Elasticsearch maintains a list of shard copies that should receive the operation. This list is called the in-sync copies and is maintained by the master node
Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group

Related

Elasticsearch and CAP Theorem

Elasticsearch is a distributed system. As per the CAP theorem, it can satisfy any 2 out of 3 properties. Which one is compromised in Elasticsearch?
The answer is not so straightforward. It depends upon how the system is configured and how you want to use it. I will try to go into the details.
Paritioning in ElasticSearch
Each index is partitioned in shards, meaning data in each shard is mutually exclusive to other shards. Each shard further has multiple Lucence indices, which are not in the scope of this answer.
Each shard can have a replica running (most setups have) and in an event of a failure, the replica can be promoted to a master. Let's call a shard that has a primary working and is reachable from the ES node that our application server is hitting as an Active shard. Thus, a shard with no copies in primary and is not reachable is considered as failed shard. (Eg: An error saying "all shards failed" means no primaries in that index are available)
ES has a feature to have multiple primaries (divergent shards). It is not a good situation as we lose both read/write consistencies.
In an event of a network partition, what will happen to:
Reads:
By default reads will continue to happen on shards that are active. Thus, the data from failed shards will be excluded from our search queries. In this context, we consider the system to be AP. However, the situation is temporary and does not require manual effort to synchronize shard when the cluster is connected again.
By setting a search option allow_partial_search_results [1] to false, we can force the system to error when some of the shards have failed, guaranteeing consistent results. In this context, we consider the system to be CP.
In case no primaries are reachable from the node(s) that our application server is connecting to, the system will completely fail. Even if we say that our partition tolerance has failed, we also see that availability has taken a hit. This situation can be called be just C or CP
There can be cases where the team has to anyhow bring up the shards and their out of sync replica(s) were reachable. So they decide to make it a primary (manually). Note that there can be some un-synced data resulting in divergent shards. This results in the AP situation. Consistency will be hard to restore when the situation normalizes (sync shards manually)
Writes
Only if all shards fail, writes will stop working. But even if one shard is active writes will work and are consistent (by default). This will be CP
However, we can set option index-wait-for-active-shards [2] as all to ensure writes only happen when all shards in the index are active. I only see a little advantage of the flag, which would be to keep all shards balanced at any cost. This will be still CP (but lesser availability than the previous case)
Like in the last read network partition case, if we make un-synced replicas as primary (manually) there can be some data loss and divergent shards. The situation will be AP here and consistency will be hard to restore when the situation normalizes (sync shards manually)
Based on the above, you can make a more informed decision and tweak ElasticSearch according to your requirements.
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-wait-for-active-shards
I strongly disagree with Harshit, Elasticsearch compromises on availability as he also mentioned few requests are returned error due to unavailability of shards.
ES guarantees consistency - as data read/write are always consistent. guarantees ES gaurantees Partition tolerance - if any node which was partitioned, joined back to the cluster after some time, it is able to recover the missed data to the current state.
Moreover, there is no distributed system that gives up on Partition Tolerance, cause without a guaranty of PT distributed system can't exist.
CAP theorem states that a distributed system can have at most two of the following:
Consistency.
Availability.
Partition Tolerance.
Elasticsearch gives up on "Partition Tolerance"
Reason: It means that if the creation of the node fails, the cluster health will turn red and it will not proceed to operate on the newly created index.
It will not give up on "Availability" because every Elasticsearch query will be returning a response from cluster either true (results) / false (error).
It will not give up on "Consistency" either. If it gives up on consistency then there will not be any document versioning and no index recovery.
You read more here: https://discuss.elastic.co/t/elasticsearch-and-the-cap-theorem/15102/8

Does Elasticsearch stop indexing data when some nodes go down?

I have read that when a new indexing request is sent to ES cluster. ES will specify which shard should that document be stored in depending on routing. Then that node which hosts that primary shard (aka coordinating node) will broadcast the indexing request to each node containing a replica for that shard and it will respond to the client that the document has been indexed successfully if the primary shard and it's replicas stored/indexed that document.
Does that mean that ES supports high availability(node tolerant) for reading requests and not for writing request or it's the default behavior and can be changed?
The main purpose of replicas is for failover, if the node holding a primary shard dies, a replica is promoted to the role of primary. Also, replica shards can serve read requests thus improving search performance.
For write requests though, indexing will be affected if one of your nodes (which has the primary shard for a live index) in the cluster suddenly runs out of disk space because if a node disk usage hits configured watermark levels then ES throws a cluster block exception preventing any writes to the node. If ALL nodes are down/ unreachable indexing will stop however if only one or some nodes go down, indexing shouldn't completely stop as replica shards on other nodes are promoted to primary if the node holding the original primary is offline. Ideally, it goes without saying that some analysis and effort should go to right size an ES cluster and have monitoring in place to prevent any issues.

How does Elasticsearch recover from a quorum that is not unanimous

When using replication with a quorum, Elasticsearch allows writes to fail for some (a small number of) replica shards. Writing to a replica might fail only because it is temporarily unavailable (because of a temporary network partition, for example). When that shard becomes available again (the network is fixed, for example), what happens?
Does Elasticsearch automatically detect that the shard is out of date (stale, inconsistent with the primary shard) and update it in the background? Or must you perform a manual operation? When the shard returns from being unavailable, but is out of date, does Elasticsearch automatically refrain from querying that shard (and retrieving stale data) until it is brought up to date? Or must you provide special query parameters it ensure that out-of-date shards are not used?
Elasticsearch manages automatically the replica that it are out of date. No manual operation or special query are necessary.
In case of nodes/network failure you have to ensure that a quorum of the cluster remain online, otherwise you will encounter the split brain problem in which you cannot known which of the replica is in line and which is out of date.
Be careful: Quorum is generally associated with the election of the one master node out of all master eligible nodes. That master maintains the cluster state, which keeps track of the one primary shard (plus 0 or more replica shards) — there is no quorum involved for this.
The replication protocol has been improved a lot in 6.0 with sequence numbers and primary terms. A good overview is the blog post about it. Basically all operations are numbered (per shard), so missing operations can be detected and replayed; see the recovery part in the blog post in particular.
With failing primary shards it can get a little more interesting; one great post about more details is available on Elastic's discuss.

Elasticsearch indices recovery

I'm learning how Elasticsearch (version 5.3.0) works in order to try and use it. I've read documentation, Elasticsearch Reference and some ES blog posts too but I couldn't find how indices (shards?) recovery works.
Let assume a node A turn off and, then, become active again. If the cluster didn't stop its activity and some documents were indexed, how are those changes synchronized with the node A? Does ES replace all files or there is a mechanism to communicate only changes to that node?
References and documentation are welcomed.
Thank you in advance for the responses.
These days Elasticsearch is doing a diff between the segments (files) in primary shard and the ones in the replica shard. What is different is copied over new from the primary.
In future though (ES 6), there will be sequence IDs: https://github.com/elastic/elasticsearch/issues/10708
The advantage of having these is that ES will make a first attempt to compare the sequence IDs from the primary and replica and see how "far" they are apart. If the translog from the primary shard still has all the changes since the replica went offline, ES will simply replay the operations in the primary shard translog on the replica shard. If not all the operations are there anymore, then it will get back to the segments diffing (the current approach).

If a shard goes down, then, after re-allocating that shard, will data residing in that shard be retrievable

I have manually allocated 3 primary shards to a particular node in ElasticSearch. The replicas of these shards reside in different nodes. Now, let's say, primary shard number 2 goes down (for example, due to overflow of data) without the node on which it is residing going down. Then is it possible to retrieve the data residing on that particular shard, after I manually re-allocate it to a different node? If yes, how?
Yes.
Once the node with primary shard number 2 goes down then the replica shard on the other node will be upgraded to a primary shard - allowing you to retrieve the data. See here:
Coping with failure (ES Definitive Guide)

Resources