How does Elasticsearch recover from a quorum that is not unanimous

When using replication with a quorum, Elasticsearch allows writes to fail for some (a small number of) replica shards. Writing to a replica might fail only because it is temporarily unavailable (because of a temporary network partition, for example). When that shard becomes available again (the network is fixed, for example), what happens?
Does Elasticsearch automatically detect that the shard is out of date (stale, inconsistent with the primary shard) and update it in the background? Or must you perform a manual operation? When the shard returns from being unavailable, but is out of date, does Elasticsearch automatically refrain from querying that shard (and retrieving stale data) until it is brought up to date? Or must you provide special query parameters to ensure that out-of-date shards are not used?

Elasticsearch automatically manages replicas that are out of date. No manual operation or special query parameters are necessary.
In case of node or network failures you have to ensure that a quorum of the cluster remains online, otherwise you will encounter the split-brain problem, in which you cannot know which replica is up to date and which is stale.

Be careful: quorum is generally associated with the election of the one master node out of all master-eligible nodes. That master maintains the cluster state, which keeps track of the one primary shard (plus 0 or more replica shards); there is no quorum involved for this.
The replication protocol has been improved a lot in 6.0 with sequence numbers and primary terms. A good overview is the blog post about it. Basically all operations are numbered (per shard), so missing operations can be detected and replayed; see the recovery part in the blog post in particular.
With failing primary shards it can get a little more interesting; a great post with more details is available on Elastic's Discuss forum.
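To make the idea of numbered operations concrete, here is a simplified, illustrative sketch of how per-shard sequence numbers let a primary detect and replay the operations a returning replica missed. The class and names are made up; this is not Elasticsearch's implementation:

```python
# Illustrative sketch only, not Elasticsearch code: per-shard sequence numbers
# let the primary detect which operations a returning replica is missing.

class ShardCopy:
    def __init__(self):
        self.ops = {}                # seq_no -> operation payload
        self.local_checkpoint = -1   # highest seq_no up to which all ops are present

    def apply(self, seq_no, op):
        self.ops[seq_no] = op
        # advance the checkpoint over any contiguous run of applied operations
        while self.local_checkpoint + 1 in self.ops:
            self.local_checkpoint += 1

primary, replica = ShardCopy(), ShardCopy()

# The primary indexes operations 0..4; the replica misses 3 and 4 while offline.
for seq_no in range(5):
    primary.apply(seq_no, f"op-{seq_no}")
    if seq_no < 3:
        replica.apply(seq_no, f"op-{seq_no}")

# When the replica comes back, the primary replays everything above the
# replica's local checkpoint instead of copying whole files.
for seq_no in sorted(s for s in primary.ops if s > replica.local_checkpoint):
    replica.apply(seq_no, primary.ops[seq_no])

print(replica.local_checkpoint)  # 4 -> the replica has caught up
```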

Related

How does elasticsearch prevent cascading failure after node outage due to disk pressure?

We operate an elasticsearch stack which uses 3 nodes to store log data. Our current config is to have indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided not to spend time on optimization yet.)
After a node outage (let's assume a full disk), I have observed that elasticsearch automatically redistributes its shards to the remaining instances - as advertised.
However this increases disk usage on the remaining two instances, making it a candidate for cascading failure.
Durability of the log data is not paramount. I am therefore thinking about reconfiguring elasticsearch to not create a new replica after a node outage. Instead, it could just run on the primaries only. This means that after a single node outage, we would run without redundancy. But that seems better than a cascading failure. (This is a one time cost)
An alternative would be to just increase disk size. (This is an ongoing cost)
My question
(How) can I configure elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea and the canonical way is to just increase disk capacity?
Rebalancing is expensive
When a node leaves the cluster, some additional load is generated on the remaining nodes:
Promoting a replica shard to primary to replace any primaries that were on the node.
Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
Rebalancing shards evenly across the remaining nodes.
This can lead to quite some data being moved around.
Sometimes, a node is only missing for a short period of time, and a full rebalance is not justified in such a case. To account for that, when a node goes down, Elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but then waits for one minute before creating new replicas, to avoid unnecessary copying.
Only rebalance when required
The duration of this delay is a tradeoff and can therefore be configured. Waiting longer means less chance of useless copying, but also a longer window of reduced redundancy and therefore a higher chance of downtime if another node fails in the meantime.
Increasing the delay to a few hours results in what I am looking for. It gives our engineers some time to react, before a cascading failure can be created from the additional rebalancing load.
I learned that from the official elasticsearch documentation.
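For reference, the delay discussed above is the per-index setting index.unassigned.node_left.delayed_timeout. A minimal sketch with the official Python client, assuming a placeholder endpoint (older clients pass the settings via body=, newer ones also accept a settings= keyword):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Wait 5 hours after a node leaves before allocating new replicas elsewhere,
# so a short outage or planned restart does not trigger a round of shard copying.
es.indices.put_settings(
    index="_all",  # or a specific index / index pattern
    body={"settings": {"index.unassigned.node_left.delayed_timeout": "5h"}},
)
```

The default for this timeout is 1m; lowering it again restores prompt re-replication after a node leaves.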

elasticsearch cluster without replica

I want to create an Elasticsearch cluster without replicas, because the cluster is created on VM infrastructure (VMInfra) that is fully redundant and the disks are configured in RAID.
Is that fine?
Also, what would be the advantage of creating a cluster without replicas?
Replicas increase availability. Let's say you have a data node that's overloaded or down: if you have another node with a replica of the same shard on it, your data will still be available. I'm not familiar with VMInfra, so I'm not sure whether "fully redundant" means you can get the same level of availability in those cases.
For most production uses you'll want to have at least one replica for each shard - meaning, the primary is on one node, and the replica is on another, and therefore at least two separate data nodes.
Having more replicas might make sense as load increases. Replicas have costs in memory and disk space - you're essentially storing the data several times.
Having no replica shards at all is a bad trade-off, even in the optimistic view. Replicas provide redundant copies of your data to protect against hardware failure, and they also increase the throughput of your queries, because queries are executed on all replicas in parallel without interrupting each other, so they improve search performance as well. On the other hand, having too many shards adds overhead to your system.
If your virtual server environment is well configured, I recommend always having at least one replica shard, placed on different hard drives. You can find more on the pros and cons here, and an older discussion of how replica shards work in this Stack Overflow chat.
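To make this concrete: the replica count is a per-index setting, so "a cluster without replicas" really means creating indices with number_of_replicas set to 0. A minimal sketch with the official Python client (endpoint and index name are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# No redundancy inside Elasticsearch itself: if the node holding a primary is
# lost, that shard's data is unavailable until the node (or RAID volume) returns.
es.indices.create(
    index="logs-without-replicas",  # hypothetical index name
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 0}},
)

# The replica count can be raised later without reindexing, e.g. back to 1:
es.indices.put_settings(
    index="logs-without-replicas",
    body={"index": {"number_of_replicas": 1}},
)
```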

Elasticsearch and CAP Theorem

Elasticsearch is a distributed system. As per the CAP theorem, it can satisfy any 2 out of 3 properties. Which one is compromised in Elasticsearch?
The answer is not so straightforward. It depends upon how the system is configured and how you want to use it. I will try to go into the details.
Partitioning in Elasticsearch
Each index is partitioned into shards, meaning the data in each shard is mutually exclusive of the other shards. Each shard is itself a Lucene index made up of multiple segments, which are not in the scope of this answer.
Each shard can have a replica running (most setups do), and in the event of a failure the replica can be promoted to primary. Let's call a shard whose primary is working and reachable from the ES node our application server is hitting an active shard. A shard with no reachable primary copy is considered a failed shard. (E.g. an error saying "all shards failed" means no primaries in that index are available.)
ES can end up with multiple primaries (divergent shards). That is not a good situation, as we lose both read and write consistency.
In the event of a network partition, what will happen to:
Reads:
By default, reads will continue to happen on shards that are active. Thus, the data from failed shards will be excluded from our search queries. In this context, we consider the system to be AP. However, the situation is temporary and does not require manual effort to synchronize the shards when the cluster is connected again.
By setting the search option allow_partial_search_results [1] to false, we can force the system to return an error when some of the shards have failed, guaranteeing consistent results. In this context, we consider the system to be CP. (A code sketch of this option and the write-side option follows the Writes section below.)
In case no primaries are reachable from the node(s) that our application server is connecting to, the system will completely fail. Even if we say that our partition tolerance has failed, we also see that availability has taken a hit. This situation could be called just C, or CP.
There can be cases where the team has to bring the shards up at any cost, and out-of-sync replica(s) are reachable, so they decide to promote one to primary manually. Note that there can be some unsynced data, resulting in divergent shards. This results in an AP situation, and consistency will be hard to restore when the situation normalizes (the shards have to be synchronized manually).
Writes
Writes stop working only if all shards fail. Even if just one shard is active, writes will work and are consistent (by default). This is CP.
However, we can set the option wait_for_active_shards [2] to all to ensure writes only happen when all shard copies in the index are active. I see only a small advantage in this flag, which would be to keep all shard copies up to date at any cost. This is still CP (but with less availability than the previous case).
As in the last case under Reads, if we manually promote unsynced replicas to primary, there can be some data loss and divergent shards. The situation is AP here, and consistency will be hard to restore when the situation normalizes (the shards have to be synchronized manually).
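As a rough illustration of the two options referenced above ([1] and [2]), here is a sketch with the official Python client; the endpoint, index name and document are placeholders, and keyword arguments vary slightly between client versions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Reads: fail the whole search instead of silently returning partial results
# when some shards are unavailable (trading availability for consistency).
resp = es.search(
    index="my-index",
    body={"query": {"match_all": {}}},
    allow_partial_search_results=False,
)

# Writes: only accept the indexing operation if every shard copy
# (the primary plus all replicas) is currently active.
es.index(
    index="my-index",
    body={"message": "example document"},
    wait_for_active_shards="all",
)
```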
Based on the above, you can make a more informed decision and tweak ElasticSearch according to your requirements.
References:
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-wait-for-active-shards
I strongly disagree with Harshit: Elasticsearch compromises on availability, since, as he himself mentions, some requests are returned as errors due to the unavailability of shards.
ES guarantees consistency, as data reads and writes are always consistent. ES guarantees partition tolerance: if a node that was partitioned joins back to the cluster after some time, it is able to recover the missed data up to the current state.
Moreover, there is no distributed system that gives up on partition tolerance, because without a guarantee of partition tolerance a distributed system cannot exist.
CAP theorem states that a distributed system can have at most two of the following:
Consistency.
Availability.
Partition Tolerance.
Elasticsearch gives up on "Partition Tolerance"
Reason: if a node fails, the cluster health will turn red and it will not proceed to operate on the newly created index.
It will not give up on "Availability", because every Elasticsearch query returns a response from the cluster, either success (results) or failure (an error).
It will not give up on "Consistency" either; if it gave up on consistency, there would be no document versioning and no index recovery.
You can read more here: https://discuss.elastic.co/t/elasticsearch-and-the-cap-theorem/15102/8

Elasticsearch - How node detects shard failure

I have basic knowledge of Elasticsearch. I came across the following phrase, from https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html
In the case that the primary itself fails, the node hosting the primary will send a message to the master about it. The indexing operation will wait (up to 1 minute, by default) for the master to promote one of the replicas to be a new primary.
The question: how does the node hosting the shard know about the failure of the shard? As I understand it, a shard is a Lucene instance that runs on a data node.
Most likely (with some improvements since Elasticsearch version 1.4), this is detected via checksums: if any segment file within the shard has an incorrect checksum, the shard is marked corrupt.
This may happen on recovery (after a node starts up) or when any IO operation is done on the segment (i.e. when it is read by a search or by the merge policy).
This page for 7.8 (select the version you use for accurate documentation) mentions how to discard the corrupt data or, if the data is important, that the best way is to restore it from a snapshot:
https://www.elastic.co/guide/en/elasticsearch/reference/7.8/shard-tool.html#_description_7
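To make the checksum idea concrete, here is a toy sketch; it is not Lucene's actual file format, it only illustrates "store a checksum with the data, verify it on read, and flag the shard as corrupt on a mismatch":

```python
import zlib

def write_with_checksum(path, data: bytes):
    # Append a CRC32 "footer" to the payload, similar in spirit to the
    # checksums Lucene stores at the end of segment files.
    with open(path, "wb") as f:
        f.write(data + zlib.crc32(data).to_bytes(4, "big"))

def read_and_verify(path) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    data, stored = blob[:-4], blob[-4:]
    if zlib.crc32(data).to_bytes(4, "big") != stored:
        # In Elasticsearch this is roughly the point where the shard
        # would be marked corrupt and reported to the master.
        raise IOError(f"checksum mismatch in {path}")
    return data

write_with_checksum("/tmp/segment.bin", b"some indexed data")
read_and_verify("/tmp/segment.bin")  # passes; a flipped byte on disk would raise
```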
I guess you are getting confused by this statement:
How does the node hosting the shard know about the failure of the shard? As I understand it, a shard is a Lucene instance that runs on a data node.
While it is true that every shard is a Lucene instance (index), it is not a 1:1 mapping: one Elasticsearch data node can host multiple shards, not just one, and the failure of a Lucene shard doesn't always mean the failure of the data node.
The node holding the primary shard knows whether it is connected to the network, whether it is able to index data, and whether the shard is corrupted (as mentioned by #julian), and it can then send that information to the master node, which promotes another replica to primary; this is recorded in the cluster state, which all nodes hold.
In the network-failure case, all the primary shards hosted on that node will be replaced by other shard copies, and this is easy to detect, as the master will not receive a heartbeat from that data node.
Hope the bold part of my answer is what you were looking for; otherwise feel free to comment and I will try to explain further.
It's confusing at first sight, but if you look deeper it is still a valid scenario, and the same is mentioned in the documentation at a high level.
Let's say a coordinating node receives a request to index data. The master node maintains the list of in-sync shard copies. The request is then forwarded to the node which holds the primary shard. As you mentioned, a shard is a Lucene core. The node which received the request has to index the data into the primary shard. In case this is not possible, for example because a portion of the shard is corrupted, it will inform the master to elect another primary.
The master also monitors each shard and, if needed, informs another node to prepare a primary shard, or demotes a shard from primary. The master does more in these cases.
Elasticsearch maintains a list of shard copies that should receive the operation. This list is called the in-sync copies and is maintained by the master node
Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group
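All of this happens without user intervention, but you can watch the outcome from a client. A small sketch with the official Python client (placeholder endpoint) that inspects cluster health and per-shard state:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Overall health: green (all copies allocated), yellow (some replicas missing),
# red (at least one primary shard is missing).
print(es.cluster.health()["status"])

# Per-shard view: which copies are STARTED, UNASSIGNED, RELOCATING, and where.
# After a node failure you can see former replicas show up as primaries ("p").
for shard in es.cat.shards(format="json", h="index,shard,prirep,state,node"):
    print(shard)
```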

Elasticsearch indices recovery

I'm learning how Elasticsearch (version 5.3.0) works in order to try and use it. I've read the documentation, the Elasticsearch Reference and some ES blog posts too, but I couldn't find how index (shard?) recovery works.
Let's assume a node A turns off and then becomes active again. If the cluster didn't stop its activity and some documents were indexed, how are those changes synchronized with node A? Does ES replace all files, or is there a mechanism to communicate only the changes to that node?
References and documentation are welcomed.
Thank you in advance for the responses.
These days Elasticsearch does a diff between the segments (files) in the primary shard and the ones in the replica shard. Whatever is different is copied over fresh from the primary.
In the future though (ES 6), there will be sequence IDs: https://github.com/elastic/elasticsearch/issues/10708
The advantage of having these is that ES will make a first attempt to compare the sequence IDs from the primary and replica and see how "far" apart they are. If the translog on the primary shard still has all the changes since the replica went offline, ES will simply replay the operations from the primary shard's translog on the replica shard. If not all the operations are there anymore, it will fall back to the segment diffing (the current approach).
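A conceptual sketch of the segment-diff (file-based) recovery described above; the data structures are made up for illustration and this is not Elasticsearch's implementation:

```python
import hashlib

def fingerprint(files):
    # files: name -> bytes; identify each segment file by its length and hash.
    return {name: (len(data), hashlib.sha1(data).hexdigest())
            for name, data in files.items()}

def recover_replica(primary_files, replica_files):
    """Copy from the primary only the segment files the replica lacks or that
    differ; identical files are kept as-is instead of being re-transferred."""
    p_meta = fingerprint(primary_files)
    r_meta = fingerprint(replica_files)
    recovered = {}
    for name in primary_files:
        if r_meta.get(name) == p_meta[name]:
            recovered[name] = replica_files[name]   # reuse the local copy
        else:
            recovered[name] = primary_files[name]   # "copied over the network"
    return recovered

primary = {"_0.cfs": b"old segment", "_1.cfs": b"segment written while node A was down"}
replica = {"_0.cfs": b"old segment"}                # node A missed _1.cfs
print(sorted(recover_replica(primary, replica)))    # ['_0.cfs', '_1.cfs']
```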
