How does an etcd cluster handle node failure, and how many nodes can fail? - etcd

I have 7 nodes running an etcd cluster.
4 of them fail.
Will etcd stop working when a majority of nodes is down?

If 4 out of 7 nodes in an etcd cluster fail, the cluster will stop working because it has lost its majority: with only 3 members alive, it cannot reach the quorum of 4 (floor(7/2) + 1) needed to commit writes or elect a leader. Please refer to the following explanation about fault tolerance (source: https://coreos.com/etcd/docs/latest/admin_guide.html):
Fault Tolerance Table
It is recommended to have an odd number of members in a cluster. Having an odd cluster size doesn't change the number needed for majority, but you gain a higher tolerance for failure by adding the extra member. You can see this in practice when comparing even and odd sized clusters:
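As a rough illustration of the quorum arithmetic described above (not etcd's actual code, just the standard floor(n/2) + 1 majority rule), a short Python sketch:

def quorum(cluster_size: int) -> int:
    """Majority needed by a Raft-based cluster such as etcd: floor(n/2) + 1."""
    return cluster_size // 2 + 1

def fault_tolerance(cluster_size: int) -> int:
    """How many members may fail while the cluster stays available."""
    return cluster_size - quorum(cluster_size)

n, failed = 7, 4
alive = n - failed
print(f"size={n}, quorum={quorum(n)}, tolerated failures={fault_tolerance(n)}")
print("cluster available" if alive >= quorum(n) else "cluster unavailable")
# size=7, quorum=4, tolerated failures=3 -> with 4 failures only 3 remain: cluster unavailable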

Related

Why is it recommended to create clusters with an odd number of nodes?

There are several resources about distributed systems, like the MongoDB documentation, that recommend an odd number of nodes in a cluster.
What are the benefits of having an odd number of nodes?
Short answer: in the case of MongoDB, having an odd number of nodes increases your clustered system's availability (uptime).
Look at the table in the MongoDB documentation you linked:
+-------------------+------------------------------------------+-----------------+
| Number of Members | Majority Required to Elect a New Primary | Fault Tolerance |
+-------------------+------------------------------------------+-----------------+
|         3         |                    2                     |        1        |
+-------------------+------------------------------------------+-----------------+
|         4         |                    3                     |        1        |
+-------------------+------------------------------------------+-----------------+
|         5         |                    3                     |        2        |
+-------------------+------------------------------------------+-----------------+
|         6         |                    4                     |        2        |
+-------------------+------------------------------------------+-----------------+
Notice how, when you have an odd number of members and add one more (making the count even), your fault tolerance does not go up! (Meaning, your cluster cannot tolerate more failed members than it originally could.)
This is because MongoDB requires a majority of members to be up in order to elect a primary. This property is not specific to MongoDB; it applies to any clustered system that requires a majority of members to be up (for example, see also etcd).
Your system availability actually goes down when increasing to an even number of nodes because, although your fault tolerance remains the same, there are more nodes that can fail, so the probability of a fault occurring goes up.
In addition, having an even number of members decreases the probability that, if there is a network partition, some subset of your nodes will be able to continue running. For example, a 6-node cluster opens up the possibility that a network partition splits your nodes into two 3-node partitions. In that case, neither partition can communicate with a majority of members and your cluster becomes unavailable.
The counter-intuitive conclusion is that if you have an even-membered cluster, it is actually beneficial (from a high-availability standpoint) to remove one of the members.
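To make the availability argument concrete, here is a small Python sketch that computes the probability that a majority of nodes is up, assuming (purely for illustration) that each node is independently up 99% of the time:

from math import comb

def majority_up_probability(n: int, p: float) -> float:
    """Probability that at least floor(n/2) + 1 of n nodes are up,
    with each node independently up with probability p (illustrative assumption)."""
    q = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(q, n + 1))

for n in (3, 4, 5, 6):
    print(n, round(majority_up_probability(n, 0.99), 6))
# 3 0.999702
# 4 0.999408   <- adding a 4th node makes a working majority *less* likely
# 5 0.99999
# 6 0.99998    <- same effect when going from 5 to 6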
An odd number of nodes helps (though it is not strictly necessary) in electing a leader in a cluster. It is essential to avoid multiple leaders being elected, a condition known as the split-brain problem. Consensus algorithms use voting to elect the leader, i.e., they elect the node with the majority of votes.
Consider a cluster of 5 nodes: the minimum majority required is 3 (floor(5/2) + 1, or 2 + 2 + 1, with the extra vote as the tie-breaker).
It is important to note that a majority of the cluster's votes is required for leader election even under failure conditions.
Consider 1 out of 5 nodes failing: we can still elect a leader with a majority of 3 votes. But what if, among the 4 remaining nodes, two candidates each receive 2 votes? Resolving that contention is left to the consensus algorithm (for example, by simply restarting the election).
Say 2 out of 5 nodes fail: we can still elect a leader with a majority of 3 votes, i.e., when all 3 available nodes vote for the same node.
People often get confused about achieving a majority when one node of an odd-sized cluster fails, leaving an even number. It should be clear by now that a majority of the initial cluster size (preferably odd) is required to elect the leader.
We have seen how odd-numbered clusters help in the case of node failures.
Another point to add here is how this helps in the case of network partitions.
In the worst case, a network partition can split the cluster into exactly two equal halves, which cannot happen in an odd-numbered cluster.
As long as the surviving partition, or the number of operating nodes, is at least floor(n/2) + 1 (a majority with respect to the initial cluster size n), the cluster can continue to operate.
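A tiny Python sketch of that rule, checking which side of a partition keeps a majority (always measured against the initial cluster size):

def has_quorum(partition_size: int, initial_cluster_size: int) -> bool:
    # Majority is always computed against the *initial* cluster size.
    return partition_size >= initial_cluster_size // 2 + 1

# Worst-case partition of a 6-node cluster: 3 vs 3 -- neither side has quorum.
print([has_quorum(side, 6) for side in (3, 3)])   # [False, False]

# A 5-node cluster can only split unevenly, e.g. 3 vs 2 -- one side keeps quorum.
print([has_quorum(side, 5) for side in (3, 2)])   # [True, False]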
Short Answer: Higher Fault Tolerance.
This is a general principle that applies to many other clusters that use Raft-like leader election algorithms, such as the etcd cluster backing Kubernetes.
If a cluster uses Raft for leader election, it needs a majority of nodes, a quorum, to agree on a leader. For a cluster with n members, the quorum is floor(n/2) + 1.
In terms of fault tolerance, adding an additional node to an odd-sized cluster does not help and arguably hurts: the number of nodes that may fail without losing quorum stays the same, but there are more nodes that can fail, so the probability of losing quorum is actually higher than before.
For the fault tolerance numbers, please check the official etcd documentation for more information.

How to deal with split brain in a cluster that has two nodes?

I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (a.k.a. partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; Pacemaker's default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
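The quoted inequality is easy to check directly; here is a literal Python translation (not Pacemaker's actual code) covering the cases above:

def pacemaker_quorum(total_nodes: int, active_nodes: int) -> bool:
    # Quorum rule quoted above: more than half of all known nodes are online.
    return total_nodes < 2 * active_nodes

print(pacemaker_quorum(5, 3))  # True  -- the 3-node side of a 5-node split keeps quorum
print(pacemaker_quorum(6, 3))  # False -- each half of an even 6-node split lacks quorum
print(pacemaker_quorum(2, 1))  # False -- a lone node in a 2-node cluster never has quorum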
Questions:
From the above I got a bit confused: why can we not just stop all cluster resources, as in the 6-node case? What is special about the two-node cluster?
You are correct that a two node cluster can only have quorum when they are in communication. Thus if the cluster was to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
...But wait, now what happens if the cluster communication is broken but both nodes are still operational? Will they not consider their peers dead and both become the active node? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time and essentially prevent split brain from ever occurring.
An excellent article further explaining STONITH and its importance was written by LMB back in 2010 here: http://advogato.org/person/lmb/diary/105.html

Multiple datacenter replication and local quorum?

I created a cluster of 6 nodes:
3 nodes in EU west1 and 3 nodes in EU west2.
I set the locality for every group of nodes like: --locality=region=europe,datacenter=west1
I also set the replication factor to 6 to have all ranges and all data on every node.
What will happen if the connection between the data centers is lost? Does the whole cluster go down?
I tried to kill 3 nodes in one of the datacenters and the cluster is not operational, because the majority of the nodes are down and the quorum is less than 4.
Is it possible to make the 2 datacenters work with their own local quorum of 2/3?
I also played a bit with the replication settings, and sometimes the cluster is healthy if I kill 3 nodes out of 6 and I am able to write to the cluster. Sometimes I can only read from the cluster. The cluster keeps working with a replication factor of 5 and 3 nodes killed out of 6. I am still playing with this, but if someone can give me more information it will be very helpful.
Being able to replicate across datacenters is a very cool feature, but losing the whole cluster when one of the datacenters is down ruins the whole good idea, at least for me.
CockroachDB requires a majority of replicas to be fully operational, which means > half, not >= half. In order to survive the loss of a full datacenter or region, you must have three DCs/regions, not two. Try running two nodes in each of three regions instead of three nodes in two regions.
Is it possible to make the 2 datacenters work with their own local quorum of 2/3?
Not for a single table (because it would be impossible to guarantee consistency if each datacenter were able to act in isolation from the other). You've configured the data to be replicated across all six replicas, which means four replicas are required to make a quorum. If you want each datacenter to be able to operate independently of the other, you would need two separate tables, with each one configured to be located within one of the datacenters.
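The per-range arithmetic behind that answer can be sketched in a few lines of Python (the datacenter layouts below are hypothetical examples, not CockroachDB configuration):

def range_quorum(replication_factor: int) -> int:
    # A range needs a majority of its replicas: floor(RF/2) + 1.
    return replication_factor // 2 + 1

# Hypothetical placements: replicas of one range per datacenter.
layouts = {
    "2 DCs x 3 replicas (RF=6)": {"west1": 3, "west2": 3},
    "3 DCs x 2 replicas (RF=6)": {"west1": 2, "west2": 2, "west3": 2},
    "3 DCs x 1 replica  (RF=3)": {"west1": 1, "west2": 1, "west3": 1},
}

for name, placement in layouts.items():
    rf = sum(placement.values())
    survivors = rf - max(placement.values())  # lose the largest datacenter
    ok = survivors >= range_quorum(rf)
    print(f"{name}: quorum={range_quorum(rf)}, survives a DC outage: {ok}")
# Only the three-datacenter layouts keep a quorum when a whole DC goes down.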
Thanks for the answer, just to clear up a few things. It looks like you got my point and what I want to accomplish.
As far as I understand, if I have 2x3 nodes in 2 different DCs and one DC goes down, I have 3 live nodes, while for quorum I need at least 4 (N/2 + 1).
So if I have 3x3 I can lose one DC, because with 2 DCs live I will still have a quorum.
And one last question: if I don't set replication to 9 and I lose 3 nodes in one DC, some ranges will not be available, right?

DataStax Cassandra - Spanning cluster nodes across Amazon regions

I am planning to launch three EC2 instances across Amazon regions, say Region-A, Region-B and Region-C.
Based on the above plan, each region acts as a datacenter in the cluster and has one node (correct me if I am wrong).
Using this infrastructure, can I attain the configuration below?
Replication Factor: 2
Write and Read Level: QUORUM
My basic intention is to achieve the following: if two regions go down, I can survive with the remaining one region.
Please help me with your inputs.
Note: I am very new to Cassandra, hence whatever inputs you give will be useful for me.
Thanks
If you have a replication factor of 2 and use a consistency level of QUORUM, you will not tolerate failure: if a node goes down, you only get 1 ack, and that is not a majority of responses (a quorum of 2 replicas is 2).
If you deploy across multiple regions, each region is, as you mention, a DC in your cluster. Each individual DC is a complete replica of all your data, i.e. it will hold all the data for your keyspace. If you read/write at a LOCAL_* consistency level (e.g. LOCAL_ONE, LOCAL_QUORUM) within each region, then you can tolerate the loss of the other regions.
The number of replicas in each DC/region and the consistency level you are using to read/write in that DC will determine how much failure you can tolerate. If you are using QUORUM, this is a cross-DC consistency level: it requires a majority of acks from ALL replicas in your cluster across all DCs. If you lose 2 regions, it is unlikely that you will get a quorum of responses.
Also, it is worth remembering that Cassandra can be made aware of the AZs it is deployed on within the region and will do its best to ensure replicas of your data are placed in multiple AZs. This will give you even better tolerance to failure.
If this were me and I didn't need a strong cross-DC consistency level (like QUORUM), I would have 4 nodes in each region, deployed across each AZ, and then a replication factor of 3 in each region. I would then be reading/writing at LOCAL_QUORUM or LOCAL_ONE (preferably). If you go with LOCAL_ONE you could have fewer replicas in each DC, e.g. a replication factor of 2 with LOCAL_ONE means you could tolerate the loss of 1 replica.
However, this would be more expensive than what you're initially suggesting, but (for me) that would be the minimum setup I would need if I wanted to be in multiple regions and tolerate the loss of 2. You could go with 3 nodes in each region if you wanted to really save costs.
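For reference, a minimal sketch of reading at LOCAL_QUORUM with the DataStax Python driver (cassandra-driver); the contact point, keyspace and table names here are made up for illustration:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # a contact point in the local region
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# LOCAL_QUORUM only needs a majority of replicas in the *local* DC,
# so the request can succeed even if the other regions are unreachable.
query = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
rows = session.execute(query, ("some-user-id",))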

Cassandra: 6 node cluster, RF=2: What to do when 2 nodes crash?

Good Day
We have a 6-node Cassandra cluster with a replication factor of 3 on our keyspaces. Our applications make use of QUORUM, so we can survive the loss of a single node without it affecting the application.
Let's assume I lose 2 nodes at the same time. If my application were using a consistency level of ONE, it would have been fine and my application would have run without any issues, but we would like to keep the level at QUORUM.
My question is: if 2 nodes crash at the same time and I do a nodetool removenode for each of the crashed nodes, will the cluster then rebalance the data over the remaining 4 nodes (getting it back to 3 replicas), and if so, should my application then be able to work again using QUORUM?
In the title you write RF=2, in the text RF=3. You did not specify the Cassandra version or whether you are using single-token or vnodes. The QUORUM consistency level means, with RF=3, that 2 nodes must acknowledge a write/read before returning. It is possible that you face minimal or no issues even if 2 nodes die; it depends on how many common ranges (partitions) the nodes share.
Take a look at this distribution example, which is exactly like the one you describe: RF=3, 6 nodes.
Using single tokens:
If you lose pairs like (1,4), (2,5) or (3,6), your cluster should allow all writes and reads with no issues. A good client will recognize downed nodes and won't use them as coordinators anymore. Other situations, for example the loss of nodes (1,6), might lead to a situation in which any read/write of the F and E tokens will fail (assuming an equal distribution, about 33% of read/write operations will fail).
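A toy single-token model of this (not Cassandra's actual token allocation, and the exact ranges differ from the linked layout's labeling, but the pattern is the same): with RF=3 each range lives on its primary node and the next two nodes on the ring, and QUORUM needs 2 of those 3 replicas.

NODES = [1, 2, 3, 4, 5, 6]
RF, QUORUM = 3, 2

def replicas(primary_index: int) -> set:
    # Primary node plus the next RF-1 nodes clockwise on the ring.
    return {NODES[(primary_index + i) % len(NODES)] for i in range(RF)}

def ranges_losing_quorum(dead: set) -> list:
    return [NODES[i] for i in range(len(NODES))
            if len(replicas(i) - dead) < QUORUM]

print(ranges_losing_quorum({1, 4}))  # []     -> opposite nodes: every range keeps 2 replicas
print(ranges_losing_quorum({1, 6}))  # [5, 6] -> adjacent nodes: two ranges drop to 1 replica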
Using vnodes:
Here the situation is slightly different and also depends on which pairs you lose. If you repeat the worst scenario above and lose a pair of nodes like (1,6), only the B tokens will be affected in read/write operations, since that is the only token range shared between them.
That said, just to clarify the possible scenarios, here is your answer. nodetool removenode should be used as explained in this document. Use removenode IF AND ONLY IF you want to reduce the cluster size (see here for what to do if you want to replace a dead node instead). Once you have done that, your application will start working again using QUORUM, since other nodes will become responsible for the partitions previously assigned to the dead nodes.
If you are using the official DataStax Java Driver, you might want to let the driver temporarily fight your monsters by specifying a DowngradingConsistencyRetryPolicy.
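The answer refers to the Java driver; as a rough equivalent, the DataStax Python driver exposes an analogous (and by now deprecated) policy. A sketch only, with a made-up contact point and keyspace:

from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy

# On errors such as "not enough replicas alive for QUORUM", this policy retries
# the request at a lower consistency level instead of surfacing the failure.
cluster = Cluster(
    ["10.0.0.1"],                                         # hypothetical contact point
    default_retry_policy=DowngradingConsistencyRetryPolicy(),
)
session = cluster.connect("my_keyspace")                  # hypothetical keyspace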
HTH,
Carlo
