Why is it recommended to create clusters with an odd number of nodes?

There are several resources about distributed systems, like the MongoDB documentation, that recommend an odd number of nodes in a cluster.
What are the benefits of having an odd number of nodes?

Short answer: in the case of MongoDB, having an odd number of nodes increases your clustered system's availability (uptime).
Look at the table in the MongoDB documentation you linked:
+-------------------+------------------------------------------+-----------------+
| Number of Members | Majority Required to Elect a New Primary | Fault Tolerance |
+-------------------+------------------------------------------+-----------------+
| 3 | 2 | 1 |
+-------------------+------------------------------------------+-----------------+
| 4 | 3 | 1 |
+-------------------+------------------------------------------+-----------------+
| 5 | 3 | 2 |
+-------------------+------------------------------------------+-----------------+
| 6 | 4 | 2 |
+-------------------+------------------------------------------+-----------------+
Notice how, when you have an odd number of members and add one more (making the count even), your fault tolerance does not go up! (Meaning your cluster cannot tolerate more failed members than it originally could.)
This is because MongoDB requires a majority of members to be up in order to elect a primary. This property is not specific to MongoDB; it applies to any clustered system that requires a majority of members to be up (for example, see also etcd).
Your system availability actually goes down when increasing to an even number of nodes because, although your fault tolerance remains the same, there are more nodes that can fail so the probability of a fault occurring goes up.
In addition, having an even number of members decreases the probability that, if there is a network partition, some subset of your nodes will be able to continue running. For example, a 6-node cluster opens up the possibility that a network partition splits your nodes into two 3-node halves. In that case neither half can communicate with a majority of members, and your cluster becomes unavailable.
The counter-intuitive conclusion is that if you have an even-membered cluster, it is actually beneficial (from a high-availability standpoint) to remove one of the members.
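To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not taken from the MongoDB documentation) that reproduces the table above: for any majority-based system, the required majority is floor(n/2) + 1 and the fault tolerance is n minus that majority, so going from an odd n to n + 1 never buys extra fault tolerance.

def quorum(n: int) -> int:
    """Smallest number of members that constitutes a strict majority of n."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """How many members can fail while a majority can still be reached."""
    return n - quorum(n)

for members in range(3, 8):
    print(f"{members} members: majority={quorum(members)}, "
          f"fault tolerance={fault_tolerance(members)}")

# 3 members: majority=2, fault tolerance=1
# 4 members: majority=3, fault tolerance=1
# 5 members: majority=3, fault tolerance=2
# 6 members: majority=4, fault tolerance=2
# 7 members: majority=4, fault tolerance=3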

An odd number of nodes helps in electing a leader in a cluster, though it is not strictly necessary. It is essential to avoid multiple leaders being elected, a condition known as the split-brain problem. Consensus algorithms use voting to elect the leader, i.e., they elect the node that receives a majority of the votes.
Consider a cluster of 5 nodes. The minimum majority required is 3 (floor(5/2) + 1; think of it as 2 + 2 with 1 extra vote as the tie-breaker).
It is important to note that a majority of the cluster's votes is required for leader election even under failure conditions.
Consider 1 of the 5 nodes failing. We can still elect a leader with a majority of 3 votes. But what if, among the 4 remaining nodes, two candidates each receive 2 votes? Neither has a majority, so it is left to the consensus algorithm to resolve the contention (for example, by simply re-initiating the election).
Let's say 2 of the 5 nodes fail. We can still elect a leader with a majority of 3 votes, i.e., when all 3 available nodes vote for the same candidate.
People are often confused about achieving a majority when one node of the odd-sized cluster fails, leaving an even number. It should be clear by now that a majority of the initial cluster size (preferably odd) is required to elect the leader.
We have seen how odd-numbered clusters help in the case of node failures.
Another point to add here is how this helps in the case of network partitions.
In the worst case, a network partition can split the cluster into exactly two equal halves, which cannot happen in an odd-numbered cluster.
As long as the number of operating nodes in a partition is greater than or equal to floor(n/2) + 1, so that a majority with respect to the initial cluster size n can still be reached, the cluster can continue to operate.
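As a rough illustration of the partition argument (my own sketch, using the floor(n/2) + 1 rule above): in the worst two-way split, an even cluster can be divided exactly in half so that neither side retains a majority of the original size, while an odd cluster always leaves one side with a majority.

def quorum(n: int) -> int:
    return n // 2 + 1

def worst_case_split(n: int):
    """Split the cluster as evenly as a two-way partition possibly can."""
    return n // 2, n - n // 2

for n in (4, 5, 6, 7):
    a, b = worst_case_split(n)
    survivors = [side for side in (a, b) if side >= quorum(n)]
    status = "one side keeps quorum" if survivors else "NO side keeps quorum"
    print(f"n={n}: worst split {a}/{b} -> {status}")

# n=4: worst split 2/2 -> NO side keeps quorum
# n=5: worst split 2/3 -> one side keeps quorum
# n=6: worst split 3/3 -> NO side keeps quorum
# n=7: worst split 3/4 -> one side keeps quorum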

Short Answer: Higher Fault Tolerance.
This is a general principle that applies to many other clusters that use Raft-like leader election algorithms, such as the etcd cluster behind Kubernetes.
If a cluster uses Raft for leader election, it needs a majority of nodes, a quorum, to agree on a leader. For a cluster with n members, the quorum is floor(n/2) + 1.
In terms of fault tolerance, adding a node to an odd-sized cluster decreases the fault tolerance. How? We still have the same number of nodes that may fail without losing quorum, but there are now more nodes that can fail, which means the probability of losing quorum is actually higher than before.
For more on fault tolerance, please check the official etcd documentation.
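To make the availability claim concrete, here is a rough sketch that assumes, purely for illustration, that nodes fail independently with the same probability p (that assumption is mine, not part of the answer above). Under it, a 4-node cluster is roughly twice as likely to lose quorum as a 3-node one, even though both tolerate only a single failure.

# Assumption for illustration only: independent failures, each node down with probability p.
# Quorum is lost when more nodes are down than the cluster's fault tolerance.
from math import comb

def p_quorum_lost(n: int, p: float) -> float:
    tolerance = n - (n // 2 + 1)          # failures the cluster can absorb
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(tolerance + 1, n + 1))

p = 0.01  # hypothetical per-node failure probability
for n in (3, 4, 5, 6):
    print(f"n={n}: P(quorum lost) ~ {p_quorum_lost(n, p):.2e}")

# n=3: ~2.98e-04   n=4: ~5.92e-04 (higher, same tolerance)
# n=5: ~9.85e-06   n=6: ~1.96e-05 (higher, same tolerance)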

Related

Multiple datacenter replication and local quorum?

I created a cluster of 6 nodes:
3 nodes in EU west1 and 3 nodes in EU west2.
I set the locality for each group of nodes like: --locality=region=europe,datacenter=west1
I also set the replication factor to 6 so that all ranges and all data are on every node.
What happens if the connection between the datacenters is lost? Does the whole cluster go down?
I tried killing the 3 nodes in one of the datacenters and the cluster is not operational, because only 3 nodes remain, which is less than the quorum of 4.
Is it possible to make the 2 datacenters work with their own local quorum of 2/3?
I also played a bit with the replication settings. Sometimes the cluster is healthy if I kill 3 nodes out of 6 and I am able to write to it; sometimes I can only read from it. The cluster keeps working with a replication factor of 5 and 3 of 6 nodes killed. I am still playing with this, but if someone can give me more information it would be very helpful.
Being able to replicate across datacenters is a very cool feature, but losing the whole cluster when one of the datacenters goes down ruins the whole good idea, at least for me.
CockroachDB requires a majority of replicas to be fully operational, which means > half, not >= half. In order to survive the loss of a full datacenter or region, you must have three DCs/regions, not two. Try running two nodes in each of three regions instead of three nodes in two regions.
Is it possible to make the 2 datacenters work with their own local quorum of 2/3?
Not for a single table (because it would be impossible to guarantee consistency if each datacenter were able to act in isolation from the other). You've configured the data to be replicated across all six replicas, which means four replicas are required to make a quorum. If you want each datacenter to be able to operate independently of the other, you would need two separate tables, with each one configured to be located within one of the datacenters.
Thanks for the answer, just to clear up a few things. It looks like you got my point and what I want to accomplish.
As far as I understand, if I have 2x3 nodes in 2 different DCs and one DC goes down, I have 3 live nodes, but for quorum I need at least 4 (N/2 + 1).
So if I have 3x3 I can lose one DC, because with 2 DCs still live I will have a quorum.
And one last question: if I don't set the replication factor to 9 and I lose 3 nodes, some of them in one DC, some ranges will be unavailable, right?
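To double-check that reasoning, here is a small sketch (my own, not CockroachDB tooling) that tests whether a cluster keeps quorum after losing its largest datacenter, for a few hypothetical replica placements.

def quorum(n: int) -> int:
    return n // 2 + 1

def survives_dc_loss(replicas_per_dc):
    total = sum(replicas_per_dc)
    worst_loss = max(replicas_per_dc)        # losing the biggest DC is the worst case
    return total - worst_loss >= quorum(total)

print(survives_dc_loss([3, 3]))      # False: 6 replicas, quorum 4, only 3 left
print(survives_dc_loss([2, 2, 2]))   # True:  6 replicas, quorum 4, 4 left
print(survives_dc_loss([3, 3, 3]))   # True:  9 replicas, quorum 5, 6 left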

Raft nodes count

The Raft leader node sends AppendEntries RPCs to all followers. Obviously we increase network usage when we add a new follower, so my question is about how many nodes we can add to a cluster. In the Raft paper and in other places I read that 5 nodes in a cluster is the optimal choice, but what would you say about a cluster with 100 nodes?
Yes, I understand that I can calculate whether there will be enough network bandwidth or not. My question is more general: is a cluster with tens of nodes a sign of bad architecture?
Yes, a cluster with tens of nodes is generally a bad idea. Typically, we see clusters go up to 7 nodes, but not really beyond that, and even that's atypical. 3 or 5 nodes is the most common.
If you want to scale across more than 3/5/7 nodes you typically just shard the cluster, where each shard runs a completely separate and independent instance of the Raft protocol. If you need to scale for fault tolerance, you will have to relax consistency requirements.
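As a back-of-the-envelope illustration of why large Raft groups are avoided (my own sketch, not from the Raft paper): the leader's per-entry fan-out grows linearly with cluster size, while the number of tolerated failures grows only half as fast.

# The leader must send an AppendEntries RPC to every follower, so fan-out is n - 1,
# while fault tolerance with quorum floor(n/2) + 1 is only floor((n - 1) / 2).
for n in (3, 5, 7, 100):
    followers = n - 1                 # RPCs per heartbeat / per replicated entry
    tolerated = (n - 1) // 2          # failures survivable while keeping quorum
    print(f"n={n:3d}: leader sends {followers:3d} RPCs, tolerates {tolerated:2d} failures")

# n=  3: leader sends   2 RPCs, tolerates  1 failures
# n=  5: leader sends   4 RPCs, tolerates  2 failures
# n=  7: leader sends   6 RPCs, tolerates  3 failures
# n=100: leader sends  99 RPCs, tolerates 49 failures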

discovery.zen.minimum_master_nodes value for a cluster of two nodes

I have two dedicated Windows servers (Windows Server 2012 R2, 128 GB of memory on each) for ES (2.2.0). I have one node on each server, and the two nodes form a cluster. What is the proper value for
discovery.zen.minimum_master_nodes
I read this general rule in elasticsearch.yml:
Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
I saw this SO thread:
Proper value of ES_HEAP_SIZE for a dedicated machine with two nodes in a cluster
There is an answer saying:
As described in Elasticsearch Pre-Flight Checklist, you can set
discovery.zen.minimum_master_nodes to at least (N/2)+1 on clusters
with N > 2 nodes.
Please note "N > 2". What is the proper value in my case?
N is the number of ES nodes (not physical machines but ES processes) that can be part of the cluster.
In your case, with one node on two machines, N = 2 (note that it was 4 here), so the formula N/2 + 1 yields 2, which means that both of your nodes MUST be eligible as master nodes if you want to prevent split brain situations.
If you set that value to 1 (which is the default value!) and you experience networking issues and both of your nodes can't see each other for a brief moment, each node will think it is alone in the cluster and both will elect themselves as master. You end up in a situation where you have two masters and that's not a good thing. Whereas if you set that value to 2 and you experience networking issues, the current master node will stay elected and the second node will never decide to elect itself as new master. Whenever network is back up, the second node will rejoin the cluster and continue serving requests.
The ideal topology is to have 3 dedicated master nodes (i.e. with master: true and data: false) and have discovery.zen.minimum_master_nodes set to 2. That way you'll never have to change the setting regardless of how many data nodes are part of your cluster.
So the N > 2 constraint should indeed be N >= 2, but I guess it was somehow implied, because otherwise you're creating a fertile ground for split brain situations.
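For reference, a tiny sketch (mine, following the (N/2)+1 rule quoted above, which applies to pre-7.x Elasticsearch) that computes the recommended setting from the number of master-eligible nodes:

# N is the number of master-eligible nodes, not the number of physical machines.
def minimum_master_nodes(master_eligible: int) -> int:
    return master_eligible // 2 + 1

for n in (2, 3, 5):
    print(f"{n} master-eligible nodes -> discovery.zen.minimum_master_nodes: "
          f"{minimum_master_nodes(n)}")

# 2 master-eligible nodes -> discovery.zen.minimum_master_nodes: 2
# 3 master-eligible nodes -> discovery.zen.minimum_master_nodes: 2
# 5 master-eligible nodes -> discovery.zen.minimum_master_nodes: 3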
Interestingly, in ES 7 discovery.zen.minimum_master_nodes no longer needs to be defined:
https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch

How etcd cluster handles node failure and how many can fail?

I have 7 nodes running in an etcd cluster.
4 of them fail.
Will etcd stop working when a majority of nodes is down?
If 4 out of 7 nodes in an etcd cluster fail, the cluster will stop working due to majority loss. Please refer to the following explanation about fault tolerance: (source: https://coreos.com/etcd/docs/latest/admin_guide.html)
Fault Tolerance Table
It is recommended to have an odd number of members in a cluster. Having an odd cluster size doesn't change the number needed for majority, but you gain a higher tolerance for failure by adding the extra member. You can see this in practice when comparing even and odd sized clusters:
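To spell out the arithmetic for this particular question (a sketch of my own, not etcd code): a cluster serves requests only while a majority of its original membership is reachable.

def quorum(members: int) -> int:
    return members // 2 + 1

members, failed = 7, 4
alive = members - failed
print(f"quorum needed: {quorum(members)}, members alive: {alive}, "
      f"available: {alive >= quorum(members)}")
# quorum needed: 4, members alive: 3, available: False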

Cassandra: 6 node cluster, RF=2: What to do when 2 nodes crash?

Good Day
We have a 6 node Cassandra cluster with a replication factor of 3 on our keyspaces. Our applications use QUORUM, so we can survive the loss of a single node without it affecting the application.
Let's assume I lose 2 nodes at the same time. If my application were using a consistency level of ONE it would have been fine and would have run without any issues, but we would like to keep the level at QUORUM.
My question is: if 2 nodes crash at the same time and I do a nodetool removenode for each of the crashed nodes, will the cluster then rebalance the data over the remaining 4 nodes (getting it back to 3 replicas), and once that is done, should my application be able to work again using QUORUM?
In the title you write RF=2, but in the text RF=3. You also did not specify the Cassandra version or whether you are using single-token or vnodes. A QUORUM consistency level with RF=3 means that 2 replicas must acknowledge a write/read before returning. It is possible that you face minimal or no issues even if 2 nodes die; it depends on how many ranges (partitions) the dead nodes have in common.
Have a look at this distribution example, which is exactly like the one you describe: RF=3, 6 nodes.
Using single tokens:
If you lose pairs like (1,4), (2,5) or (3,6), your cluster should allow all writes and reads with no issues. A good client will recognize the down nodes and stop using them as coordinators. Other situations, for example the loss of nodes (1,6), might lead to a situation in which any read/write of the E and F tokens will fail (assuming an equal distribution, about 33% of read/write operations will fail).
Using vnodes:
Here the situation is slightly different and also depends on which pair you lose. If you repeat the worst scenario above and lose a pair of nodes like (1,6), only the B tokens will be affected in read/write operations, since that is the only token range shared between them.
That said, just to clarify the possible scenarios, here's your answer. nodetool removenode should be used as explained in this document. Use removenode IF AND ONLY IF you want to reduce the cluster size (here is what to do if you want to replace a dead node instead). Once you have done that, your application will start working again using QUORUM, since other nodes will become responsible for the partitions previously assigned to the dead nodes.
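As a hedged illustration of why the outcome depends on which nodes die (the node names and replica placement below are hypothetical, not Cassandra internals): a QUORUM operation on a partition succeeds only if at least RF//2 + 1 of that partition's replicas are up.

def quorum_ok(replicas_for_partition, down_nodes):
    rf = len(replicas_for_partition)
    alive = len(replicas_for_partition - down_nodes)
    return alive >= rf // 2 + 1

replicas = {"n1", "n2", "n3"}              # hypothetical RF=3 placement for one partition
print(quorum_ok(replicas, {"n4", "n5"}))   # True:  neither dead node holds a replica
print(quorum_ok(replicas, {"n1", "n4"}))   # True:  2 of 3 replicas still up
print(quorum_ok(replicas, {"n1", "n2"}))   # False: only 1 of 3 replicas up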
If you are using the official DataStax Java Driver, you might want to let the driver temporarily fight your monsters for you by specifying a DowngradingConsistencyRetryPolicy.
HTH,
Carlo
