The distributed key-value store etcd uses the Raft algorithm. The docs link to animations explaining how the replica nodes vote to elect one node as the leader (the recipient of external write requests), and how the leader thereafter broadcasts all writes to all nodes (attaching them to a heartbeat signal sent out to the other nodes in a star topology, with a write confirmed once a majority acknowledge it).
The replication obviously provides resilience (against failures of individual nodes), and presumably the read performance scales up with replica count.
Is it correct to understand that write performance is constant, and does not scale with replica count?
That is correct. A write requires a majority of nodes to acknowledge the new entry before it can be committed. A write may even get slower as the number of replicas increases, since it completes only as fast as the slowest node in the quorum. Regarding reads, you might find the etcd docs on linearizability interesting. TL;DR: by default, reads also need a quorum.
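To make the quorum arithmetic concrete, here is a minimal Go sketch (the node counts and latencies are made-up illustration values, not etcd measurements): a Raft leader needs acknowledgements from a strict majority, so a write commits only as fast as the slowest node inside that majority, and adding replicas can raise that bound.

```go
package main

import (
	"fmt"
	"sort"
)

// quorum returns the number of acknowledgements needed before an entry
// may be committed: a strict majority of the cluster.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// commitLatency models how long a write takes: the leader counts itself,
// then waits for the fastest followers until a quorum is reached, so the
// commit time is the latency of the slowest node inside that quorum.
func commitLatency(followerLatenciesMs []int) int {
	sorted := append([]int(nil), followerLatenciesMs...)
	sort.Ints(sorted)
	needed := quorum(len(followerLatenciesMs)+1) - 1 // leader already counted
	if needed == 0 {
		return 0
	}
	return sorted[needed-1]
}

func main() {
	// Hypothetical follower round-trip times in milliseconds.
	fmt.Println(quorum(3), commitLatency([]int{2, 9}))                // 2 of 3 nodes, commit at 2 ms
	fmt.Println(quorum(5), commitLatency([]int{2, 3, 9, 40}))         // 3 of 5 nodes, commit at 3 ms
	fmt.Println(quorum(7), commitLatency([]int{2, 3, 5, 9, 40, 80}))  // 4 of 7 nodes, commit at 5 ms
}
```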
We operate an Elasticsearch stack which uses 3 nodes to store log data. Our current config is to have indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided not to spend time on optimization yet.)
After a node outage (let's assume a full disk), I have observed that Elasticsearch automatically redistributes its shards to the remaining instances, as advertised.
However, this increases disk usage on the remaining two instances, making the cluster a candidate for cascading failure.
Durability of the log data is not paramount. I am therefore thinking about reconfiguring Elasticsearch to not create a new replica after a node outage. Instead, it could just keep running on the primaries only. This means that after a single node outage, we would run without redundancy, but that seems better than a cascading failure. (This is a one-time cost.)
An alternative would be to just increase disk size. (This is an ongoing cost.)
My question
(How) can I configure Elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea, and the canonical way is to just increase disk capacity?
Rebalancing is expensive
When a node leaves the cluster, some additional load is generated on the remaining nodes:
Promoting a replica shard to primary to replace any primaries that were on the node.
Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
Rebalancing shards evenly across the remaining nodes.
This can lead to quite a lot of data being moved around.
Sometimes, a node is only missing for a short period of time, and a full rebalance is not justified in such a case. To account for that, when a node goes down, Elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but it then waits for one minute before creating new replicas, to avoid unnecessary copying.
Only rebalance when required
The duration of this delay is a tradeoff and can therefore be configured. Waiting longer means less chance of useless copying, but also a higher chance of downtime due to reduced redundancy.
Increasing the delay to a few hours gives me what I am looking for: it buys our engineers some time to react before the additional rebalancing load can trigger a cascading failure.
I learned that from the official Elasticsearch documentation.
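For reference, the knob described above is index.unassigned.node_left.delayed_timeout. Below is a minimal Go sketch of setting it on all indices through the REST API; the cluster URL and the 4h value are assumptions for illustration, and a plain curl PUT works just as well.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Apply the delayed-allocation timeout to all indices. The setting name
	// and endpoint come from the Elasticsearch docs on delayed allocation;
	// the cluster URL and the "4h" value are placeholders for your own setup.
	body := []byte(`{"settings": {"index.unassigned.node_left.delayed_timeout": "4h"}}`)

	req, err := http.NewRequest(http.MethodPut, "http://localhost:9200/_all/_settings", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // expect 200 OK if the settings were accepted
}
```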
I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (a.k.a. partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; Pacemaker's default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
Questions:
From the above, I am somewhat confused: why can we not just stop all cluster resources, as in the 6-node cluster case? What is special about a two-node cluster?
You are correct that a two-node cluster can only have quorum when both nodes are in communication. Thus, if the cluster were to split, under the default behavior the resources would stop.
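Plugging a few numbers into the inequality quoted from the article makes the two-node special case obvious; here is a quick Go sketch of that rule (node counts taken from the examples above):

```go
package main

import "fmt"

// hasQuorum implements the rule from the quoted article:
// a partition has quorum when total_nodes < 2 * active_nodes.
func hasQuorum(totalNodes, activeNodes int) bool {
	return totalNodes < 2*activeNodes
}

func main() {
	fmt.Println(hasQuorum(5, 3)) // true:  the 3-node side of a 5-node cluster keeps quorum
	fmt.Println(hasQuorum(5, 2)) // false: the 2-node side does not
	fmt.Println(hasQuorum(6, 3)) // false: a 6-node cluster split 3/3 leaves neither side with quorum
	fmt.Println(hasQuorum(2, 1)) // false: a 2-node cluster loses quorum as soon as one node is gone
}
```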
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
...But wait, what happens if the cluster communication is broken but both nodes are still operational? Will they not consider their peer dead and both become the active node? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time, essentially preventing split-brain from ever occurring.
An excellent article further explaining STONITH and its importance was written by LMB back in 2010: http://advogato.org/person/lmb/diary/105.html
The Raft leader node sends AppendEntries RPCs to all followers. Obviously, network usage increases with every follower we add, so my question is about how many nodes we can add to a cluster. In the Raft paper and in other places I read that 5 nodes is the optimal cluster size, but what would you say about a cluster with 100 nodes?
Yes, I understand that I can calculate whether there will be enough network bandwidth or not. My question is more general: is a cluster with tens of nodes a sign of bad architecture?
Yes, a cluster with tens of nodes is generally a bad idea. Typically, we see clusters go up to 7 nodes, but not really beyond that, and even that's atypical. 3 or 5 nodes is the most common.
If you want to scale across more than 3/5/7 nodes, you typically just shard the cluster, where each shard runs a completely separate and independent instance of the Raft protocol. If you instead need to scale for fault tolerance, you will have to relax consistency requirements.
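Here is a minimal Go sketch of that sharding idea (the key names, hash choice, and group count are illustrative assumptions, not part of Raft itself): each shard is its own small Raft group, and a client-side router decides which group owns a key.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor routes a key to one of several independent Raft groups.
// Each group has its own leader and its own 3- or 5-node membership,
// so write throughput scales with the number of groups rather than
// with the number of nodes in any single group.
func shardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numShards
}

func main() {
	// Hypothetical layout: 4 Raft groups spread over a pool of machines.
	for _, key := range []string{"user:42", "order:7", "session:abc"} {
		fmt.Printf("%s -> raft group %d\n", key, shardFor(key, 4))
	}
}
```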
In Raft, all operation requests are forwarded to the leader node, and the leader then sends log entries to all followers. So in a heavily loaded environment, the leader node becomes a bottleneck. How can this be solved?
This can be solved in different ways depending on your desires. Here are some example solutions.
Partition the data. Many large-scale systems partition the data to spread the load (as well as to reduce the impact if a partition goes down). But transactions cannot cross partitions. That could be a bummer, depending on your application.
Chain consensus. This protocol spreads the work of moving data across all the nodes in the cluster. There is still a leader that is a bottleneck for accepting data, but its burden is smaller. Chain consensus also leads to slightly higher latencies than a broadcast system.
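To illustrate why the chain shape lightens the leader's load, here is a deliberately simplified, single-process Go sketch (no real RPCs, failure handling, or per-hop acknowledgements): each node hands the entry to exactly one successor instead of the leader fanning out to every follower.

```go
package main

import "fmt"

// node is one link in a replication chain: it stores the entry locally
// and then hands it to its successor, so the head only ever talks to
// one other node instead of broadcasting to every follower.
type node struct {
	name string
	log  []string
	next *node
}

func (n *node) replicate(entry string) {
	n.log = append(n.log, entry)
	if n.next != nil {
		n.next.replicate(entry) // forward down the chain
		return
	}
	// Reaching the tail means every node in the chain has stored the entry.
	fmt.Printf("%s (tail) acks %q\n", n.name, entry)
}

func main() {
	// Hypothetical 3-node chain: head -> middle -> tail.
	tail := &node{name: "n3"}
	middle := &node{name: "n2", next: tail}
	head := &node{name: "n1", next: middle}

	head.replicate("set x=1")
	head.replicate("set y=2")
}
```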
I read how current leader election algorithms like Raft, Paxos, or Zab elect a master in a cluster, and I couldn't understand why they use such sophisticated algorithms instead of the simple bully algorithm.
I'm developing a cluster library and use UDP multicast for heartbeat messages. Each node joins a multicast address and also periodically sends datagram packets to that address. If the nodes find out that a new node is sending packets to this multicast address, the node is simply added to the cluster; similarly, when the nodes in the cluster don't get any packets from a node, they remove it from the cluster. When I need to choose a master node, I simply iterate over the nodes in the cluster and choose the oldest one.
I read some articles implying that this approach is not effective and that more sophisticated algorithms like Paxos should be used in order to elect a master or detect failures via heartbeat messages. I couldn't understand why Paxos handles split-brain scenarios or other network failures better than the traditional bully algorithm, because I can easily find out when a quorum of nodes has left the cluster without using Raft. The only benefit I see is the number of packets that each server has to handle: only the master sends heartbeat messages in Raft, while in my case each node has to send heartbeat messages to every other node. However, I don't think this is a problem, since I can simply implement a similar heartbeat scheme without changing my master election algorithm.
Can someone elaborate on that?
From a theoretical perspective, Raft, Paxos and Zab are not leader election algorithms. They solve a different problem called consensus.
In your concrete scenario, the difference is the following. With a leader election algorithm, you can only guarantee that eventually one node is a leader. That means that during a period of time, multiple nodes might think they are the leader and, consequently, might act like one. In contrast, with the consensus algorithms above, you can guarantee that there is at most one leader in a logical time instant.
The consequence is this: if the safety of the system depends on the presence of a single leader, you might get into trouble relying only on leader election. Consider an example. Nodes receive messages from the UDP multicast and do A if the sender is the leader, but do B if the sender is not the leader. If two nodes check for the oldest node in the cluster at slightly different points in time, they might see different leaders. These two nodes might then receive a multicast message and process it in different ways, possibly violating some safety property of the system you'd like to hold (e.g., that all nodes either do A or do B, but never one does A while another does B).
With Raft, Paxos, and Zab, you can overcome that problem, since these algorithms create a sort of sequence of logical epochs, with at most one leader in each epoch.
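Here is a tiny Go sketch of that epoch idea, borrowing Raft's notion of a term (the types and names are illustrative, not any library's API): every message carries the epoch it was sent in, and followers drop messages stamped with an older epoch, which rules out the "one node does A while another does B" outcome.

```go
package main

import "fmt"

// message is what a would-be leader attaches to everything it sends:
// the payload plus the epoch (Raft calls it a term) in which it was elected.
type message struct {
	term    int
	leader  string
	payload string
}

// follower tracks the highest term it has seen and ignores anything
// stamped with an older term, so a deposed leader cannot make nodes
// diverge even if it still believes it is in charge.
type follower struct {
	currentTerm int
}

func (f *follower) accept(m message) bool {
	if m.term < f.currentTerm {
		return false // stale leader from an earlier epoch
	}
	f.currentTerm = m.term
	return true
}

func main() {
	f := &follower{}
	fmt.Println(f.accept(message{term: 3, leader: "A", payload: "do A"})) // true
	fmt.Println(f.accept(message{term: 2, leader: "B", payload: "do B"})) // false: B's epoch is stale
}
```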
Two notes here. First, the bully algorithm is defined for synchronous systems. If you really implement it as described in the paper by Garcia-Molina, I believe you might experience problems in your partially synchronous system. Second, the Zab algorithm relies on a sort of bully algorithm for asynchronous systems: the leader is elected by comparing the lengths of the nodes' histories (which minimizes the recovery time of the system). Once the leader is elected, it tries to start the Zab protocol, which in turn locks the epoch for the leader.