How to investigate the etcd leader change reason? - etcd

v3.5.1
The leader change can be initiated, if some follower hasn't received any heartbeats from the leader within election timeout interval only.
Q1: Is the statement above correct?
Q2: Is there some way to investigate the reason of a leader change?
Like getting the leader's heartbeats rate at the time of leader change with some (may be circular) tracing, logging, some fodc call (if any), etc. It would be enough to understand, if we had a runaway leader (because of slow logging, busy cpu, code latency) or slow network.
What I see in a one of the follower's log is that it suddenly started an election process:
{"level":"info","ts":"2023-01-25T19:40:38.423+0300","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"2ff6c96540e39fa9 is starting a new election at term 2"}
{"level":"info","ts":"2023-01-25T19:40:38.423+0300","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"2ff6c96540e39fa9 became pre-candidate at term 2"}
...
But the reason is unknown...

Related

Increment term in Raft algorithm?

i'm trying to learn the raft algorithm to implement it, i don't have understood when the term is incremented, apart from when a node pass from stato follower to state candidate there are others cases when the term is incremented? For instance when the leader increment the last commit index? Thanks
If you're looking from a standpoint of a single peer in the Raft cluster, you will update your term if:
you receive a RPC request or response that contains a term higher than yours (note here that if you're in the leader mode, you also need to step down as a leader and turn to follower)
you're in the follower mode and didn't hear from the current leader in the minimum election timeout time (you'll be switching to candidate mode)
you're in the candidate mode and didn't get enough votes to win the election and the election timeout elapses
Generally observing the system, the term will be increased only when a new election starts.

Why or why not use RequestVote RPC as heartbeat in Raft implementation?

As introduced in the paper, we use empty AppendEntries RPC for heartbeat. Then how about the RequestVote RPC? When FOLLOWER or CANDIDATE receives RequestVote RPC call, is it suppose to reset the election timeout as well? Why or why not to do so?
One benefit in my mind is that when RequestVote RPC call also treated as heartbeat, we can potentially prevent the multiple candidates condition. Since multiple candidates may split votes and take longer time in the election stage. By using that as heartbeat, the RequestVote RPC calls from one candidate will reset the election timer so that other live peers are less likely to timeout and become a candidate as well.
Well, there’s probably not anything inherently unsafe about it. But the problem is nodes that can’t win an election can still start one. So, if a node that can’t win starts an election and requests votes from all the other nodes, resetting their timers would block the election. And since the can’t-win candidate started its timer first, it would likely also timeout and start another election first, thus blocking the cluster again, and another election, and so on.
Of course, the fix for this could be to only reset election timeouts when a vote is cast. This could be safe. After all, election timeouts are randomized anyways. But the question is whether it’s effective. It wouldn’t prevent split votes since it doesn’t stop multiple nodes from requesting votes concurrently, and during split votes it would only make the election take that much longer. I suspect the pre-vote protocol is much more efficient for that reason and probably avoids split votes as well as they can be avoided.

Consul support or alternative for 2 nodes

I want to use consul for a 2-node cluster. Drawback is there's no failure tolerance for two nodes :
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to make a consistent leader election with only two nodes? Can Consul Raft Consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer protocol question, no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine - it will be available and make progress like any other cluster. It's just when one machine goes down, because your fault tolerance is zero you will lose availability and no longer make no progress. But safety is never compromised - your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as it already is in Raft), but only require a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical flexible paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
Talking about changing the protocol, there is impossibility proof by FLP which states that consensus cannot be reached if systems are less than 2f + 1 for f failures (fail-stop). Although, safety is provided but progress (liveness) cannot be ensured.
I think, the options suggested in earlier post are the best.
The choice of leader election on top of the Consul’s documentation itself requires 3 nodes. This relies on the health-checks mechanism, as well as the sessions. Sessions are essentially distributed locks automatically released by TTL or when the service crashes.
To build 2-node Consul cluster we have to use another approach, supposedly called Leader Lease. Since we already have Consul KV-storage with CAS support, we can simply write to it which machine is the leader before the expiration of such and such time. As long as the leader is alive and well, it can periodically extend it's time. If the leader dies, someone will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and when the leader performs any action, verify that it has enough time left to complete this action.
A key is created in the KV-storage, containing something like “node X is the leader before time Y”, where Y is calculated as the current time + some time interval(T). As a leader, node X updates the record once every T/2 or T/3 units of time, thereby extending it's leadership role. If a node falls or cannot reach the KV-storage, after the interval(T) its place will be taken by the node, which will be the first to discover that the leadership role has been released.
CAS is needed to prevent a race condition if the two nodes simultaneously try to become a leader. CAS Specifies to use a Check-And-Set operation. This is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key.

How do Raft guarantee consistency when network partition occurs?

Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed
before processing a read-only request (its information
may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat
messages with a majority of the cluster before responding
to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers it maintains a staleness measure: how long it has been since it has received an okay from a quorum? The leader will check this staleness measure and return a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last can be improved when the client includes its timestamp in the request, the raft server echos it back in the response and the client adds the round trip time to raft staleness. (NB. Always use the nano clock in measuring time differences because it doesn't go backwards like the system clock does.)
Not sure whether timeout configure can solve this problem:
2 x heartbeat interval <= election timeout
which means when network partition happens leader A is single leader and write will fail because leader A locates in the minority and leader A can not get echo back from majority of the node and step back as a follower.
After that leader B is selected, it can catch up the latest change from at least one of the followers and then client can perform read and write on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explaination
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved: (their names are self explanatory but link has been included to read detailed definition.)
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
Must condition for node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked "whether it is heartbeating with quorum of followers or not"[6, 7]. If heartbeat stops then its tolerated till LeaderLease timeout[8]. If Leader is not able to contact quorum of nodes for LeaderLease timeout then Leader node has to become Follower[9]
Hence for example given in question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows its not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

How does the Raft algorithm guarantee consensus if there are multiple leaders?

As the paper says:
Election Safety: at most one leader can be elected in a given term. §5.2
However, there may be more than one leader in the system. Raft only can promise that there is only one leader in a given term. So If I have more than one client, wouldn't I get different data? How does this allow Raft to be a consensus algorithm?
Is there something I don't understand here, that someone could explain?
Only a candidate node which has a majority of votes can lead. Only one majority exists in cluster the other node cannot hear from a majority without contacting at least one node which has already voted for another leader. The candidate who hears of the other leader will step down. Here is a nice animation which shows how it happens: http://thesecretlivesofdata.com/raft/#election
Yes you are right. There can be multiple leaders at the same time, but not in the same term, so the guarantee still holds. A possible situation is in a 3-server (A, B, C) cluster, A becomes elected. And then a network partition happens and the cluster is separated into 2 partitions: {A} and {B, C}. In this case, A would not step down as it does not receive any RPC with a higher term and remains a leader. In the majority partition, a new leader can still be elected. But notice that this new leader is in a greater term than A.
Then how about the request from the client? Two cases.
1. For a WRITE request, the leader cannot reply to the client unless the entry log committed, which is impossible for the outdated leader. So no problem. Only the true leader would be able to commit the entry by replicating it on a majority of servers.
2. For a READ-ONLY request, the leader can get away without consulting the log or committing the entry. You are right and this is explicitly mentioned in the paper at the end of section 8.
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Hope this helps.
this is the question I asked three years ago. Right now, I can answer the question myself.
The key point here is that even the read operation, the client need to go through the raft protocol. If the client request the old leader, the old leader launch AppendEntry to confirm that whether it is the newest leader. It will notice that it is the old leader or the AppendEntry is timeout, then it will return to client timeout or error.
Every machine in the cluster compares its current term against the term it recieves along with all the requests it gets from the other machines. And whenever a "leader" tries to act as a leader, it will not get a majority accepts from the rest of the cluster since the majority of the machines have greater term then the "leader". That guarantees that only the actual leader will be able to reply on clients requests.
Additionally, according to Raft, this "leader" will become a follower immediately after it recieves a reject with a greater term.

Resources