i'm trying to learn the raft algorithm to implement it, i don't have understood when the term is incremented, apart from when a node pass from stato follower to state candidate there are others cases when the term is incremented? For instance when the leader increment the last commit index? Thanks
If you're looking from a standpoint of a single peer in the Raft cluster, you will update your term if:
you receive a RPC request or response that contains a term higher than yours (note here that if you're in the leader mode, you also need to step down as a leader and turn to follower)
you're in the follower mode and didn't hear from the current leader in the minimum election timeout time (you'll be switching to candidate mode)
you're in the candidate mode and didn't get enough votes to win the election and the election timeout elapses
Generally observing the system, the term will be increased only when a new election starts.
Related
As introduced in the paper, we use empty AppendEntries RPC for heartbeat. Then how about the RequestVote RPC? When FOLLOWER or CANDIDATE receives RequestVote RPC call, is it suppose to reset the election timeout as well? Why or why not to do so?
One benefit in my mind is that when RequestVote RPC call also treated as heartbeat, we can potentially prevent the multiple candidates condition. Since multiple candidates may split votes and take longer time in the election stage. By using that as heartbeat, the RequestVote RPC calls from one candidate will reset the election timer so that other live peers are less likely to timeout and become a candidate as well.
Well, there’s probably not anything inherently unsafe about it. But the problem is nodes that can’t win an election can still start one. So, if a node that can’t win starts an election and requests votes from all the other nodes, resetting their timers would block the election. And since the can’t-win candidate started its timer first, it would likely also timeout and start another election first, thus blocking the cluster again, and another election, and so on.
Of course, the fix for this could be to only reset election timeouts when a vote is cast. This could be safe. After all, election timeouts are randomized anyways. But the question is whether it’s effective. It wouldn’t prevent split votes since it doesn’t stop multiple nodes from requesting votes concurrently, and during split votes it would only make the election take that much longer. I suspect the pre-vote protocol is much more efficient for that reason and probably avoids split votes as well as they can be avoided.
Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time.
From this paper, we got know term begins with an election and increase monotonically.
My question is when will term increase?
Does it increase when over physical time? e.g. every minute, or every hour.
Is it relating to logical time?
Does it only increase when new election happens?
How frequently will term change?
How many log entries will be generated within a term?
The term is a logical timestamp, or what’s more generally referred to in distributed systems as an epoch. The frequency with which terms change is entirely dependent on node and network conditions. Terms are incremented only when a member starts a new election. So, the term will be incremented e.g. after a leader crashes, if a network partition leads to some members’ election timers expiring, if there’s enough latency in the network to expire election timers, or if an election ends without a winner.
I am using the following pdf as reference.
It says that lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
Also is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
Your observation is reasonable: most of the time, nextIndex equals matchIndex + 1, but it is not always the case.
For example, when a leader is initiated, matchIndex is initiated to the 0, while nextIndex is initiated to the last log index + 1.
The difference here is because these two fields are used for different purposes: matchIndex is an accurate value indicating the index up to which all the log entries in leader and follower match. However, nextIndex is only an optimistic "guess" indicating which index the leader should try for the next AppendEntries operation, it can be a good guess (i.e. it equals matchIndex + 1) in which case the AppendEntries operation will succeed, but it can also be a bad guess (e.g. in the case when a leader was just initiated) in which case the AppendEntries will fail so that the leader will decrement nextIndex and retry.
As for lastApplied, it's simply another accurate value indicating the index up to which all the log entries in a follower have been applied to the underlying state machine. It's similar to matchIndex in that they both are both accurate values instead of heuristic "guess", but they really mean different things and serve for different purposes.
... lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
These are different in a practical system because the component that commits the data in the log is typically separate from the component that applies it to replicated state machine or database. The commitIndex is typically just nanoseconds or maybe a few milliseconds more up-to-date than lastApplied.
Is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
They are different. There is a period of time when the data is on a server and not yet committed, such as during the replication itself.
The leader keeps track of the latest un-committed data on each of its peers and only need to send log[matchIndex[peer], ...] to each peer instead of the whole log. This is especially useful if the peer is significantly behind the leader; because the leader can update the peer with a series of small AppendEntries calls, incrementally bringing the peer up to date.
commit is not mean already applied, there is time different between them. but eventually applied will catch up commit index.
matchIndex[i] which is saved in leader is equal to follower_i's commitIndex, and they are try to catch up to nextIndex
Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed
before processing a read-only request (its information
may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat
messages with a majority of the cluster before responding
to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers it maintains a staleness measure: how long it has been since it has received an okay from a quorum? The leader will check this staleness measure and return a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last can be improved when the client includes its timestamp in the request, the raft server echos it back in the response and the client adds the round trip time to raft staleness. (NB. Always use the nano clock in measuring time differences because it doesn't go backwards like the system clock does.)
Not sure whether timeout configure can solve this problem:
2 x heartbeat interval <= election timeout
which means when network partition happens leader A is single leader and write will fail because leader A locates in the minority and leader A can not get echo back from majority of the node and step back as a follower.
After that leader B is selected, it can catch up the latest change from at least one of the followers and then client can perform read and write on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explaination
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved: (their names are self explanatory but link has been included to read detailed definition.)
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
Must condition for node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked "whether it is heartbeating with quorum of followers or not"[6, 7]. If heartbeat stops then its tolerated till LeaderLease timeout[8]. If Leader is not able to contact quorum of nodes for LeaderLease timeout then Leader node has to become Follower[9]
Hence for example given in question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows its not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894
As the paper says:
Election Safety: at most one leader can be elected in a given term. §5.2
However, there may be more than one leader in the system. Raft only can promise that there is only one leader in a given term. So If I have more than one client, wouldn't I get different data? How does this allow Raft to be a consensus algorithm?
Is there something I don't understand here, that someone could explain?
Only a candidate node which has a majority of votes can lead. Only one majority exists in cluster the other node cannot hear from a majority without contacting at least one node which has already voted for another leader. The candidate who hears of the other leader will step down. Here is a nice animation which shows how it happens: http://thesecretlivesofdata.com/raft/#election
Yes you are right. There can be multiple leaders at the same time, but not in the same term, so the guarantee still holds. A possible situation is in a 3-server (A, B, C) cluster, A becomes elected. And then a network partition happens and the cluster is separated into 2 partitions: {A} and {B, C}. In this case, A would not step down as it does not receive any RPC with a higher term and remains a leader. In the majority partition, a new leader can still be elected. But notice that this new leader is in a greater term than A.
Then how about the request from the client? Two cases.
1. For a WRITE request, the leader cannot reply to the client unless the entry log committed, which is impossible for the outdated leader. So no problem. Only the true leader would be able to commit the entry by replicating it on a majority of servers.
2. For a READ-ONLY request, the leader can get away without consulting the log or committing the entry. You are right and this is explicitly mentioned in the paper at the end of section 8.
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Hope this helps.
this is the question I asked three years ago. Right now, I can answer the question myself.
The key point here is that even the read operation, the client need to go through the raft protocol. If the client request the old leader, the old leader launch AppendEntry to confirm that whether it is the newest leader. It will notice that it is the old leader or the AppendEntry is timeout, then it will return to client timeout or error.
Every machine in the cluster compares its current term against the term it recieves along with all the requests it gets from the other machines. And whenever a "leader" tries to act as a leader, it will not get a majority accepts from the rest of the cluster since the majority of the machines have greater term then the "leader". That guarantees that only the actual leader will be able to reply on clients requests.
Additionally, according to Raft, this "leader" will become a follower immediately after it recieves a reject with a greater term.