How does raft handle committing entries from previous one? - algorithm

In raft paper section 5.4.2
If a leader crashes before
committing an entry, future leaders will attempt to
finish replicating the entry. However, a leader cannot immediately
conclude that an entry from a previous term is
committed once it is stored on a majority of servers. There could be a situation where an old log entry is stored
on a majority of servers, yet can still be overwritten by a
future leader.
The author mentioned that to avoid the situation above
To eliminate problems like the one in Figure 8, Raft
never commits log entries from previous terms by counting
replicas. Only log entries from the leader’s current
term are committed by counting replicas; once an entry
from the current term has been committed in this way,
then all prior entries are committed indirectly because
of the Log Matching Property.
But wouldn't the same problem still occur?
Given the following situation that the author provided
When S5 is elected leader it only looks at its current committed log which is (term3, index1) and this is going to override term2 entries in all followers.
How does making a leader looking at its own committed log solve the problem?

Read the caption on this image. Both (d) and (e) are possible resolutions to the state of the log produced by (a), (b), and (c). The problem is, even though in (c) entry (2, 2) was replicated to a majority of the cluster, this is illustrating that it could still be overwritten when S5 is elected in (d). So, the solution is to only allow nodes to commit entries from their own term. In other words, replicating an entry on a majority of nodes does not equal commitment. In (c), entry (2, 2) is replicated to a majority of the cluster, but because it's not in the leader's term (at least 4) it's not committed. But in (e), after the leader replicates an entry from its current term (4), that prevents the situation in (d) from occurring since S5 can no longer be elected leader.

After S1 replicates entry 4 with a higher term than 2 and 3. S5 will no longer be elected as leader, since the leader election strategy of Raft:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
So, in my opinion, the appended log entry 4 in (e) implicitly promote all the entries' term before it. Because what we only care about is the term or the last entry, rather than entry 2 any more.
This is just like what the proposer do in Phase 2 of Paxos:
If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
That's say, propose the learned value 2 with a higher propose number.

I think both situations in figure 8 (d) and (e) are legal in Raft because the paper says:
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas.
In figure 8(d) the entries with term 2 is not in the local log of leader S5, and they are not committed to the state machine. It is ok to overwrite them with entries with term 3. Only entries in leader's current log are eligible to be considered as committed by counting the number of replicas.

If we allow entry from the previous term being committed, after (c), entry numbered 2 will be committed. After that, if 3 is selected as the leader, it will overwrite the committed 2. Thus, S5 and S1 will execute different commands. To prevent that, we will not allow 2 committed. Thus, the commands that are executed in all state machines will become consistent.

I think the question is that the leader crashed after few followers applied log to the state machine.
In (c) in the figure, the Leader has copied the log of term 2 to more than half of the nodes, and then the Leader updates the commitIndex to 2 and sends the heartbeat. And before leader crash, only S2 received the heartbeat and applies the log of term 2 to the state machine .According to the paper, S5 can be new leader by votes from S3 and S4, and try to append the log of term 3 to S2~S4. But, this operation should not be allowed, because S2 has already applied the log at index 2 to state machine.
Seems Raft does not cover this situation

Related

Raft algorithm: When will term increase?

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time.
From this paper, we got know term begins with an election and increase monotonically.
My question is when will term increase?
Does it increase when over physical time? e.g. every minute, or every hour.
Is it relating to logical time?
Does it only increase when new election happens?
How frequently will term change?
How many log entries will be generated within a term?
The term is a logical timestamp, or what’s more generally referred to in distributed systems as an epoch. The frequency with which terms change is entirely dependent on node and network conditions. Terms are incremented only when a member starts a new election. So, the term will be incremented e.g. after a leader crashes, if a network partition leads to some members’ election timers expiring, if there’s enough latency in the network to expire election timers, or if an election ends without a winner.

What is lastApplied and matchIndex in raft protocol for volatile state in server?

I am using the following pdf as reference.
It says that lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
Also is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
Your observation is reasonable: most of the time, nextIndex equals matchIndex + 1, but it is not always the case.
For example, when a leader is initiated, matchIndex is initiated to the 0, while nextIndex is initiated to the last log index + 1.
The difference here is because these two fields are used for different purposes: matchIndex is an accurate value indicating the index up to which all the log entries in leader and follower match. However, nextIndex is only an optimistic "guess" indicating which index the leader should try for the next AppendEntries operation, it can be a good guess (i.e. it equals matchIndex + 1) in which case the AppendEntries operation will succeed, but it can also be a bad guess (e.g. in the case when a leader was just initiated) in which case the AppendEntries will fail so that the leader will decrement nextIndex and retry.
As for lastApplied, it's simply another accurate value indicating the index up to which all the log entries in a follower have been applied to the underlying state machine. It's similar to matchIndex in that they both are both accurate values instead of heuristic "guess", but they really mean different things and serve for different purposes.
... lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
These are different in a practical system because the component that commits the data in the log is typically separate from the component that applies it to replicated state machine or database. The commitIndex is typically just nanoseconds or maybe a few milliseconds more up-to-date than lastApplied.
Is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
They are different. There is a period of time when the data is on a server and not yet committed, such as during the replication itself.
The leader keeps track of the latest un-committed data on each of its peers and only need to send log[matchIndex[peer], ...] to each peer instead of the whole log. This is especially useful if the peer is significantly behind the leader; because the leader can update the peer with a series of small AppendEntries calls, incrementally bringing the peer up to date.
commit is not mean already applied, there is time different between them. but eventually applied will catch up commit index.
matchIndex[i] which is saved in leader is equal to follower_i's commitIndex, and they are try to catch up to nextIndex

How do Raft guarantee consistency when network partition occurs?

Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed
before processing a read-only request (its information
may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat
messages with a majority of the cluster before responding
to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers it maintains a staleness measure: how long it has been since it has received an okay from a quorum? The leader will check this staleness measure and return a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last can be improved when the client includes its timestamp in the request, the raft server echos it back in the response and the client adds the round trip time to raft staleness. (NB. Always use the nano clock in measuring time differences because it doesn't go backwards like the system clock does.)
Not sure whether timeout configure can solve this problem:
2 x heartbeat interval <= election timeout
which means when network partition happens leader A is single leader and write will fail because leader A locates in the minority and leader A can not get echo back from majority of the node and step back as a follower.
After that leader B is selected, it can catch up the latest change from at least one of the followers and then client can perform read and write on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explaination
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved: (their names are self explanatory but link has been included to read detailed definition.)
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
Must condition for node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked "whether it is heartbeating with quorum of followers or not"[6, 7]. If heartbeat stops then its tolerated till LeaderLease timeout[8]. If Leader is not able to contact quorum of nodes for LeaderLease timeout then Leader node has to become Follower[9]
Hence for example given in question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows its not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

How does raft deal with log in this scenario?

Assume d is elected as an leader in the above picture, how will it deal with the log with index 11 and 12. In my opinion, it should delete the two logs, but I don't find any clues in the raft paper about how to deal logs like the above scenario.
If (d) is elected leader, then it will replicate its log to the followers, it won't remove the items at index 11 & 12. See section 5.3 on log replication in the raft paper where it says
In Raft, the leader handles inconsistencies by forcing the followers’
logs to duplicate its own. This means that conflicting entries in
follower logs will be overwritten with entries from the leader’s log.
The rules around leader election ensure that this is a safe decision to make.

How does the Raft algorithm guarantee consensus if there are multiple leaders?

As the paper says:
Election Safety: at most one leader can be elected in a given term. §5.2
However, there may be more than one leader in the system. Raft only can promise that there is only one leader in a given term. So If I have more than one client, wouldn't I get different data? How does this allow Raft to be a consensus algorithm?
Is there something I don't understand here, that someone could explain?
Only a candidate node which has a majority of votes can lead. Only one majority exists in cluster the other node cannot hear from a majority without contacting at least one node which has already voted for another leader. The candidate who hears of the other leader will step down. Here is a nice animation which shows how it happens: http://thesecretlivesofdata.com/raft/#election
Yes you are right. There can be multiple leaders at the same time, but not in the same term, so the guarantee still holds. A possible situation is in a 3-server (A, B, C) cluster, A becomes elected. And then a network partition happens and the cluster is separated into 2 partitions: {A} and {B, C}. In this case, A would not step down as it does not receive any RPC with a higher term and remains a leader. In the majority partition, a new leader can still be elected. But notice that this new leader is in a greater term than A.
Then how about the request from the client? Two cases.
1. For a WRITE request, the leader cannot reply to the client unless the entry log committed, which is impossible for the outdated leader. So no problem. Only the true leader would be able to commit the entry by replicating it on a majority of servers.
2. For a READ-ONLY request, the leader can get away without consulting the log or committing the entry. You are right and this is explicitly mentioned in the paper at the end of section 8.
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Hope this helps.
this is the question I asked three years ago. Right now, I can answer the question myself.
The key point here is that even the read operation, the client need to go through the raft protocol. If the client request the old leader, the old leader launch AppendEntry to confirm that whether it is the newest leader. It will notice that it is the old leader or the AppendEntry is timeout, then it will return to client timeout or error.
Every machine in the cluster compares its current term against the term it recieves along with all the requests it gets from the other machines. And whenever a "leader" tries to act as a leader, it will not get a majority accepts from the rest of the cluster since the majority of the machines have greater term then the "leader". That guarantees that only the actual leader will be able to reply on clients requests.
Additionally, according to Raft, this "leader" will become a follower immediately after it recieves a reject with a greater term.

Resources