How does ZooKeeper deal with time?
Are the znodes/clients synchronized? And how?
Otherwise, how does the algorithm work without time synchronization?
I see a related question here, but it does not answer my question:
How does Zookeeper synchronize the clock in the cluster
Thanks in advance.
As you may have heard, ZooKeeper elects a leader for the ensemble. Thereafter, all the write requests go through the leader.
Therefore, the leader is responsible for preserving the order of write requests. (Yes, the order is determined by the time at which a request reaches the leader.) When all the write requests are served by the leader, there is no need to worry about synchronization, right? ZooKeeper doesn't depend on clock synchronization.
How the leader transmits new values to the followers is a separate problem, which is addressed through the consensus protocol ZAB (ZooKeeper Atomic Broadcast). This protocol makes sure that a majority of the ensemble has applied the new value before an OK response is sent for the write request.
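As a rough illustration of that quorum rule (this is not ZooKeeper/ZAB code, just the arithmetic of "acknowledge a write once a strict majority has acked it"):

```go
// Minimal sketch of the majority-ack rule described above; this is not
// ZooKeeper code, only an illustration of "commit once a quorum acks".
package main

import "fmt"

// committed reports whether a proposal acked by `acks` out of `ensemble`
// servers may be acknowledged, i.e. whether a strict majority has acked it.
func committed(acks, ensemble int) bool {
	return acks >= ensemble/2+1
}

func main() {
	fmt.Println(committed(2, 3)) // true: 2 of 3 is a majority
	fmt.Println(committed(2, 5)) // false: 2 of 5 is not
}
```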
I want to use Consul for a 2-node cluster. The drawback is that there is no failure tolerance for two nodes:
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to make a consistent leader election with only two nodes? Can Consul's Raft consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer the protocol question: no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine - it will be available and make progress like any other cluster. It's just that when one machine goes down, because your fault tolerance is zero, you will lose availability and no longer make progress. But safety is never compromised - your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as it already is in Raft), but only require a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical flexible paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
Talking about changing the protocol: there is the FLP impossibility result, and, more to the point, consensus cannot be reached (with liveness) when the system has fewer than 2f + 1 nodes for f (fail-stop) failures. Safety is still provided, but progress (liveness) cannot be ensured.
I think the options suggested in the earlier post are the best.
The leader-election recipe in Consul's own documentation requires 3 nodes. It relies on the health-check mechanism as well as on sessions. Sessions are essentially distributed locks automatically released by TTL or when the service crashes.
To build a 2-node Consul cluster we have to use another approach, usually called a leader lease. Since we already have Consul's KV storage with CAS support, we can simply write to it which machine is the leader until some expiration time. As long as the leader is alive and well, it can periodically extend its lease. If the leader dies, someone will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and, when the leader performs any action, verify that it has enough lease time left to complete that action.
A key is created in the KV storage, containing something like "node X is the leader before time Y", where Y is calculated as the current time plus some time interval (T). As the leader, node X updates the record once every T/2 or T/3 units of time, thereby extending its leadership. If a node dies or cannot reach the KV storage, after the interval T its place will be taken by whichever node is the first to discover that the leadership has been released.
CAS is needed to prevent a race condition if the two nodes simultaneously try to become the leader. CAS specifies the use of a check-and-set operation, which is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key. A rough sketch of this lease scheme, using the KV API's cas parameter, is given below.
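Here is a minimal sketch of that leader-lease acquisition against Consul's KV HTTP API, assuming a local agent on 127.0.0.1:8500; the key name, lease interval, and node name are made up for illustration:

```go
// Hypothetical sketch of the leader-lease idea described above, using
// Consul's KV HTTP API with check-and-set. Key name, lease duration and
// node name are illustrative only.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

const (
	consulKV  = "http://127.0.0.1:8500/v1/kv/service/my-app/leader" // hypothetical key
	leaseTime = 30 * time.Second                                    // the interval T from the text
)

// tryAcquire attempts to write "node X is the leader before time Y" with
// cas=0, i.e. only if no current leader record exists (see the CAS
// semantics described above).
func tryAcquire(node string) (bool, error) {
	deadline := time.Now().Add(leaseTime).UTC().Format(time.RFC3339)
	body := bytes.NewBufferString(fmt.Sprintf("%s is the leader before %s", node, deadline))
	req, err := http.NewRequest(http.MethodPut, consulKV+"?cas=0", body)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	// Consul answers "true" if the CAS put succeeded, "false" otherwise.
	return string(bytes.TrimSpace(out)) == "true", nil
}

func main() {
	ok, err := tryAcquire("node-A")
	if err != nil {
		panic(err)
	}
	fmt.Println("became leader:", ok)
}
```

A real implementation would renew the record every T/2 or T/3 with cas set to the ModifyIndex of the current record, and a follower that observes the deadline has passed would overwrite the record the same way, using the ModifyIndex it last read, so that only one node can win the takeover.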
Suppose a network partition occurs and leader A is in the minority partition. Raft will elect a new leader B, but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error in detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
* Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this "stale".
* Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms "consistent", and what Consul does by default.
* Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this "consistent".
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
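A tiny, hypothetical illustration of that quorum-intersection requirement for the two-node case described above:

```go
// Tiny illustration of the FPaxos requirement that every leader-election
// quorum intersects every replication quorum. In the two-node example above,
// election requires both nodes while replication needs only one, and the
// intersection property still holds. Node names are made up.
package main

import "fmt"

// intersects reports whether two quorums (sets of node names) share a member.
func intersects(a, b map[string]bool) bool {
	for n := range a {
		if b[n] {
			return true
		}
	}
	return false
}

func main() {
	electionQuorum := map[string]bool{"node-1": true, "node-2": true} // both nodes
	replicationQuorumA := map[string]bool{"node-1": true}             // either single node
	replicationQuorumB := map[string]bool{"node-2": true}
	fmt.Println(intersects(electionQuorum, replicationQuorumA)) // true
	fmt.Println(intersects(electionQuorum, replicationQuorumB)) // true
}
```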
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers, it maintains a staleness measure: how long has it been since it received an okay from a quorum? The leader checks this staleness measure and, if it is too high, returns a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems: let the low-level Raft system have higher availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the Raft system, but my favorite is to return the staleness measure in the response. This can be improved further if the client includes its timestamp in the request, the Raft server echoes it back in the response, and the client adds the round-trip time to the Raft staleness. (NB: always use a monotonic/nanosecond clock when measuring time differences, because it doesn't go backwards the way the system clock can.)
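A minimal sketch of that staleness measure, with made-up names (this is not any particular Raft library's API):

```go
// Hypothetical sketch of the staleness measure described above: the leader
// records when it last heard an OK from a quorum and rejects reads that
// would be staler than the caller tolerates. All names are illustrative.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrStalenessExceeded = errors.New("staleness exceeded")

type Leader struct {
	mu            sync.Mutex
	lastQuorumAck time.Time // updated on every heartbeat round acked by a quorum
}

// OnQuorumAck is called by the heartbeat loop whenever a majority responds.
func (l *Leader) OnQuorumAck() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lastQuorumAck = time.Now()
}

// Read serves a local read only if the leader's view is fresh enough;
// otherwise it returns ErrStalenessExceeded so the caller can fail over.
// time.Since uses Go's monotonic clock, per the NB in the text.
func (l *Leader) Read(maxStaleness time.Duration) (string, error) {
	l.mu.Lock()
	staleness := time.Since(l.lastQuorumAck)
	l.mu.Unlock()
	if staleness > maxStaleness {
		return "", fmt.Errorf("%w: %v since last quorum ack", ErrStalenessExceeded, staleness)
	}
	return "value-from-local-state-machine", nil
}

func main() {
	l := &Leader{}
	l.OnQuorumAck()
	v, err := l.Read(200 * time.Millisecond)
	fmt.Println(v, err)
}
```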
Not sure whether a timeout configuration can solve this problem:
2 x heartbeat interval <= election timeout
which means that when a network partition happens, writes on leader A will fail: A is the only node in the minority partition, cannot get acknowledgments from a majority of nodes, and steps back to being a follower.
After that, leader B is elected; it has the latest committed change (which is present on at least one of the followers in its majority), and the client can then perform reads and writes on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explanation
Let's understand it with a code example from HashiCorp's Raft implementation.
There are 2 timeouts involved (their names are self-explanatory, but links are included for the detailed definitions):
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
A required condition for a node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it keeps checking whether it is heartbeating with a quorum of followers [6,7]. If heartbeats stop, this is tolerated until the LeaderLease timeout [8]. If the Leader is unable to contact a quorum of nodes for the LeaderLease timeout, the Leader node has to step down and become a Follower [9].
Hence, for the example given in the question, Node-A must step down as Leader before Node-B can become Leader. Since Node-A knows it is not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request. (A simplified sketch of this lease check follows the references below.)
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894
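To make the mechanism concrete, here is a simplified, hypothetical sketch of the lease check the references above point at (the real logic is in hashicorp/raft at links [6]-[9]; this is not that code):

```go
// Simplified sketch of the leader-lease check referenced above: if a quorum
// has not been contacted within LeaderLease timeout, the leader steps down
// before another node can win an election (LeaderLease timeout < Election
// timeout). Values mirror the example in the text.
package main

import (
	"fmt"
	"time"
)

const (
	leaderLeaseTimeout = 500 * time.Millisecond  // must be < electionTimeout
	electionTimeout    = 1000 * time.Millisecond // minimum time before a new election
)

// shouldStepDown decides whether the leader must become a follower, given
// the last time each follower acknowledged a heartbeat. clusterSize includes
// the leader itself, which always counts as contacted.
func shouldStepDown(lastContact map[string]time.Time, clusterSize int, now time.Time) bool {
	contacted := 1 // the leader counts itself
	for _, t := range lastContact {
		if now.Sub(t) <= leaderLeaseTimeout {
			contacted++
		}
	}
	quorum := clusterSize/2 + 1
	return contacted < quorum
}

func main() {
	now := time.Now()
	// Partitioned from both followers for longer than the lease: step down.
	lastContact := map[string]time.Time{
		"node-B": now.Add(-2 * time.Second),
		"node-C": now.Add(-2 * time.Second),
	}
	fmt.Println(shouldStepDown(lastContact, 3, now)) // true
}
```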
In the Raft paper, they mention that all client interaction happens with the leader node. What I don't understand is that the leader keeps changing. So let's say my cluster is behind a load balancer. How do I notify the load balancer that the leader has changed? Or, if I'm correct, can the load balancer send a client request to any node (follower or leader), with the follower node being responsible for forwarding the request to the leader?
After the voting finishes, you will have a leader (new or old). It is the responsibility of the leader to notify all the nodes in the network by sending heartbeats at a regular interval (smaller than the network keep-alive time but larger than the maximum round-trip time).
Your load balancer should update its notion of the leader every time it receives a heartbeat. The load balancer will send data only to the leader, since according to the Raft algorithm all client requests go directly to the leader; other nodes do not accept data, only votes and acknowledgments to append commands.
There is a really good presentation on this: Raft: Log-Replication
There are really two ways this can be done: either the load balancer needs to understand where the leader is or the followers can proxy requests to the leader.
There's nothing wrong with proxying client requests to the leader through a follower, and in fact there can be major benefits to it. Many Raft implementations allow clients to read from followers while maintaining sequential consistency. This can still be safely done with a load balancer sending requests to arbitrary nodes, provided the client keeps track of the last index it has seen and sends that with each request, ensuring it does not see state go back in time (a rough sketch of this index tracking appears after this answer). I won't write the full algorithm here, but it is described in the Raft dissertation, which you should consult.
But using a load balancer in this manner can become unsafe in certain cases. If clients are allowed to send multiple concurrent requests, the load balancer could route those requests through different nodes and they could arrive at the leader out of order. This can be accounted for by attaching a sequence number to client requests and reordering requests on the leader. But to do so, the implementation has to include sessions to allow the leader to track per-client state.
Typically, though, Raft clients connect to specific nodes and stay connected to them for as long as possible to reduce the overhead of maintaining consistency while switching servers. If an implementation supports reading from followers, it can still be costly to switch servers since servers have to wait for state to catch up to maintain sequential consistency.
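As a rough sketch of that client-side index tracking (all request/response types here are hypothetical; the full algorithm is in the Raft dissertation):

```go
// Rough sketch of the "client tracks the last index it has seen" idea used
// to keep reads sequentially consistent when a load balancer may route
// requests to different followers. All types here are hypothetical.
package main

import "fmt"

type ReadRequest struct {
	Key      string
	MinIndex uint64 // serving node must have applied at least this log index
}

type ReadResponse struct {
	Value        string
	AppliedIndex uint64 // index of the last entry applied by the serving node
}

type Client struct {
	lastSeenIndex uint64
}

// Read attaches the last index this client has observed; the serving node is
// expected to wait until it has caught up to MinIndex (or redirect) so the
// client never sees state go backwards in time.
func (c *Client) Read(send func(ReadRequest) ReadResponse, key string) string {
	resp := send(ReadRequest{Key: key, MinIndex: c.lastSeenIndex})
	if resp.AppliedIndex > c.lastSeenIndex {
		c.lastSeenIndex = resp.AppliedIndex
	}
	return resp.Value
}

func main() {
	c := &Client{}
	// Stand-in for a real network call to whichever node the balancer picks.
	fake := func(r ReadRequest) ReadResponse {
		return ReadResponse{Value: "v1", AppliedIndex: r.MinIndex + 1}
	}
	fmt.Println(c.Read(fake, "config")) // client's lastSeenIndex advances to 1
}
```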
I have read the Raft paper and have a question related to the sequence of operations Raft executes upon receiving a client request:
In order to overcome a single point of failure, Raft relies on maintaining a replicated log on other machines; the algorithm also uses a consensus module for the overall log management. The sequence of operations works as follows:
The client request is received by the leader, which appends the command to its log.
The leader sends AppendEntries RPCs to its followers to replicate the command in their local logs, and waits for acknowledgments from a majority of the followers that the entry has been successfully appended to their local logs.
Once acknowledgments have been received that the entry has been successfully logged by a majority of the followers, the entry is committed and applied to the leader's state machine, causing a transition to happen, and the output of that transition is returned to the client.
Ultimately, the leader notifies followers of committed entries in subsequent AppendEntries RPCs.
If the above understanding is correct, then I can claim that the client request is held for a bit of time while the replication process completes. I may also claim that the success of a client request is heavily dependent on the success of the replication process (since the client command/request is not executed on the leader's machine until a majority acknowledgment has been received). The question is: how long is it expected to take, on average, for a client request to receive a response after the replication procedure completes, and does that work efficiently for real-time systems?
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed suggests that systems such as Raft requesting the Consistency and Availability parts of the CAP theorem's trinity will suffer performance limits. You may also be interested in https://pdfs.semanticscholar.org/7c45/54d064128043897ea2226021f6fda4c64251.pdf (A review of experiences with reliable multicast, by Birman), which describes experience with reliable multicast groups in high assurance systems such as air traffic control.
My takeaway from this is that a real system may want to be very careful about what information it guards with Raft, Paxos, and friends, and what it can guard less tightly. The other point of view is to go for a very sophisticated implementation of Paxos, such as Google Spanner, so that programmers don't have to worry about the problems of non-ACID systems.
If above understanding is correct, then I can claim that the client request is being held for a bit of time for the replication process to complete
Correct, the leader of the current term will acknowledge a client request only after the command has been replicated to a majority of nodes in the cluster.
I may also claim that the success of a client request is heavily dependent on the success of the replication process
That's also correct. At least a majority of nodes in the cluster (including the leader) need to be available and responsive in order for the command to be replicated successfully and for the leader to acknowledge the request.
how long it is expected to take on average for a client request to receive a response after the replication procedure is completed
That depends on the topology of your network. The latency of the response to a client request will be composed of the following parts (assuming no leader crashes):
* the latency required for the client request to be transmitted between the client and the leader.
* the latency of an AppendEntries request from the leader to the followers to replicate the entry (sent in parallel to all the followers)
* the latency of an AppendEntries response from the followers to the leader.
* the time required by the leader to apply the command to its state machine (i.e. a disk write in the best case)
* the latency of the client response to be transmitted from the leader to the client
The latency of the various messages depends on the distance between the nodes, but it will probably be on the order of tenths of a millisecond to hundreds of milliseconds.
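As a back-of-the-envelope example with entirely made-up single-datacenter numbers, the components listed above might add up as follows:

```go
// Back-of-the-envelope sum of the latency components listed above, with
// entirely made-up numbers for a single-datacenter cluster.
package main

import (
	"fmt"
	"time"
)

func main() {
	clientToLeader := 1 * time.Millisecond // client request -> leader
	appendEntries := 1 * time.Millisecond  // leader -> followers (sent in parallel)
	followerAck := 1 * time.Millisecond    // followers -> leader
	applyToDisk := 5 * time.Millisecond    // leader applies the command (disk write)
	leaderToClient := 1 * time.Millisecond // response back to the client

	total := clientToLeader + appendEntries + followerAck + applyToDisk + leaderToClient
	fmt.Println("expected response latency:", total) // ~9ms under these assumptions
}
```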
also does that work efficiently for real-time systems?
It depends on what your requirements are for your specific case. But in general, real-time systems require latencies that are under a few milliseconds, so the answer is most likely no. Also, keep in mind that during periods of crashes and instability, when new leader elections happen, latency can increase significantly.
As the paper says:
Election Safety: at most one leader can be elected in a given term. §5.2
However, there may be more than one leader in the system. Raft can only promise that there is one leader in a given term. So if I have more than one client, wouldn't I get different data? How does this allow Raft to be a consensus algorithm?
Is there something I don't understand here, that someone could explain?
Only a candidate node which has a majority of votes can lead. Only one majority exists in the cluster, so another node cannot hear from a majority without contacting at least one node which has already voted for another leader. A candidate who hears of the other leader will step down. Here is a nice animation which shows how it happens: http://thesecretlivesofdata.com/raft/#election
Yes, you are right. There can be multiple leaders at the same time, but not in the same term, so the guarantee still holds. A possible situation: in a 3-server cluster (A, B, C), A is elected. Then a network partition happens and the cluster is separated into two partitions, {A} and {B, C}. In this case, A would not step down, as it does not receive any RPC with a higher term, and it remains a leader. In the majority partition, a new leader can still be elected. But notice that this new leader is in a greater term than A.
Then how about the request from the client? Two cases.
1. For a WRITE request, the leader cannot reply to the client unless the log entry is committed, which is impossible for the outdated leader. So, no problem: only the true leader is able to commit the entry by replicating it on a majority of servers.
2. For a READ-ONLY request, the leader can get away without consulting the log or committing the entry. You are right and this is explicitly mentioned in the paper at the end of section 8.
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Hope this helps.
This is the question I asked three years ago. Right now, I can answer it myself.
The key point here is that even for a read operation, the client needs to go through the Raft protocol. If the client requests the old leader, the old leader issues AppendEntries to confirm whether it is still the newest leader. It will either notice that it is an old leader, or the AppendEntries will time out; either way, it returns a timeout or error to the client.
Every machine in the cluster compares its current term against the term it receives with every request it gets from other machines. Whenever a stale "leader" tries to act as a leader, it will not get a majority of accepts from the rest of the cluster, since the majority of machines have a greater term than that "leader". That guarantees that only the actual leader will be able to reply to client requests.
Additionally, according to Raft, this "leader" will become a follower immediately after it receives a reject with a greater term.
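A minimal sketch of that term rule, not tied to any specific implementation:

```go
// Minimal sketch of the term rule described above: a node that receives any
// RPC or reply carrying a term greater than its own updates its term and,
// if it was acting as leader, immediately steps down to follower.
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Node struct {
	currentTerm uint64
	role        Role
}

// onRPC applies the "greater term wins" rule to any incoming request or reply.
func (n *Node) onRPC(peerTerm uint64) {
	if peerTerm > n.currentTerm {
		n.currentTerm = peerTerm
		n.role = Follower // a stale "leader" demotes itself here
	}
}

func main() {
	stale := &Node{currentTerm: 3, role: Leader}
	stale.onRPC(4) // a reject (or any message) from term 4 arrives
	fmt.Println(stale.role == Follower, stale.currentTerm) // true 4
}
```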