How do I practically use Raft algorithm - cluster-computing

In the Raft paper, they mentioned that all the client interaction happens with the leader node. What I don't understand is that the leader keeps changing. So let's say my cluster is behind a load balancer. How do I notify the load balancer that the leader has changed? Or if I'm correct, is it that load balancer can send out client request to any of the node (follower or leader) and it is the responsibility of the follower node to send the request to the leader?

After the voting finish, you will have a leader (new or old). It is the responsibility of the leader to notify all the nodes in the network to send heartbeats at a regular interval(smaller than the keep-alive time of network but bigger than the maximum round trip time) to all the nodes.
Your load balancer should update the leader every time it gets heartbeats. Load balancer will send data only to the leader as according to raft algorithm all client requests directly goes to leader only, other nodes can't send data but only acknowledgments to voting and append commands.
There is a really good presentation here on the same:- Raft: Log-Replication

There are really two ways this can be done: either the load balancer needs to understand where the leader is or the followers can proxy requests to the leader.
There's nothing wrong with proxying client requests to the leader through a follower, and in fact there can be major benefits to it. Many Raft implementations allow clients to read from followers while maintaining sequential consistency. This can still be safely done with a load balancer sending requests to arbitrary nodes if the client keeps track of the last index it has seen and sends that with each request to ensure it does not see state go back in time. I won't write the full algorithm here, but this is described in the Raft dissertation which you should consult.
But using a load balancer in this manner can become unsafe in certain cases. If clients are allowed to send multiple concurrent requests, the load balancer could route those requests through different nodes and they could arrive at the leader out of order. This can be accounted for by attaching a sequence number to client requests and reordering requests on the leader. But to do so, the implementation has to include sessions to allow the leader to track per-client state.
Typically, though, Raft clients connect to specific nodes and stay connected to them for as long as possible to reduce the overhead of maintaining consistency while switching servers. If an implementation supports reading from followers, it can still be costly to switch servers since servers have to wait for state to catch up to maintain sequential consistency.

Related

Is Akka cluster a good choice for running multiple socket clients

I need to run 1000+ socket clients at several nodes. Each client listens to unique endpoint.
https://doc.akka.io/docs/akka-http/current/client-side/websocket-support.html
I plan to represent each client as actor with supervision, recovery, event handling e.t.c. and hope that akka cluster will
spawn my actors where each actor represents socket client
supervision policies will help to recover actors from failing states
evenly distribute actors across several nodes
gives me guarantee that I have at most once actor for each client
akka messaging will help me to enable communication between actors with at least once semantics.
What else should I expect?
In general the Akka ecosystem has some reasonable patterns for what you're looking for:
actor per client is a reasonable starting point; supervision helps a lot
cluster sharding will evenly distribute actors across the cluster nodes and generally maintain at most one incarnation per sharding ID (you thus need to map each client to a distinct sharding ID). In a split-brain scenario, there is the chance that you'd get the same client on both sides of a network partition: the split-brain resolver configuration governs the consistency vs. availability trade-off (basically it's up to you whether you'd rather have the failure mode be no active instance of a client or multiple active instances of a client).
For ensuring that when a shard moves in a rebalancing, the clients in that shard are started on the new node, the two options are "remember entities" in cluster sharding and having a cluster singleton which regularly pings each client actor. If the set to keep-alive is reasonably static, I tend to prefer the singleton approach.
Akka's basic message semantics are at-most once, but there are well established and well documented patterns (most notably the ask pattern) for at-least-once.

ZooKeeper TIme Synchronization

How does ZooKeeper deal with time ?
Are the Znodes/Clients synchronized ? and How?
Otherwise, how does the algorithm work without time Synchronization?
I see relative Question here, but it does not answer my question
How does Zookeeper synchronize the clock in the cluster
Thanks in Advance
As you may have heard, zookeeper elects a leader for the ensemble. Thereafter, all the write requests go through the leader.
Therefore, leader is responsible for preserving the order of write requests. (Yes, the order is determined by the time at which a request reaches the leader). When all the write requests are served by the leader, no need to worry about synchronization right? Zookeeper doesn't depend on synchronizations of clocks.
How the leader is transmitting new values to the followers is another problem which is addressed through consensus algorithm, ZAB (Zookeeper Atomic Broadcast). This protocol make sure that the majority of the ensemble have updated the new value before sending OK response to the write request.

How do Raft guarantee consistency when network partition occurs?

Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed
before processing a read-only request (its information
may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat
messages with a majority of the cluster before responding
to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers it maintains a staleness measure: how long it has been since it has received an okay from a quorum? The leader will check this staleness measure and return a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last can be improved when the client includes its timestamp in the request, the raft server echos it back in the response and the client adds the round trip time to raft staleness. (NB. Always use the nano clock in measuring time differences because it doesn't go backwards like the system clock does.)
Not sure whether timeout configure can solve this problem:
2 x heartbeat interval <= election timeout
which means when network partition happens leader A is single leader and write will fail because leader A locates in the minority and leader A can not get echo back from majority of the node and step back as a follower.
After that leader B is selected, it can catch up the latest change from at least one of the followers and then client can perform read and write on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explaination
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved: (their names are self explanatory but link has been included to read detailed definition.)
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
Must condition for node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked "whether it is heartbeating with quorum of followers or not"[6, 7]. If heartbeat stops then its tolerated till LeaderLease timeout[8]. If Leader is not able to contact quorum of nodes for LeaderLease timeout then Leader node has to become Follower[9]
Hence for example given in question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows its not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

Raft Algorithm Normal Operations

I have read the Raft algorithm paper's and got a question related to the sequence of operations Raft executes upon receiving a client request:
In order to overcome a single point of failure scenario, Raft relies on maintaining a replicated log on other machines, the algorithm also consults a consensus module for the complete logging management. The sequence of operations work as follow:
Client request is received at the leader's state machine, leader appends command to its log.
The leader sends AppendEntries RPCs to his followers to clone the command in their local logs', and waits for an acknowledgment from majority of the followers that the entry has been successfully appended to their local log file.
Once an acknowledgment has been received that the request has been successfully logged in majority of the followers logs', then the request is committed to the leader's state machine causing a transition to happen, returning back the output of that transition to the client.
Ultimately, the leader notifies followers of committed entries in subsequent AppendEntries RPCs.
If above understanding is correct, then I can claim that the client request is being held for a bit of time for the replication process to complete, also I may also claim that the success of a client request is heavily dependent on the success of the replication process (since the client command / request is not executed on the leader's machine until a majority acknowledgment has been received). The question is, how long it is expected to take on average for a client request to receive a response after the replication procedure is completed, also does that work efficiently for real-time systems?
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed suggests that systems such as Raft requesting the Consistency and Availability parts of the CAP theorem's trinity will suffer performance limits. You may also be interested in https://pdfs.semanticscholar.org/7c45/54d064128043897ea2226021f6fda4c64251.pdf (A review of experiences with reliable multicast, by Birman), which describes experience with reliable multicast groups in high assurance systems such as air traffic control.
My takeaway from this is that a real system may want to be very careful about what information it guards with Raft, Paxos, and friends, and what it can guard less tightly. The other point of view is to go for a very sophisticated implementation of Paxos, such as Google Spanner, so that programmers don't have to worry about the problems of non-ACID systems.
If above understanding is correct, then I can claim that the client request is being held for a bit of time for the replication process to complete
Correct, the leader of the current term will acknowledge a client request only after the command has been replicated to majority of nodes in the cluster.
I may also claim that the success of a client request is heavily dependent on the success of the replication process
That's also correct. At least of majority of nodes in the cluster (including the leader) need to be available and responsive, in order for the command to be replicated successfully and the leader to acknowledge the request.
how long it is expected to take on average for a client request to receive a response after the replication procedure is completed
That depends on the topology of your network. The latency of the response to a client request will be composed of the following parts (assuming no leader crashes):
* the latency required for the client request to be transmitted between the client and the leader.
* the latency of an AppendEntries request from the leader to followers to replicate the entry (sent in parallel to all the followers.
* the latency of an AppendEntries response from the followers to the leader.
* the time required by the leader to apply the command to its state machine (i.e. a disk write in the best case)
* the latency of the client response to be transmitted from the leader to the client
The latency of the various messages depends on the distance between nodes, but it will probably be in the order of tenths to hundreds of milliseconds.
also does that work efficiently for real-time systems?
It depends on what are your requirements for your specific case. But in general, real-time systems require latencies that are under a few milliseconds, so the answer is most likely no. Also, keep in mind that during periods of crashes and instability where new leader elections happen, latency can increase significantly.

Real world example of Paxos

Can someone give me a real-world example of how Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation etc..). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of Paxos protocol? How is when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.
For example, we have MapReduce system where master consists of 3 hosts. One is master and others are slaves. The procedure of choosing master uses Paxos algorithm.
Also Chubby of Google Big Table uses Paxos: The Chubby Lock Service for Loosely-Coupled Distributed Systems, Bigtable: A Distributed Storage System for Structured Data
The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.
The Coordinator is the node the transaction originated on
Participants are the nodes that modified the database on behalf of
the transaction Readers are nodes that executed code on behalf of the
transaction but did not modify any state
Acceptors are the nodes that log the state of the transaction.
The following steps are taken when performing a transaction commit:
Coordinator sends a PREPARE message to each Participant.
The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
Coordinator sends ACCEPT messages to Acceptors.
The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
Coordinator tells the user the commit succeeded.
Coordinator sends COMMIT messages to each Participant and Reader.
The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
Coordinator removes internal state and is now done.
This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application level would need to do is perform exception handling for deadlock conflicts. The other key to implementing a database at scale is concurrency, which is generally helped via MVCC (Multi-Version concurrency control).
Can someone give me a real-world example of how Paxos algorithm is
used in a distributed database?
MySQL uses Paxos. This is why a highly available MySQL setup needs three servers. In contrast, a typical Postgres setup is a master-slave two-node configuration which isn't running Paxos.
I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka multi-Paxos) is optimally efficient in terms of messages as in a three node cluster, in steady state, the leader accepts it's own next value, transmits to both of the other two nodes, and knows the value is fixed when it gets back one response. It can then put the commit message (the learning message) into the front of the next value that it sends.
A simple example could be a banking application where an account is
being modified through multiple sessions (i.e. a deposit at a teller,
a debit operation etc..). Is Paxos used to decide which operation
happens first?
Yes if you use a MySQL database cluster to hold the bank accounts then Paxos is being used to ensure that the replicas agree with the master as to the order that transactions were applied to the customer bank accounts. If all the nodes agree on the order that transactions were applied they will all hold the same balances.
Operations on a bank account cannot be reordered without coming up with different balances that may violate the business rules of not exceeding your credit. The trivial way to ensure the order is to just use one server process that decides the official order simply based on the order of the messages that it receives. It can then track the balances of each bank account and enforce the business rules. Yet you don't want just a single server as it may crash. You want replica servers that are also receiving the credit and debit commands and agree with the master.
The challenge with having replicas that should hold the same balances are that messages may be lost and resent and messages are buffered by switches that may deliver some messages late. The net effect is that if the network is unstable it is hard to prove that fast replication protocols will never cause different servers to see that the messages arrived in different orders. You will end up with different servers in the same cluster holding different balances.
You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master, one or more slaves, and the master waits until it has got acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented database vendors just created expensive hardware designed to have very high redundancy and reliability to run master-slave. What was revolutionary about Paxos is that it does work with commodity networking and without specialist hardware.
Since banking applications were profitable with expensive custom hardware it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware with built-in reliable networking that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking and Paxos will make it reliable rather than using specialist hardware.
Also, what does one mean by multiple instances of Paxos protocol? How
is when is this used?
I wrote a blog about multi-Paxos being the original Paxos protocol. Simply put, in the case of choosing the order of transactions in a cluster, you want to stream the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog about Paxos for cluster replication the algorithm is very efficient in steady-state needing only one round trip between the master and enough nodes to have a majority which is one other node in a three node cluster. When there are crashes or network issues the algorithm is always safe but needs more messages. So to answer your question typical applications need multiple rounds of Paxos to establish the order of client commands in the cluster.
I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details to do cluster replication. So we can expect that people who are specifically trying to implement cluster replication would use Raft as it leaves nothing for the implementor to have to figure out for themselves.
So when might you use Paxos? It can be used to change the cluster membership of a cluster that is writing values based on a different protocol that can only be correct when you know the exact cluster membership. Corfu is a great example of that where it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.

Resources