Simple leader election (Stateless leader election) - go

I am building an app in Go that I would like to be fault-tolerant. I looked at different algorithms like Raft and Paxos and their implementations in Go (etcd's raft, HashiCorp's raft), but I feel like they might be overkill for my specific use case.
In my application, the nodes just wait in standby and act as failovers in case the leader fails. I do not need to replicate any states throughout the cluster. All I need is the following properties:
If a node is a leader:
  - Run a given piece of code
If a node is not a leader:
  - Wait for the leader to fail
  - Re-elect a leader once the existing leader fails
Any suggestions?

Since you want a leader election protocol, it sounds like you want to avoid having more than one node acting as the leader at once. The answer really depends on how strictly you require this property. In some cases it is acceptable to occasionally have more than one node acting as the leader; perhaps the worst that happens is a bit of duplicated work. In other cases the whole system may operate incorrectly if there are ever duplicate leaders, so you must be much more careful.
If you can accept occasional cases of duplicate leaders then a simpler protocol may be for you. However, if you absolutely cannot tolerate having more than one leader at once then you will have to combine your leader election protocol with some kind of replication of state, and a proven implementation of Paxos or Raft or similar is a very good way to do this. There are many subtly different protocols for this, but they're all basically doing the same thing.
The fundamental problem here is pinning down what "at once" means in a realistic network in which messages may sometimes be delivered after a very long delay. Typically one assumes that the network is completely asynchronous with no time bounds on delivery, and indeed Paxos, Raft etc. are all designed to work correctly under that assumption. These algorithms work around this by defining their own internal notion of time (ballots in Paxos, terms in Raft) and attaching this "internal time" to all state transitions under their control. This gives some very strong guarantees and, in particular, ensures that no two nodes may take actions as leader at the same "internal time".
If you don't replicate any state via something like Paxos or Raft then you won't be able to make use of this strong notion of internal time.
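To make the idea of "internal time" concrete, here is a minimal sketch in Go. It is not taken from Paxos or Raft: the `Resource` type and its `Act` method are hypothetical names invented for illustration, and the sketch assumes that some election mechanism already guarantees at most one leader per term. It only shows how stamping every action with a term lets the receiving side reject a deposed leader whose messages arrive late.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleTerm is returned when an action carries a term older than one already seen.
var ErrStaleTerm = errors.New("stale term: a newer leader has already acted")

// Resource is a hypothetical shared resource guarded by term numbers.
type Resource struct {
	mu          sync.Mutex
	highestTerm uint64
	log         []string
}

// Act applies an action only if its term is at least the highest term seen so far.
func (r *Resource) Act(term uint64, action string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if term < r.highestTerm {
		return ErrStaleTerm
	}
	r.highestTerm = term
	r.log = append(r.log, action)
	return nil
}

func main() {
	r := &Resource{}
	fmt.Println(r.Act(1, "old leader writes"))            // accepted
	fmt.Println(r.Act(2, "new leader writes"))            // accepted
	fmt.Println(r.Act(1, "old leader, delayed message")) // rejected as stale
}
```

Without replicated state there is nothing that hands out such terms reliably, which is the limitation described above.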

If you will be deploying in a Kubernetes cluster, you can use the Kubernetes Go client library for this use case.
https://github.com/kubernetes-client/go

Related

Which consensus algorithm is synchronous in nature?

There are different consensus algorithms used in permissioned blockchains, such as
PAXOS
RAFT
Byzantine General Model
Which of these consensus algorithms are synchronous, which are asynchronous, and why? Please explain in detail.
Thanks
I am not an expert on distributed systems, but I will try to answer your question.
In distributed systems, people use an underlying model that assumes some properties about time (“how long will it take for this message to arrive?”) and some properties about the types of faults (“how can nodes in the protocol do the wrong thing?”).
There are three main timing models used for distributed systems: the synchronous model, the asynchronous model and the partially synchronous model. Each of these models makes some guarantees about the length of time (“latency”) that can occur between the exchange of messages amongst nodes in a given round of the protocol execution. This categorization is important because in the distributed setting a single node cannot distinguish between a peer node that has failed and a peer node that is just taking a long time to respond.
In the synchronous model, there is some maximum value (“upper bound”) T on the time between when a node sends a message and when you can be certain that the receiving node hears the message. You also have an upper bound P on the relative difference in speed between nodes (so you can account for machines with slow processors).
In the asynchronous model, we remove both upper bounds T and P. Messages can take arbitrarily long to reach peers and each node can take an arbitrary amount of time to respond. When we say arbitrary, we include “infinity” meaning that it takes forever for some event to occur.
The partially synchronous model is a mix of the two: upper bounds exist for T and P, but the protocol designer does not know them, and the task is to design mechanisms that still reach consensus in light of this fact. In practice, protocol implementers can achieve systems resembling this model given the realistic characteristics of modern networks and machines (messages usually get where they are going) and the use of tactics like timeouts to indicate when a node should retry sending a message.
Keeping the above in mind, both Paxos and Raft belong to the partially synchronous model.
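As an illustration of the timeout tactic under partial synchrony, here is a small sketch in Go of an adaptive failure detector for a single peer. The `Detector` type and its methods are hypothetical, and doubling the timeout after a wrong suspicion is just one common way to cope with an unknown bound; this is not how Paxos or Raft are specified.

```go
package main

import (
	"fmt"
	"time"
)

// Detector is a hypothetical timeout-based failure detector for a single peer.
type Detector struct {
	timeout   time.Duration
	lastHeard time.Time
	suspected bool
}

// Heartbeat records a message from the peer. If the peer had been suspected,
// the suspicion was wrong, so the timeout is doubled to adapt to the unknown bound.
func (d *Detector) Heartbeat(now time.Time) {
	if d.suspected {
		d.suspected = false
		d.timeout *= 2
	}
	d.lastHeard = now
}

// Check marks the peer as suspected if nothing was heard within the timeout.
// A suspected peer may be dead or merely slow; the detector cannot tell which.
func (d *Detector) Check(now time.Time) bool {
	if now.Sub(d.lastHeard) > d.timeout {
		d.suspected = true
	}
	return d.suspected
}

func main() {
	start := time.Now()
	d := &Detector{timeout: time.Second, lastHeard: start}
	fmt.Println(d.Check(start.Add(2 * time.Second))) // peer is now suspected
	d.Heartbeat(start.Add(3 * time.Second))          // peer was only slow
	fmt.Println(d.timeout)                           // timeout has been doubled
}
```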
The Byzantine Generals’ Problem is a classic problem faced by any distributed computer system. The aim is to maintain the same state on all participating nodes in the presence of malicious nodes.
In distributed systems, there is a collection of hard problems that you constantly need to deal with.
1. Things fail. You can never count on anything being reliable. Even if you have perfectly bug-free software, and hardware that never breaks, you’ve still got to deal with the fact that network connections can break, or messages within a network can get lost, or that some bozo might sever your network connection with a bulldozer. (That really happened while I was at Google!)
2. Given (1), you can never rely on one copy of anything, because that copy might become unavailable due to a failure. So you need to keep multiple copies, and those copies need to be consistent – meaning that at any time, all of the copies agree about their contents.
3. There’s no way to maintain a single completely consistent view of time between multiple computers. Due to inconsistencies in individual machine performance, and variable network delays, variable storage latency, and several other factors, there’s no canonical way of saying that for two events X and Y, “X happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.
In short, everything can fail at any time; after failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way (Byzantine failures may be caused by malware, for example).
To solve this problem we have consensus algorithms, whose aim is to make all participants agree on the same state.
Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available.
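Paxos and Raft are consensus algorithms for this crash-fault setting, in public or private distributed networks. Note that they do not solve the Byzantine Generals’ Problem: tolerating malicious nodes requires a Byzantine fault-tolerant protocol.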

Consul support or alternative for 2 nodes

I want to use Consul for a 2-node cluster. The drawback is that there's no failure tolerance with two nodes:
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to get consistent leader election with only two nodes? Can Consul's Raft consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer the protocol question: no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine - it will be available and make progress like any other cluster. It's just that when one machine goes down, because your fault tolerance is zero, you will lose availability and no longer make progress. But safety is never compromised - your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
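The 2f + 1 rule can be checked with a few lines of arithmetic. The following Go sketch (the function names are made up for this example) computes the majority quorum size and the number of tolerated failures for small clusters, assuming simple majority quorums as in Raft.

```go
package main

import "fmt"

// quorumSize returns the size of a simple majority of n nodes.
func quorumSize(n int) int { return n/2 + 1 }

// faultTolerance returns how many nodes may fail while a majority remains.
func faultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n",
			n, quorumSize(n), faultTolerance(n))
	}
}
```

For two nodes the quorum is 2 and the tolerated failures are 0, which is why adding a third node is the first step that buys any fault tolerance.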
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as it already is in Raft), but only require a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical Flexible Paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
As for changing the protocol, there are impossibility results that apply here. The FLP result shows that no deterministic protocol can guarantee consensus in an asynchronous system if even one node may fail, and tolerating f fail-stop failures requires at least 2f + 1 nodes. With fewer nodes, safety can still be provided, but progress (liveness) cannot be ensured.
I think the options suggested in the earlier post are the best.
The leader election approach described in Consul's own documentation requires 3 nodes. It relies on the health-check mechanism as well as on sessions. Sessions are essentially distributed locks that are released automatically by TTL or when the service crashes.
To build a 2-node Consul cluster we have to use another approach, usually called a leader lease. Since we already have Consul's KV storage with CAS support, we can simply write to it which machine is the leader until a given time. As long as the leader is alive and well, it can periodically extend its time. If the leader dies, another node will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and, whenever the leader performs an action, to verify that it has enough time left to complete that action.
A key is created in the KV storage, containing something like “node X is the leader until time Y”, where Y is calculated as the current time plus some time interval (T). As leader, node X updates the record once every T/2 or T/3 units of time, thereby extending its leadership. If the node fails or cannot reach the KV storage, then after the interval (T) its place will be taken by whichever node first discovers that the leadership role has been released.
CAS is needed to prevent a race condition if the two nodes simultaneously try to become leader. The cas parameter tells Consul to use a Check-And-Set operation, which is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key.
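The lease scheme above can be sketched in Go as follows. To keep the example self-contained it does not talk to Consul at all: `Store` is a hypothetical in-memory stand-in that only imitates the Check-And-Set rule just described (index 0 means "create only", non-zero means "must match ModifyIndex"), and `TryLead` is an invented helper name. A real deployment would replace `Store` with calls to the Consul KV API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry mirrors the parts of a KV pair used here: the lease and a ModifyIndex.
type entry struct {
	leader      string
	expires     time.Time
	modifyIndex uint64
}

// Store is an in-memory stand-in for a KV store with Check-And-Set semantics.
type Store struct {
	mu    sync.Mutex
	cur   *entry
	index uint64
}

// Get returns a copy of the current lease record, or nil if none exists.
func (s *Store) Get() *entry {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur == nil {
		return nil
	}
	c := *s.cur
	return &c
}

// CAS writes the record only if cas matches: 0 means "key must not exist",
// non-zero means "the key's ModifyIndex must equal cas".
func (s *Store) CAS(cas uint64, leader string, expires time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if (cas == 0 && s.cur != nil) || (cas != 0 && (s.cur == nil || s.cur.modifyIndex != cas)) {
		return false
	}
	s.index++
	s.cur = &entry{leader: leader, expires: expires, modifyIndex: s.index}
	return true
}

// TryLead acquires or renews the lease for node and reports whether node is leader.
func TryLead(s *Store, node string, now time.Time, ttl time.Duration) bool {
	e := s.Get()
	switch {
	case e == nil:
		return s.CAS(0, node, now.Add(ttl))
	case e.leader == node || now.After(e.expires):
		return s.CAS(e.modifyIndex, node, now.Add(ttl))
	default:
		return false // another node holds an unexpired lease
	}
}

func main() {
	s := &Store{}
	t0 := time.Now()
	ttl := 10 * time.Second
	fmt.Println(TryLead(s, "X", t0, ttl))                     // X acquires the lease
	fmt.Println(TryLead(s, "Y", t0.Add(3*time.Second), ttl))  // Y is refused
	fmt.Println(TryLead(s, "X", t0.Add(5*time.Second), ttl))  // X renews at T/2
	fmt.Println(TryLead(s, "Y", t0.Add(20*time.Second), ttl)) // lease expired, Y takes over
}
```

If two nodes read the same expired record and both attempt the write, only the first CAS succeeds, because the successful write changes the ModifyIndex that the second one is checked against.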

How does Raft guarantee consistency when a network partition occurs?

Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Isn't that too expensive? Does the leader have to talk to a majority of nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error in detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
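The intersection requirement is easy to state in code if we assume quorums are defined purely by size (any q nodes out of n form a quorum). Under that assumption, which is a simplification of what Flexible Paxos allows, an election quorum of size q1 and a replication quorum of size q2 always share a node exactly when q1 + q2 > n. The function name below is invented for this sketch.

```go
package main

import "fmt"

// quorumsIntersect reports whether every election quorum of size q1 must share
// at least one node with every replication quorum of size q2 among n nodes.
func quorumsIntersect(n, q1, q2 int) bool {
	return q1 <= n && q2 <= n && q1+q2 > n
}

func main() {
	fmt.Println(quorumsIntersect(5, 3, 3)) // majorities for both: safe
	fmt.Println(quorumsIntersect(5, 5, 1)) // all nodes elect, one node replicates: safe
	fmt.Println(quorumsIntersect(5, 2, 3)) // quorums can be disjoint: unsafe
}
```

The second configuration is the one described above: the leader can serve reads locally, at the price that a single node failure blocks any new election.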
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers, it maintains a staleness measure: how long has it been since it last received an okay from a quorum? Before answering a read, the leader checks this staleness measure and returns a StalenessExceeded error if it is too high. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This can be improved further when the client includes its timestamp in the request, the raft server echoes it back in the response, and the client adds the round-trip time to the raft staleness. (NB. Always use the nano clock when measuring time differences because it doesn't go backwards like the system clock does.)
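A minimal sketch of such a staleness measure in Go might look like this. The `Leader` type, its fields and the `maxStaleness` limit are hypothetical and not part of any Raft library; the sketch returns the staleness alongside the value so the caller can apply its own threshold too. In Go, values from `time.Now` carry a monotonic clock reading, which plays the role of the "nano clock" mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStalenessExceeded signals that the leader has not heard from a quorum recently.
var ErrStalenessExceeded = errors.New("staleness exceeded")

// Leader is a hypothetical leader that remembers when a quorum last acknowledged it.
type Leader struct {
	lastQuorumAck time.Time
	maxStaleness  time.Duration
	value         string
}

// OnQuorumAck is called whenever a heartbeat round is acknowledged by a quorum.
func (l *Leader) OnQuorumAck(now time.Time) { l.lastQuorumAck = now }

// Read returns the value together with its staleness, or an error if the
// staleness exceeds the configured limit.
func (l *Leader) Read(now time.Time) (string, time.Duration, error) {
	staleness := now.Sub(l.lastQuorumAck)
	if staleness > l.maxStaleness {
		return "", staleness, ErrStalenessExceeded
	}
	return l.value, staleness, nil
}

func main() {
	start := time.Now() // includes a monotonic reading, so Sub ignores wall-clock jumps
	l := &Leader{maxStaleness: 500 * time.Millisecond, value: "v1"}
	l.OnQuorumAck(start)
	fmt.Println(l.Read(start.Add(100 * time.Millisecond))) // fresh enough
	fmt.Println(l.Read(start.Add(2 * time.Second)))        // too stale, error returned
}
```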
Not sure whether timeout configuration can solve this problem:
2 x heartbeat interval <= election timeout
This means that when a network partition happens, leader A remains the only leader at first. Writes to it will fail because A is in the minority and cannot get acknowledgements from a majority of the nodes, so it steps down to follower.
After that, leader B is elected. It can catch up on the latest changes from at least one of the followers, and then clients can perform reads and writes on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explanation
Let's understand it with a code example from HashiCorp's raft implementation.
There are 2 timeouts involved (their names are self-explanatory, but links are included for the detailed definitions):
LeaderLease timeout[1]
Election timeout[2]
Example values are 500ms and 1000ms respectively[3].
A node must satisfy this condition in order to start: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked whether it is heartbeating with a quorum of followers[6, 7]. If the heartbeats stop, this is tolerated until the LeaderLease timeout[8]. If the Leader is not able to contact a quorum of nodes for the LeaderLease timeout, it has to become a Follower[9].
Hence, for the example given in the question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows it's not the Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

What's the difference between Paxos and W+R>N in Cassandra?

Dynamo-like databases (e.g. Cassandra) can enforce consistency by means of quorum, i.e. a number of synchronously written replicas (W) and a number of replicas to read (R) should be chosen in such a way that W+R>N where N is a replication factor. On the other hand, PAXOS-based systems like Zookeeper are also used as a consistent fault-tolerant storage.
What is the difference between these two approaches? Does PAXOS provide guarantees that are not provided by the W+R>N scheme?
Yes, Paxos provides guarantees that are not provided by the Dynamo-like systems and their read-write quorums. The difference is how failures are handled and what happens during a write. After a successful write, both kinds of systems behave similarly. The data will be saved and available for reading afterwards (until overwritten or deleted) and so on.
The difference appears during a write and after failures. Until you get a successful answer from W nodes when writing something to an eventually consistent system, the data may have been written to some nodes and not to others, and there is no guarantee that the whole system agrees on the current value. If you try to read the data back at this point, some clients may get the new data back and some the old data. In other words, the system is not immediately consistent. This is because writes aren't atomic across nodes in these systems. There are usually mechanisms to "heal" an inconsistency like this, and "eventually" the system will become consistent again (i.e. reads will once again always return the same value, until something new is written). This is the reason why they are often called "eventually consistent". Inconsistencies can (and will) appear, but they will always be dealt with and reconciled eventually.
With Paxos, writes can be made atomic across nodes and inconsistencies between nodes are therefore possible to avoid. The Paxos algorithm makes it possible to guarantee that non-faulty nodes never disagree on the outcome of a write, at any point in time. Either the write succeeded everywhere or nowhere. There will never be any inconsistent reads at any point (if it's correctly implemented and if all the assumptions hold, of course). This comes at a cost, however. Mainly, the system may need to delay some requests and be unavailable when for example too many nodes (or the communication between them) aren't working. This is necessary to assure that no inconsistent replies are given.
To summarize: the main difference is that the Dynamo-like systems can return inconsistent results during writes or after failures for some time (but will eventually recover from it), whereas Paxos based systems can guarantee that there are never any such inconsistencies by sometimes being unavailable and delaying requests instead.
Paxos is non-trivial to implement, and expensive enough that many systems using it use hints as well, or use it only for leader election, or something. However, it does provide guaranteed consistency in the presence of failures - subject of course to the limits of its particular failure model.
The first quorum based systems I saw assumed some sort of leader or transaction infrastructure that would ensure enough consistency that you could trust that the quorum mechanism worked. This infrastructure might well be Paxos-based.
Looking at descriptions such as https://cloudant.com/blog/dynamo-and-couchdb-clusters/, it would appear that Dynamo is not based on an infrastructure that guarantees consistency for its quorum system - so is it being very clever or cutting corners? According to http://muratbuffalo.blogspot.co.uk/2010/11/dynamo-amazons-highly-available-key.html, "The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, later it becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent in the presence of multiple concurrent write requests since the replicas may diverge due to multiple coordinators." (end quote)
So, it would appear that in the case of quorums as implemented in Dynamo, Paxos provides stronger reliability guarantees.
Paxos and the W+R>N quorum try to solve slightly different problems. Paxos is usually described as a way to replicate a state machine, but in fact it is more of a distributed log: each item written to the log gets an index, and the different servers eventually hold the same log items + their index. (Replicated state machine can be achieved by writing to the log the inputs to the state machine and each server replays the state machine on the agreed inputs according to their index). You can read more about Paxos in a blog post I wrote here.
The W+R>N quorum solves the problem of sharing a single value among multiple servers. In academia this is called a "shared register". A shared register has two operations, read and write, where we expect a read to return the value of the previous write.
So, Paxos and the W+R>N quorum live in different domains, and have different properties (e.g., Paxos saves an ordered list of items). However, Paxos can be used to implement a shared register, and a W+R>N quorum can be used to implement a distributed log (although, very inefficiently).
Having said all of the above, sometimes W+R>N quorums aren't implemented in their "fully robust" way, as that would require more than one communication round. Thus, in systems that want low latency, it is possible that their implementation of W+R>N quorums provides weaker properties (e.g., conflicting values can coexist).
To sum up, theoretically, Paxos and W+R>N can achieve the same goals. Practically, it would be very inefficient, and each one is better for something slightly different. Even more practically, W+R>N isn't always implemented fully, thus sacrificing some consistency properties for speed.
Update: Paxos supports a very general failure model: messages can be dropped, nodes can crash and restart. The W+R>N quorum scheme has different implementations, many of which assume less general failures. So, the difference between the two also depends on the assumptions about which failures are supported.
There is no difference. The definition of a quorum says that the intersection of any two quorums is not empty. A simple majority quorum is an example, NOT a definition. Take a look at Dr. Lamport's later paper "Vertical Paxos", where he gave some other possible configurations of quorums.
The multi-decree Paxos protocol (AKA Multi-Paxos) is, in steady state, just two-phase commit. Ballot number changes are only needed when the leader fails.
Zookeeper's replication protocol (ZAB) and Raft are both based on Paxos. The differences are in fault detection and in the transition after a leader fails.
As mentioned in other answers, in an R+W > N system, the writes are not atomic on all nodes which means that when a write is in progress (or during a write failure) some nodes will have newer values and some older ones. Take an example of a system where n=3, r=2, and w=2. For clarity let's assume the 3 nodes are named A, B, and C. Consider this scenario: a write is in progress; node A has been updated while B and C are still in process of receiving the updated value. Clients reading from A and B will see the newer value (resolved using version vectors or last write wins) and clients reading from B and C will see old values. This type of read is not considered linearizable. Such issues will not occur with proper linearizable systems such as Paxos or Raft.
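The scenario above can be replayed in a few lines of Go. This is only an illustration of the n=3, r=2, w=2 example: the `versioned` type and `readQuorum` function are made up for the sketch, and a plain integer version stands in for version vectors or last-write-wins timestamps.

```go
package main

import "fmt"

// versioned is a value tagged with a version; a higher version means a newer write.
type versioned struct {
	version int
	value   string
}

// readQuorum returns the newest value among the replicas that were contacted.
func readQuorum(replicas map[string]versioned, contacted ...string) versioned {
	var newest versioned
	for i, name := range contacted {
		if v := replicas[name]; i == 0 || v.version > newest.version {
			newest = v
		}
	}
	return newest
}

func main() {
	// A write of version 2 is in progress: only A has applied it so far.
	replicas := map[string]versioned{
		"A": {version: 2, value: "new"},
		"B": {version: 1, value: "old"},
		"C": {version: 1, value: "old"},
	}
	fmt.Println(readQuorum(replicas, "A", "B").value) // new
	fmt.Println(readQuorum(replicas, "B", "C").value) // old
}
```

Both reads contact a valid quorum of two replicas, yet they disagree while the write is still in flight, which is exactly the non-linearizable behaviour described.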

When to use Paxos (real practical use cases)?

Could someone give me a list of real use cases of Paxos. That is real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the in-memory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same in-memory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correctly. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most? Then if one server had an error, the client would still get the correct state just by choosing the majority.
You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the user's actions. For example, if a user decided to raise by $50, your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the $50 raise, and your system will be inconsistent (in the sense that the client thinks that the $50 raise was made, while the surviving servers are oblivious to it).
Notice that a majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the $50 raise data. In this case, using a majority could be even worse: if you get a response from the two servers with the data, you'll think the $50 raise was performed. But if one of them fails, then you won't have a majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine in a fault-tolerant manner, where "state machine" can be thought of as an algorithm that has some initial state and advances that state deterministically according to messages received from the outside (i.e., from the web client).
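Here is a small Go sketch of what "replicating a state machine" means for the poker example. It leaves out the consensus protocol entirely and simply assumes the replicas have already agreed on the same log; the `Command`, `State` and `Replay` names are invented for illustration. The point is that a deterministic state machine fed the same log in the same order ends up in the same state on every replica.

```go
package main

import "fmt"

// Command is one entry in the agreed log, e.g. a player's raise.
type Command struct {
	Player string
	Raise  int
}

// State is the deterministic state machine: the pot and each player's total bet.
type State struct {
	Pot  int
	Bets map[string]int
}

// Apply advances the state by one command.
func (s *State) Apply(c Command) {
	s.Pot += c.Raise
	s.Bets[c.Player] += c.Raise
}

// Replay builds a state from scratch by applying the log in index order.
func Replay(log []Command) *State {
	s := &State{Bets: map[string]int{}}
	for _, c := range log {
		s.Apply(c)
	}
	return s
}

func main() {
	log := []Command{{"alice", 50}, {"bob", 50}, {"alice", 100}}
	replica1, replica2 := Replay(log), Replay(log)
	fmt.Println(replica1.Pot, replica2.Pot) // 200 200
}
```

In this picture, the $50 raise is only acknowledged to the client once it has been agreed as a log entry, so a crash of one server cannot make it disappear.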
More properly, Paxos should be considered as a distributed log, you can read more about it here: Understanding Paxos – Part 1
Update 2018:
MySQL High Availability uses Paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to write operations. Cassandra has no master node, so two conflicting operations can be issued concurrently at multiple nodes. When using the if-not-exists syntax, the Paxos algorithm is used to order operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. As long as a majority of Cassandra nodes are up it will work. So if you define the replication factor of your keyspace to be 3 then 1 node can fail, if 5 then 2 can fail, etc.
For normal writes Cassandra allows multiple conflicting writes to be accepted by different nodes which may be temporarily unable to communicate. In that case it doesn't use Paxos, so it can lose data when two writes occur at the same time for the same key. There are special insert-only data structures built into Cassandra that won't lose data.
Poker and Paxos:
As other answers note, poker is turn-based and has rules. If you allow one master and many replicas then the master arbitrates the next action. Let's say a user first clicks the "check" button, then changes their mind and clicks "fold". Those are conflicting commands, and only the first should be accepted. The browser should not let them press the second button; it should disable it when they press the first one. Since money is involved, the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game. Which replica can become master, and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strongly consistent service. We can use Cassandra to create a lease for the master node. The replicas can time out on the master and attempt to take the lease. As Cassandra is using Paxos it is fault-tolerant; you can still read or update the lease even if Cassandra nodes crash.
In the above example the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast, as messages flow in one direction. When the master crashes there may be race conditions among replicas trying to become the master. Using Paxos at that point gives you strong consistency on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.
Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper
Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)
In the case you describe, you're right, Paxos isn't really necessary: a single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a game like poker, where there's a strict turn order and a single active player, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.
