voting algorithm in distributed systems

voting algorithm in distributed systems - algorithm

assume distributed systems network. Each system measures a value. There is a correct decision to be made in consensus by all systems depending on all values. communication links may drop. Is there a voting and synchronization algorithm for this case?

Examples of voting algorithm in distributed systems:
Bully algorithm (http://en.wikipedia.org/wiki/Bully_algorithm)
Chang and Roberts algorithm (http://en.wikipedia.org/wiki/Chang_and_Roberts_algorithm)

I have solved a similar problem. It is a failure detection scheme, so I'll describe it in those terms instead of the generic terms of the OP.
Clients ping our servers periodically, and after some time of no pings the client is considered dead or behind a network partition. (They are the same to us.) Because the clients can pick an arbitrary server to connect to, different servers have different views on if the client is dead or alive.
Our servers use a gossip/epidemic protocol to exchange their view of the clients with each other. This is where the logic comes that one server's data is better than another's. The nice thing about an epidemic protocol is that it is light on the network, yet will still converge.
When a decision is made (in our case, declaring that a client is dead) any of the servers has a tolerably up-to-date table of all the client's heartbeats. Any server is free to make the decision, which we do via a consensus protocol amongst themselves (Paxos or Raft). Note that the server may be wrong with its decision—but it's unlikely that it doesn't have a somewhat up-to-date table but still run a successful Paxos round.

Related

Simple leader election (Stateless leader election)

I am building an app in golang that I would like to be fault-tolerant. I looked at different algorithms like RAFT and Paxos and their implementations in golang (etcd's raft, hashicorp's raft), but I feel like they might be an overkill for my specific use case.
In my application, the nodes just wait in standby and act as failovers in case the leader fails. I do not need to replicate any states throughout the cluster. All I need is the following properties:
If a node is a leader:
Run a given code
If a node is not a leader:
Wait for a leader to fail
Reelect the leader once the existing leader fails
Any suggestions?

Since you want a leader election protocol it sounds like you want to avoid having more than one node acting as the leader at once. The answer really depends on how strictly you require this property. In some cases it is acceptable to occasionally have more than one node acting as the leader; perhaps the worst that happens is a bit of duplicated work. In other cases the whole system may operate incorrectly if there's ever any duplicate leaders, so you must be much more careful.
If you can accept occasional cases of duplicate leaders then a simpler protocol may be for you. However, if you absolutely cannot tolerate having more than one leader at once then you will have to combine your leader election protocol with some kind of replication of state, and a proven implementation of Paxos or Raft or similar is a very good way to do this. There's lots of subtly different protocols for this but they're all basically doing the same thing.
The fundamental problem here is pinning down what "at once" means in a realistic network in which messages may sometimes be delivered after a very long delay. Typically one assumes that the network is completely asynchronous with no time bounds on delivery, and indeed Paxos, Raft etc. are all designed to work correctly under that assumption. These algorithms work around this by defining their own internal notion of time (ballots in Paxos, terms in Raft) and attaching this "internal time" to all state transitions under their control. This gives some very strong guarantees and, in particular, ensures that no two nodes may take actions as leader at the same "internal time".
If you don't replicate any state via something like Paxos or Raft then you won't be able to make use of this strong notion of internal time.

You can use the client go Kubernetes library if you will be deploying it in a Kubernetes cluster for your specific use case.
https://github.com/kubernetes-client/go

which consensus algorithm is synchronous in nature?

There are different consensus algorithm, which are used in permission-oriented blockchain, such as
PAXOS
RAFT
Byzantine General Model
Which of the consensus algorithms are synchronous and asynchronous and why ? Please explain in detail.
Thanks

*I am not an expert on distributed systems still i will try to answer your question.
In distributed systems, People use an underlying model that assumes some properties about time (“how long will it take for this message to arrive?”) and some properties about the types of faults (“how can nodes in the protocol do the wrong thing?”).
There are three main types of timing models usually used for distributed systems the synchronous model, the asynchronous model and the partially synchronous model. Each of these models makes some guarantees about the length of time (“latency”) that can occur between the exchange of messages amongst nodes in a given round of the protocol execution. This categorization is important because in the distributed setting a single node cannot distinguish between a peer node that has failed and a peer node that is just taking a long time to respond.
In the synchronous model, there is some maximum value (“upper bound”) T on the time between when a node sends a message and when you can be certain that the receiving node hears the message. You also have an upper bound P on the relative difference in speed between nodes (so you can account for machines with slow processors).
In the asynchronous model, we remove both upper bounds T and P. Messages can take arbitrarily long to reach peers and each node can take an arbitrary amount of time to respond. When we say arbitrary, we include “infinity” meaning that it takes forever for some event to occur.
The partially synchronous model in a mix of the two: upper bounds exist for T and P but the protocol designer does not know them and the task is designing mechanisms that still come to consensus in light of this fact. In practice, protocol implementers can achieve systems resembling this model given the realistic characteristics of modern networks/machines (messages usually get where they are going) and use of tactics like timeouts to indicate when a node should retry sending a message.
Keeping in mind the above facts, Both Paxos and Raft belongs to the partial synchronous models.
The Byzantine Generals’ Problem is a classic problem faced by any distributed computer system network. Aim is to maintain same state on all participant nodes in presence of malicious nodes.
In distributed systems, there a collection of hard problems that you constantly need to deal with.
Things fail. You can never count on anything being reliable. Even if you have
perfectly bug-free software, and hardware that never breaks, you’ve still got
to deal with the fact that network connections can break, or messages within a
network can get lost, or that some bozo might sever your network connection
with a bulldozer. (That really happened while I was at Google!)
Given (1), you can never rely on one copy of anything, because that copy might
become unavailable due to a failure. So you need to keep multiple copies, and
those copies need to be consistent – meaning that at any time, all of the
copies agree about their contents.
There’s no way to maintain a single completely consistent view of time between
multiple computers. Due to inconsistencies in individual machine performance,
and variable network delays, variable storage latency, and several other
factors, there’s no canonical way of saying that for two events X and Y, “X
happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.
In short, everything can fail at any time; after failure, participants can recover and rejoin the system; any no part of the system acts in an actively adversarial way(byzantine failures may be because of malware).
To solve this problem we have consensus algorithm with the aim to make all participants to agree on the same state.
Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available.
Paxos and Raft are consensus algorithms which solves byzantine general problem in distributed networks public or private.

Real world example of Paxos

Can someone give me a real-world example of how Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation etc..). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of Paxos protocol? How is when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.

For example, we have MapReduce system where master consists of 3 hosts. One is master and others are slaves. The procedure of choosing master uses Paxos algorithm.
Also Chubby of Google Big Table uses Paxos: The Chubby Lock Service for Loosely-Coupled Distributed Systems, Bigtable: A Distributed Storage System for Structured Data

The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.
The Coordinator is the node the transaction originated on
Participants are the nodes that modified the database on behalf of
the transaction Readers are nodes that executed code on behalf of the
transaction but did not modify any state
Acceptors are the nodes that log the state of the transaction.
The following steps are taken when performing a transaction commit:
Coordinator sends a PREPARE message to each Participant.
The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
Coordinator sends ACCEPT messages to Acceptors.
The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
Coordinator tells the user the commit succeeded.
Coordinator sends COMMIT messages to each Participant and Reader.
The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
Coordinator removes internal state and is now done.
This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application level would need to do is perform exception handling for deadlock conflicts. The other key to implementing a database at scale is concurrency, which is generally helped via MVCC (Multi-Version concurrency control).

Can someone give me a real-world example of how Paxos algorithm is
used in a distributed database?
MySQL uses Paxos. This is why a highly available MySQL setup needs three servers. In contrast, a typical Postgres setup is a master-slave two-node configuration which isn't running Paxos.
I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka multi-Paxos) is optimally efficient in terms of messages as in a three node cluster, in steady state, the leader accepts it's own next value, transmits to both of the other two nodes, and knows the value is fixed when it gets back one response. It can then put the commit message (the learning message) into the front of the next value that it sends.
A simple example could be a banking application where an account is
being modified through multiple sessions (i.e. a deposit at a teller,
a debit operation etc..). Is Paxos used to decide which operation
happens first?
Yes if you use a MySQL database cluster to hold the bank accounts then Paxos is being used to ensure that the replicas agree with the master as to the order that transactions were applied to the customer bank accounts. If all the nodes agree on the order that transactions were applied they will all hold the same balances.
Operations on a bank account cannot be reordered without coming up with different balances that may violate the business rules of not exceeding your credit. The trivial way to ensure the order is to just use one server process that decides the official order simply based on the order of the messages that it receives. It can then track the balances of each bank account and enforce the business rules. Yet you don't want just a single server as it may crash. You want replica servers that are also receiving the credit and debit commands and agree with the master.
The challenge with having replicas that should hold the same balances are that messages may be lost and resent and messages are buffered by switches that may deliver some messages late. The net effect is that if the network is unstable it is hard to prove that fast replication protocols will never cause different servers to see that the messages arrived in different orders. You will end up with different servers in the same cluster holding different balances.
You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master, one or more slaves, and the master waits until it has got acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented database vendors just created expensive hardware designed to have very high redundancy and reliability to run master-slave. What was revolutionary about Paxos is that it does work with commodity networking and without specialist hardware.
Since banking applications were profitable with expensive custom hardware it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware with built-in reliable networking that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking and Paxos will make it reliable rather than using specialist hardware.
Also, what does one mean by multiple instances of Paxos protocol? How
is when is this used?
I wrote a blog about multi-Paxos being the original Paxos protocol. Simply put, in the case of choosing the order of transactions in a cluster, you want to stream the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog about Paxos for cluster replication the algorithm is very efficient in steady-state needing only one round trip between the master and enough nodes to have a majority which is one other node in a three node cluster. When there are crashes or network issues the algorithm is always safe but needs more messages. So to answer your question typical applications need multiple rounds of Paxos to establish the order of client commands in the cluster.
I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details to do cluster replication. So we can expect that people who are specifically trying to implement cluster replication would use Raft as it leaves nothing for the implementor to have to figure out for themselves.
So when might you use Paxos? It can be used to change the cluster membership of a cluster that is writing values based on a different protocol that can only be correct when you know the exact cluster membership. Corfu is a great example of that where it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.

When to use Paxos (real practical use cases)?

Could someone give me a list of real use cases of Paxos. That is real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the inmemory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same inmemory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correct. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most. So if one server had an error the client will still get the correct state from just choosing the majority.

You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the the user's actions. For example, if a user decided to raise by 50$ - your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web-client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the 50$ raise, and your system will be inconsistent (in the sense that the client thinks that the 50$ raise was made, while the surviving servers are oblivious of it).
Notice that majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the 50$ raise data. In this case, using majority could even be worse: if you get a response from the two servers with the data, you'll think the 50$ raise was performed. But if one of them fails, then you won't have majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine, in a fault tolerant manner. Where "state machine" can be thought of as an algorithm that has some initial state, and it advances the state deterministically according to messages received from the outside (i.e., the web-client).
More properly, Paxos should be considered as a distributed log, you can read more about it here: Understanding Paxos – Part 1

Update 2018:
Mysql High Availability uses paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to write operations. Cassandra has no master node so two conflicting operations can to be issued concurrently at multiple nodes. When using the if-not-exists syntax the paxos algorithm is used order operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. As long as a majority of Cassandra nodes are up it will work. So if you define the replication factor of your keyspace to be 3 then 1 node can fail, of 5 then 2 can fail, etc.
For normal writes Caassandra allows multiple conflicting writes to be accepted by different nodes which may be temporary unable to communicate. In that case doesn't use Paxos so can loose data when two Writes occur at the same time for the same key. There are special data structures built into Cassandra that won't loose data which are insert-only.
Poker and Paxos:
As other answers note poker is turn based and has rules. If you allow one master and many replicas then the master arbitrates the next action. Let's say a user first clicks the "check" button then changes their mind and clicks "fold". Those are conflicting commands only the first should be accepted. The browser should not let them press the second button it will disable it when they pressed the first button. Since money is involved the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game. Which replica can become master and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strong consistently service. We can use Cassandra to create a lease for the master node. The replicas can timeout on the master and attempt to take the lease. As Cassandra is using Paxos it is fault tolerant; you can still read or update the lease even if Cassandra nodes crash.
In the above example the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast as messages flow in one direction. When the master crashes there may be race conditions in replicas trying to be the master. Using Paxos at that point gives you strong consistently on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.

Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper

Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)

In the case you describe, you're right, Paxos isn't really necessary: A single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a poker game in general, where there's a strict turn order and a single active player as in poker, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.

Low-latency, large-scale message queuing

I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...) so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have much higher rate and higher number of listeners (say, a "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, to talk to each of the gateways. The number of connections on the gateways would be alot higher: 10,100 for the clients + connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now)
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any daytime experience with these technologies, or implementations thereof, that you can share?

#MSalters
Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
Re UDP:
I am not sure if UDP would be useful here. If you turn off Nagling then RabbitMQ single message roundtrip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher) http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.

I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (smaller number of total queues, and consumers on those queues are getting messages that they don't have an interest in).
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim. daemon that I built using libev and is entirely single-threaded (and is showing very good scaling -- it should be able to handle 50,000 connections on one box without any problems -- our sim. tick rate is quite low though, and there are no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.

Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
Please do take a look. We are here to help. Feel free to email us at info#rabbitmq.com or hit me on twitter (#monadic).

My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
Also, for similar reasons, don't assume that a message queue architecture is the best for all comminication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery. The reason that you need to share this information is because it changes all the time. So, if a message is lost, you don't want to spend time recovering it. You'd only send the old locatiom of the affected entity. Instead, you'd want to send the current location of that entity. Technology-wise this means you want UDP, not TCP and a custom loss-recovery mechanism.

FWIW, for cases where intermediate results are not important (like positioning info) Qpid has a "last-value queue" that can deliver only the most recent value to a subscriber.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio