Is Akka cluster a good choice for running multiple socket clients - websocket

I need to run 1000+ socket clients across several nodes. Each client listens to a unique endpoint.
https://doc.akka.io/docs/akka-http/current/client-side/websocket-support.html
I plan to represent each client as an actor with supervision, recovery, event handling, etc., and hope that Akka Cluster will:
spawn my actors, where each actor represents a socket client
let supervision policies recover actors from failing states
evenly distribute actors across several nodes
guarantee that I have at most one actor for each client
let Akka messaging provide communication between actors with at-least-once semantics.
What else should I expect?

In general the Akka ecosystem has some reasonable patterns for what you're looking for:
actor per client is a reasonable starting point; supervision helps a lot
cluster sharding will evenly distribute actors across the cluster nodes and generally maintain at most one incarnation per sharding ID (you thus need to map each client to a distinct sharding ID). In a split-brain scenario, there is the chance that you'd get the same client on both sides of a network partition: the split-brain resolver configuration governs the consistency vs. availability trade-off (basically it's up to you whether you'd rather have the failure mode be no active instance of a client or multiple active instances of a client).
For ensuring that when a shard moves in a rebalancing, the clients in that shard are started on the new node, the two options are "remember entities" in cluster sharding and having a cluster singleton which regularly pings each client actor. If the set to keep-alive is reasonably static, I tend to prefer the singleton approach.
Akka's basic message semantics are at-most-once, but there are well-established and well-documented patterns (most notably the ask pattern) for at-least-once.
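To make the at-most-once versus at-least-once distinction concrete, here is a minimal Python sketch of the general retry-until-acknowledged idea. It is not Akka code and does not use any Akka API: `FlakyChannel`, `Receiver` and `send_at_least_once` are hypothetical names made up for illustration. The point it tries to show is that retries turn lost messages into possible duplicates, so the receiver has to deduplicate by message ID.

```python
import random


class Receiver:
    """Deduplicates by message id so that redelivery is harmless."""

    def __init__(self):
        self.seen = set()
        self.processed = []
        self.deliveries = 0

    def handle(self, msg_id, payload):
        self.deliveries += 1
        if msg_id in self.seen:
            return  # duplicate caused by a retry, ignore it
        self.seen.add(msg_id)
        self.processed.append(payload)


class FlakyChannel:
    """Hypothetical transport that randomly loses messages or acks."""

    def __init__(self, receiver, drop_rate, seed):
        self.receiver = receiver
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)

    def send(self, msg_id, payload):
        # Message lost on the way: the receiver never sees it.
        if self.rng.random() < self.drop_rate:
            return False
        self.receiver.handle(msg_id, payload)
        # Ack lost on the way back: the sender cannot tell the difference.
        if self.rng.random() < self.drop_rate:
            return False
        return True


def send_at_least_once(channel, msg_id, payload, max_attempts=50):
    """Resend until an ack arrives; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if channel.send(msg_id, payload):
            return attempt
    raise TimeoutError(f"no ack for message {msg_id}")
```

With a single `send` call per message you get at-most-once delivery; wrapping it in `send_at_least_once` means every message is eventually processed, at the cost of the receiver possibly seeing it more than once.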

Related

Consul support or alternative for 2 nodes

I want to use Consul for a 2-node cluster. The drawback is that there is no failure tolerance with two nodes:
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to make a consistent leader election with only two nodes? Can Consul Raft Consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer the protocol question: no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine: it will be available and make progress like any other cluster. It's just that when one machine goes down, because your fault tolerance is zero, you will lose availability and no longer make progress. But safety is never compromised: your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
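The 2f + 1 rule can be checked with a few lines of arithmetic. The sketch below assumes a plain majority quorum (as in Raft); the function names are made up for illustration.

```python
def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can fail while a majority is still reachable."""
    return n - majority(n)


def nodes_needed(f):
    """Cluster size required to tolerate f failures."""
    return 2 * f + 1


for n in range(1, 6):
    print(n, "nodes: majority", majority(n), "tolerates", tolerated_failures(n))
```

For two nodes the majority is two, so zero failures are tolerated. Three nodes tolerate one failure, and a fourth node adds nothing until you reach five.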
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as it already is in Raft), but only require a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical flexible paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
On changing the protocol: there are impossibility results here. A majority-quorum protocol cannot tolerate f fail-stop failures with fewer than 2f + 1 nodes, and the FLP result shows that in an asynchronous system no deterministic protocol can guarantee progress even with a single failure. Safety can still be provided, but progress (liveness) cannot be ensured.
I think the options suggested in the earlier post are the best.
The leader election approach described in Consul's own documentation requires 3 nodes. It relies on the health-check mechanism as well as on sessions. Sessions are essentially distributed locks that are released automatically by TTL or when the service crashes.
To build a 2-node Consul cluster, we have to use another approach, usually called a leader lease. Since we already have the Consul KV-storage with CAS support, we can simply write to it which machine is the leader until a given expiration time. As long as the leader is alive and well, it can periodically extend its time. If the leader dies, another node will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and, whenever the leader performs any action, to verify that it has enough time left to complete this action.
A key is created in the KV-storage, containing something like “node X is the leader until time Y”, where Y is calculated as the current time plus some time interval T. As leader, node X updates the record once every T/2 or T/3 units of time, thereby extending its leadership. If the node fails or cannot reach the KV-storage, then after the interval T its place will be taken by whichever node first discovers that the leadership role has been released.
CAS is needed to prevent a race condition if the two nodes simultaneously try to become the leader. The cas parameter tells Consul to use a Check-And-Set operation, which is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key.
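The lease scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryKV` is a hypothetical in-memory stand-in that mimics the check-and-set rules quoted above (index 0 means "create only if absent", non-zero must match the modify index). It is not a Consul client, and the key name and function names are invented. Time is passed in explicitly so the logic is easy to follow.

```python
import json

LEASE_KEY = "service/leader"  # hypothetical key name


class InMemoryKV:
    """Stand-in for a KV store with Consul-style check-and-set semantics."""

    def __init__(self):
        self._data = {}  # key -> (value, modify_index)
        self._index = 0

    def get(self, key):
        """Return (value, modify_index), or (None, 0) if the key is absent."""
        return self._data.get(key, (None, 0))

    def put_cas(self, key, value, cas):
        """Write only if cas equals the key's modify index (0 = key absent)."""
        _, current = self._data.get(key, (None, 0))
        if cas != current:
            return False
        self._index += 1
        self._data[key] = (value, self._index)
        return True


def try_acquire_or_renew(kv, node, now, ttl):
    """Return True if `node` holds the lease after this call."""
    raw, index = kv.get(LEASE_KEY)
    if raw is not None:
        lease = json.loads(raw)
        if lease["leader"] != node and now < lease["expires"]:
            return False  # another node holds an unexpired lease
    new_lease = json.dumps({"leader": node, "expires": now + ttl})
    # CAS against the index we just read: if another node wrote in
    # between our read and our write, this put is rejected.
    return kv.put_cas(LEASE_KEY, new_lease, cas=index)


def is_leader(kv, node, now, needed):
    """True if `node` holds the lease with at least `needed` time left."""
    raw, _ = kv.get(LEASE_KEY)
    if raw is None:
        return False
    lease = json.loads(raw)
    return lease["leader"] == node and now + needed <= lease["expires"]
```

A node would call `try_acquire_or_renew` every T/2 or T/3, and call `is_leader` with the expected duration of an action before starting it. Against a real Consul cluster the same read-then-CAS shape applies, but clock skew between the two machines has to be bounded for the lease to be safe.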

How do I practically use Raft algorithm

In the Raft paper, they mention that all client interaction happens with the leader node. What I don't understand is that the leader keeps changing. So let's say my cluster is behind a load balancer. How do I notify the load balancer that the leader has changed? Or, if I'm correct, can the load balancer send client requests to any of the nodes (follower or leader), with the follower node being responsible for forwarding the request to the leader?
After the vote finishes, you will have a leader (new or old). It is the leader's responsibility to send heartbeats to all the nodes in the network at a regular interval (smaller than the network's keep-alive time but larger than the maximum round-trip time).
Your load balancer should update its record of the leader every time it receives a heartbeat. The load balancer sends data only to the leader because, according to the Raft algorithm, all client requests go directly to the leader; other nodes cannot send data, only acknowledgments to vote and append commands.
There is a really good presentation on this here: Raft: Log-Replication
There are really two ways this can be done: either the load balancer needs to understand where the leader is or the followers can proxy requests to the leader.
There's nothing wrong with proxying client requests to the leader through a follower, and in fact there can be major benefits to it. Many Raft implementations allow clients to read from followers while maintaining sequential consistency. This can still be safely done with a load balancer sending requests to arbitrary nodes if the client keeps track of the last index it has seen and sends that with each request to ensure it does not see state go back in time. I won't write the full algorithm here, but this is described in the Raft dissertation which you should consult.
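As a rough illustration of the "client keeps track of the last index it has seen" idea, here is a small Python sketch. It is a simplification, not the algorithm from the Raft dissertation: the class names are hypothetical, and a node that is behind simply rejects the read instead of waiting until it has caught up.

```python
class StaleRead(Exception):
    """Raised when a node is behind what the client has already seen."""


class Replica:
    """A node that applies committed entries in log order."""

    def __init__(self):
        self.applied_index = 0
        self.state = {}

    def apply(self, key, value):
        self.applied_index += 1
        self.state[key] = value

    def read(self, key, min_index):
        # Refuse to answer from a state older than the client has observed.
        if self.applied_index < min_index:
            raise StaleRead(f"at {self.applied_index}, need {min_index}")
        return self.state.get(key), self.applied_index


class Client:
    """Remembers the highest index it has seen and sends it with each read."""

    def __init__(self):
        self.last_seen = 0

    def read(self, replica, key):
        value, index = replica.read(key, self.last_seen)
        self.last_seen = max(self.last_seen, index)
        return value
```

If a load balancer routes one read to an up-to-date node and the next to a lagging follower, the second read raises `StaleRead` rather than showing the client an older value. A real implementation would more likely make the follower wait or redirect.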
But using a load balancer in this manner can become unsafe in certain cases. If clients are allowed to send multiple concurrent requests, the load balancer could route those requests through different nodes and they could arrive at the leader out of order. This can be accounted for by attaching a sequence number to client requests and reordering requests on the leader. But to do so, the implementation has to include sessions to allow the leader to track per-client state.
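The sequence-number approach can be sketched like this. It is a minimal, hypothetical model of per-client sessions on the leader (the class and field names are invented): out-of-order requests are buffered until the gap is filled, and requests that were already applied are dropped.

```python
class LeaderSessions:
    """Tracks per-client state so requests are applied in sequence order."""

    def __init__(self):
        self.next_seq = {}  # client id -> next expected sequence number
        self.pending = {}   # client id -> {sequence number: command}
        self.log = []       # (client id, command) in applied order

    def receive(self, client_id, seq, command):
        expected = self.next_seq.setdefault(client_id, 1)
        if seq < expected:
            return  # already applied: a duplicate or a late retry
        buffered = self.pending.setdefault(client_id, {})
        buffered[seq] = command
        # Apply every command that is now contiguous with what was applied.
        while expected in buffered:
            self.log.append((client_id, buffered.pop(expected)))
            expected += 1
        self.next_seq[client_id] = expected
```

For example, if request 2 from a client arrives before request 1, nothing is appended to the log until request 1 shows up, at which point both are applied in order.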
Typically, though, Raft clients connect to specific nodes and stay connected to them for as long as possible to reduce the overhead of maintaining consistency while switching servers. If an implementation supports reading from followers, it can still be costly to switch servers since servers have to wait for state to catch up to maintain sequential consistency.

Active MQ load balancing to achieve high throughput

Currently my ActiveMQ configuration (non-persistent messaging) allows me to achieve 2000 msgs/sec. There are four queues and four consumers consuming the messages. There's only one ActiveMQ broker in this configuration. I would like to achieve a higher throughput of about 5000 msgs/sec (by adding brokers). I'm pretty clueless about how to achieve this without splitting individual queues onto individual ActiveMQ instances. What topologies support higher throughput than a single instance without splitting the queues among instances?
Adding a network of brokers might help. That is if you have a decent number of consumers and a decent number of producers connecting to different brokers.
If you have a single producer or a single consumer, all traffic will still go over one of the brokers, making it the bottleneck in any case. So, your actual setup of the servers using the AMQ broker is important.
You will also need to check what the bottleneck of your physical machines is. Is it I/O? CPU? Memory usage/heap size? Even link speed? Use OS tools together with VisualVM to track this down. Then you at least know what kind of server you need next.
In any case, some semi-manual load balancing is always possible over several nodes, whether you are using a network of brokers or not. Just make sure messages are routed through certain brokers depending on their content or whatnot. If you cannot distinguish between different message types in any logical way, you can do things like finding some integer number in the message (be it client IP, yesterday's temperature in Celsius or whatever), and compute that number modulo <num brokers>. Then route it to the destination you selected. Round robin is also an option. There is almost always a way to distribute the load in a logical way among several brokers.
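The "number modulo <num brokers>" and round-robin ideas look roughly like this in Python. The broker URLs are placeholders, and the function names are made up for illustration; the producer-side connection code is left out on purpose.

```python
import itertools
import zlib

# Placeholder broker URLs, not taken from the question.
BROKERS = [
    "tcp://broker-a:61616",
    "tcp://broker-b:61616",
    "tcp://broker-c:61616",
]


def pick_by_number(number, brokers):
    """Route by some integer found in the message, modulo the broker count."""
    return brokers[number % len(brokers)]


def pick_by_key(routing_key, brokers):
    """Route by a string key such as a client IP.

    zlib.crc32 is used because it gives the same value in every process,
    unlike the built-in hash() for strings.
    """
    return pick_by_number(zlib.crc32(routing_key.encode("utf-8")), brokers)


def round_robin(brokers):
    """Endless iterator that cycles through the brokers in order."""
    return itertools.cycle(brokers)
```

Keyed routing sends all messages with the same key to the same broker, which matters if consumers for that key only connect there. Round robin spreads load evenly but gives no such affinity.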

Real world example of Paxos

Can someone give me a real-world example of how Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
A simple example could be a banking application where an account is being modified through multiple sessions (e.g. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.
For example, take a MapReduce system where the master role is backed by 3 hosts: one is the master and the others are slaves. The procedure for choosing the master uses the Paxos algorithm.
Chubby, which Google's Bigtable relies on, also uses Paxos: The Chubby Lock Service for Loosely-Coupled Distributed Systems, Bigtable: A Distributed Storage System for Structured Data
The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.
The Coordinator is the node the transaction originated on.
Participants are the nodes that modified the database on behalf of the transaction.
Readers are nodes that executed code on behalf of the transaction but did not modify any state.
Acceptors are the nodes that log the state of the transaction.
The following steps are taken when performing a transaction commit:
Coordinator sends a PREPARE message to each Participant.
The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
Coordinator sends ACCEPT messages to Acceptors.
The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
Coordinator tells the user the commit succeeded.
Coordinator sends COMMIT messages to each Participant and Reader.
The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
Coordinator removes internal state and is now done.
This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application level would need to do is perform exception handling for deadlock conflicts. The other key to implementing a database at scale is concurrency, which is generally helped via MVCC (Multi-Version concurrency control).
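The ordering of the commit steps listed above can be written down as a message trace. This sketch only illustrates that ordering, with no failure handling; it is not Clustrix's implementation, and the function name, node names and the "COMMIT SUCCEEDED" label for the reply to the user are invented.

```python
def commit_flow(participants, readers, acceptors):
    """Return the (sender, message, receiver) sequence for one commit."""
    coordinator = "coordinator"
    trace = []
    trace += [(coordinator, "PREPARE", p) for p in participants]
    trace += [(p, "PREPARED", coordinator) for p in participants]
    trace += [(coordinator, "ACCEPT", a) for a in acceptors]
    trace += [(a, "ACCEPTED", coordinator) for a in acceptors]
    # The user is told once the acceptors have logged the decision,
    # before the COMMIT messages go out.
    trace.append((coordinator, "COMMIT SUCCEEDED", "user"))
    trace += [(coordinator, "COMMIT", n) for n in participants + readers]
    trace += [(n, "COMMITTED", coordinator) for n in participants + readers]
    return trace


for sender, message, receiver in commit_flow(["p1", "p2"], ["r1"], ["a1", "a2", "a3"]):
    print(f"{sender} -> {receiver}: {message}")
```

The detail worth noticing is that the outcome is durable as soon as the Acceptors have logged it, which is why the user can be answered before Participants and Readers have committed.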
Can someone give me a real-world example of how Paxos algorithm is used in a distributed database?
MySQL Group Replication uses a Paxos-based consensus protocol. This is why a highly available MySQL setup needs three servers. In contrast, a typical Postgres setup is a master-slave two-node configuration which isn't running Paxos.
I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka multi-Paxos) is optimally efficient in terms of messages: in a three-node cluster, in steady state, the leader accepts its own next value, transmits it to both of the other two nodes, and knows the value is fixed when it gets back one response. It can then put the commit message (the learning message) at the front of the next value that it sends.
A simple example could be a banking application where an account is being modified through multiple sessions (e.g. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first?
Yes, if you use a MySQL database cluster to hold the bank accounts, then Paxos is being used to ensure that the replicas agree with the master on the order in which transactions were applied to the customer bank accounts. If all the nodes agree on the order in which transactions were applied, they will all hold the same balances.
Operations on a bank account cannot be reordered without coming up with different balances that may violate the business rules of not exceeding your credit. The trivial way to ensure the order is to just use one server process that decides the official order simply based on the order of the messages that it receives. It can then track the balances of each bank account and enforce the business rules. Yet you don't want just a single server as it may crash. You want replica servers that are also receiving the credit and debit commands and agree with the master.
The challenge with having replicas that should hold the same balances is that messages may be lost and resent, and messages are buffered by switches that may deliver some of them late. The net effect is that if the network is unstable, it is hard to prove that fast replication protocols will never cause different servers to see the messages arrive in different orders. If that happens, you will end up with different servers in the same cluster holding different balances.
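A tiny example shows why the order matters once a business rule such as a credit limit is enforced. The function below is a hypothetical, deterministic "bank account state machine"; the command format and the rule of rejecting debits that exceed the limit are assumptions made for the illustration.

```python
def apply_commands(commands, balance=0, credit_limit=0):
    """Apply (kind, amount) commands in order.

    Debits that would push the balance below -credit_limit are rejected.
    Returns the final balance and the list of rejected commands.
    """
    rejected = []
    for kind, amount in commands:
        if kind == "credit":
            balance += amount
        elif kind == "debit":
            if balance - amount < -credit_limit:
                rejected.append((kind, amount))
            else:
                balance -= amount
    return balance, rejected


# Two replicas receive the same two commands in different orders.
replica_a = apply_commands([("debit", 100), ("credit", 100)], balance=50)
replica_b = apply_commands([("credit", 100), ("debit", 100)], balance=50)
print(replica_a)  # (150, [('debit', 100)])
print(replica_b)  # (50, [])
```

Both replicas ran the same code on the same commands, yet one ends with a balance of 150 and a rejected debit while the other ends with 50. Agreeing on a single order, which is what the consensus protocol provides, is what keeps them identical.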
You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master, one or more slaves, and the master waits until it has got acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented database vendors just created expensive hardware designed to have very high redundancy and reliability to run master-slave. What was revolutionary about Paxos is that it does work with commodity networking and without specialist hardware.
Since banking applications were profitable with expensive custom hardware it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware with built-in reliable networking that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking and Paxos will make it reliable rather than using specialist hardware.
Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used?
I wrote a blog about multi-Paxos being the original Paxos protocol. Simply put, in the case of choosing the order of transactions in a cluster, you want to stream the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog about Paxos for cluster replication the algorithm is very efficient in steady-state needing only one round trip between the master and enough nodes to have a majority which is one other node in a three node cluster. When there are crashes or network issues the algorithm is always safe but needs more messages. So to answer your question typical applications need multiple rounds of Paxos to establish the order of client commands in the cluster.
I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details to do cluster replication. So we can expect that people who are specifically trying to implement cluster replication would use Raft as it leaves nothing for the implementor to have to figure out for themselves.
So when might you use Paxos? It can be used to change the cluster membership of a cluster that is writing values based on a different protocol that can only be correct when you know the exact cluster membership. Corfu is a great example of that where it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.

When to use Paxos (real practical use cases)?

Could someone give me a list of real use cases of Paxos. That is real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the in-memory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same in-memory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correctly. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most? Then if one server had an error, the client would still get the correct state just by choosing the majority.
You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the user's actions. For example, if a user decided to raise by $50, your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web-client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the $50 raise, and your system will be inconsistent (in the sense that the client thinks that the $50 raise was made, while the surviving servers are oblivious of it).
Notice that a majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the $50 raise data. In this case, using a majority could be even worse: if you get a response from the two servers with the data, you'll think the $50 raise was performed. But if one of them fails, then you won't have a majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine, in a fault tolerant manner. Where "state machine" can be thought of as an algorithm that has some initial state, and it advances the state deterministically according to messages received from the outside (i.e., the web-client).
More properly, Paxos should be thought of as a distributed log. You can read more about it here: Understanding Paxos – Part 1
Update 2018:
MySQL High Availability uses Paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to write operations. Cassandra has no master node, so two conflicting operations can be issued concurrently at multiple nodes. When using the if-not-exists syntax, the Paxos algorithm is used to order operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. As long as a majority of Cassandra nodes are up, it will work. So if you define the replication factor of your keyspace to be 3 then 1 node can fail, if 5 then 2 can fail, etc.
For normal writes, Cassandra allows multiple conflicting writes to be accepted by different nodes that may be temporarily unable to communicate. In that case it doesn't use Paxos, so it can lose data when two writes occur at the same time for the same key. There are special insert-only data structures built into Cassandra that won't lose data.
Poker and Paxos:
As other answers note, poker is turn-based and has rules. If you allow one master and many replicas, then the master arbitrates the next action. Let's say a user first clicks the "check" button, then changes their mind and clicks "fold". Those are conflicting commands, and only the first should be accepted. The browser should not let them press the second button; it should disable it as soon as they press the first one. Since money is involved, the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game. Which replica can become master, and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strongly consistent service. We can use Cassandra to create a lease for the master node. The replicas can time out on the master and attempt to take the lease. As Cassandra is using Paxos, it is fault tolerant; you can still read or update the lease even if Cassandra nodes crash.
In the above example, the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast, as messages flow in one direction. When the master crashes, there may be race conditions among replicas trying to become the master. Using Paxos at that point gives you strong consistency on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.
Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper
Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)
In the case you describe, you're right, Paxos isn't really necessary: a single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a game like poker, where there's a strict turn order and a single active player, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.

Resources