Tried scouring the documentation, but I'm still uncertain about the CP Subsystem setup for my current situation.
We have a Hazelcast cluster spread across 2 data centers, each data center having an even number of members, say 4, but can have as many as double during rollout.
The boxes in each data center are configured to be part of a separate partition group => 2 data centers - 2 partition groups, with 4-8 members each at a snapshot in time.
What would be the best number to set as CP Subsystem member count, considering that one data center might be decoupled as part of BAU?
I initially thought of setting the count to 5, so that the Raft consensus group would normally include at least one box from each data center (rollover happens only for a short time during redeployment, so maybe it is not that big of a deal), but that might mean that consensus is not possible when one data center is decoupled. On the other hand, if I set a value smaller than the box count in one DC, say 3, what would happen if all the boxes in the consensus group were assigned to the same DC and that DC went away abruptly due to network conditions? These are mostly assumptions, since CP is a relatively new topic for me, so please correct me if I am wrong.
We prefer three datacenters, but sometimes a third is not available.
My team faced this same decision several years ago when expanding into a new jurisdiction. There were a lot of options; here are some. For each of these scenarios we tested extensively how the system behaved under network partitions.
Make a primary datacenter and a secondary datacenter
This is the option we ended up going with. We put 2/3 of the hosts in one datacenter and 1/3 in the secondary data-center. As much as possible, we weighted client traffic towards the primary datacenter. We also communicated with our customers about this preference so they could do the same if they wanted.
If the datacenter had multiple rooms, we made sure to have hosts spread across the different rooms to help mitigate power/network outages within the datacenter. At the minimum, we ensured the hosts are on different racks.
We also had multiple clusters and for each cluster we usually switched which datacenter was the primary and which was the secondary. We didn't do this in some jurisdictions with notorious power troubles.
Split half and half
It's up to the gods what happens when a datacenter goes down. This is why we chose the first option: we wanted the choice of what happens when each datacenter goes down.
Have a tie-breaker in a different region
Put a host in an entirely different region from the two datacenters. Most of the time the latency will be too high for this host to fully participate in making consensus decisions, but in the case of a network partition it can help move the majority to one of the partitions.
The tie-breaker host must be a part of the quorum and cannot be kicked out because of latency delays.
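The three layouts above can be compared with simple majority arithmetic. This is only a sketch: it assumes a strict-majority rule (as in Raft), the member counts per site are made-up illustrations, and the names `majority`, `survives_loss` and `layouts` are hypothetical, not part of any product's API.

```python
def majority(cp_member_count: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return cp_member_count // 2 + 1


def survives_loss(layout: dict[str, int], lost_site: str) -> bool:
    """True if the remaining sites still hold a majority of the members."""
    total = sum(layout.values())
    remaining = total - layout[lost_site]
    return remaining >= majority(total)


# Illustrative member counts per site (assumptions, not from the post).
layouts = {
    "primary/secondary": {"dc1": 2, "dc2": 1},
    "half and half": {"dc1": 2, "dc2": 2},
    "tie-breaker": {"dc1": 2, "dc2": 2, "region3": 1},
}

for name, layout in layouts.items():
    outcome = {site: survives_loss(layout, site) for site in layout}
    print(name, outcome)
```

With these numbers, the primary/secondary layout keeps a majority only when the secondary is lost, the even split loses its majority whichever datacenter goes down, and the tie-breaker layout tolerates the loss of any single site.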
Build a new datacenter
These things are very expensive, but it makes the durability story much nicer. Not always an option.
I have one scylla db cluster with 9 nodes and RF=3 using amazon AWS i3en.xlarge instance.
I'm curious if 3 i3en.3xlarge are much better than 9 i3en.xlarge.
Full disclosure - I work on the ScyllaDB project.
Theoretically, Scylla's shard-per-core architecture means that 16 4xlarges or 4 16xlarges should perform fundamentally the same. Each vCPU runs as an independent, shared-nothing shard doing its own thing, so how those shards are distributed across nodes is irrelevant.
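As a rough illustration of that equivalence for the two layouts in the question, here is the shard arithmetic. It assumes one shard per vCPU, as described above; the vCPU counts (4 for i3en.xlarge, 12 for i3en.3xlarge) come from AWS's published instance specs rather than from the question, so check them against current AWS documentation. `total_shards` is a hypothetical helper.

```python
def total_shards(node_count: int, vcpus_per_node: int) -> int:
    """Total shards in the cluster, assuming one shard per vCPU."""
    return node_count * vcpus_per_node


scale_out = total_shards(node_count=9, vcpus_per_node=4)   # 9 x i3en.xlarge
scale_up = total_shards(node_count=3, vcpus_per_node=12)   # 3 x i3en.3xlarge

print(scale_out, scale_up)
```

Both layouts end up with the same number of shards, which is why the difference shows up in operational factors rather than in raw shard count.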
However, in the real world, there are good reasons for scaling up, rather than scaling out. For example:
Larger nodes have better network guarantees from AWS.
Larger nodes have fewer noisy neighbor problems.
Managing a few nodes is generally easier than managing many nodes.
Generally speaking, our users have had better experiences with larger nodes. But the choice is yours.
I'm trying to understand what "farm" means in computing. If a cluster can be given as an example of a farm, then there must be something that defines a farm independently of what a cluster is.
How is it different than a grid?
Do these concepts have a general meaning, where talking about, for example, web servers is only one common scenario, or do they differ completely depending on the context? If so, what are the different meanings (or the most common ones)?
Also, should I be asking this somewhere else? (If so, my apologies).
I'll have a go:
A server farm is a group of servers, which may or may not be clustered, that together offer a higher computing capacity for a particular goal than an individual server. An example is a Web farm, which typically consists of a number of load-balanced Web servers, each with the same content and configurations.
While a farm may be clustered, typically a cluster consists of servers that are configured in a failover scenario, for instance active/passive, meaning that, say, 1/2 the servers are on active duty while the rest becomes activated only when the servers on active duty are no longer accessible. So when a server in the cluster crashes, it is not a problem because another server takes over.
A grid then is typically a configuration of computers that work together (in parallel) to solve a complicated task, such as organic chemistry computations or playing chess against the world champion. The work is divided into parts and each computer takes on a part and reports its findings to the main node which then synthesizes the final result.
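The divide, compute and combine pattern described for a grid can be sketched on a single machine. This is a toy stand-in, not a real grid framework: threads play the role of the separate computers, the "complicated task" is just a sum of squares, and `split`, `worker` and `run_grid` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def worker(chunk):
    """Stand-in for one computer's share of the work: a sum of squares."""
    return sum(x * x for x in chunk)


def run_grid(items, worker_count=4):
    chunks = split(items, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_results = list(pool.map(worker, chunks))
    # The "main node" synthesizes the final result from the partial ones.
    return sum(partial_results)


result = run_grid(list(range(1000)))
print(result)
```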
I hope this helps.
Some links:
Wikipedia definition of server farm: http://en.wikipedia.org/wiki/Server_farm
Wikipedia definition of computing cluster: http://en.wikipedia.org/wiki/Cluster_(computing)
Wikipedia definition of grid computing: http://en.wikipedia.org/wiki/Grid_computing
Which is better in terms of performance, 2 medium role instances or 4 small role instances?
What are the pro's and cons of each configuration?
The only real way to know if you gain advantage of using larger instances is trying and measuring, not asking here. This article has a table that says that a medium instance has everything twice as large as a small one. However in real life your mileage may vary and how this affects your application only you can measure.
Smaller roles have one important advantage - if instances fail separately you get smaller performance degradation. Supposing you know about "guaranteed uptime" requirement of having at least two instances, you have to choose between two medium and four small instances. If one small instance fails you lose 1/4 of your performance, but if one medium instance fails you lose half of performance.
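The fractions above follow directly from the instance count. A minimal sketch, assuming total capacity is split evenly across instances (`capacity_lost` is a hypothetical helper):

```python
from fractions import Fraction


def capacity_lost(instance_count: int, failed: int = 1) -> Fraction:
    """Fraction of total capacity lost when `failed` instances go down."""
    return Fraction(failed, instance_count)


print(capacity_lost(4))  # four small instances
print(capacity_lost(2))  # two medium instances
```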
Instances will fail if, for example, you have an unhandled exception inside Run() of your role entry point descendant. Sometimes something just goes badly wrong that your code can't handle, and it is better for the instance to restart. Not that you should deliberately aim for such failures, but you should expect them and take measures to minimize their impact on your application.
So the bottom line is - it's impossible to say which gets better performance, but uptime implications are just as important and they are clearly in favor of smaller instances.
Good points by #sharptooth. One more thing to consider: When scaling in, the fewest number of instances is one, not zero. So, say you have a worker role that does some nightly task for an hour, and it requires either 2 Medium or 4 Small instances to get the job done in that timeframe. When the work is done, you may want to save costs by scaling to one instance and let it run as one instance for 23 hours until the next nightly job. With a single Small instance, you'll burn 23 core-hours, and with a single Medium instance, you'll burn 46 core-hours. This thinking also applies to your Web role, but probably more-so since you will probably have minimum two instances to make sure you have uptime SLA (it may not be as important for you to have SLA on your worker if, say, your end user never interacts with it and it's just for utility purposes).
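The idle-cost comparison above is simple arithmetic. In this sketch the core counts (1 for Small, 2 for Medium) are the ones implied by the 23 versus 46 core-hour figures, and `idle_core_hours` is a hypothetical helper:

```python
# Cores per instance size, as implied by the 23 vs 46 core-hour figures.
CORES = {"small": 1, "medium": 2}


def idle_core_hours(size: str, idle_hours: int = 23, instances: int = 1) -> int:
    """Core-hours burned while scaled in to `instances` of the given size."""
    return CORES[size] * idle_hours * instances


print(idle_core_hours("small"), idle_core_hours("medium"))
```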
My general rule of thumb when sizing: Pick the smallest VM size that can properly do the work, and then scale out/in as needed. Your choice will primarily be driven by CPU, RAM, and network bandwidth needs (and don't forget you need network when moving data between Compute and Storage).
For a start, you won't get the guaranteed uptime of 99% unless you have at least 2 role instances; this allows one to die and be restarted while the other takes the load. Otherwise, it comes down to how much you want to pay and what specs you get on each. Having more than one role instance has not caused me any hassle; Azure hides the difficult stuff.
One other point that may be worth a mention: if you use four small roles, you could run two in one datacenter and two in another, and use Traffic Manager to route users to whichever is closer. This might give you some performance gains.
Two mediums will give you more room to cache data at the compute level, and serving more from cache rather than from SQL Azure is going to be faster.
Ideally, follow #sharptooth's advice and measure and test. This is all very subjective, and I second David: you want to start as small as possible and scale outwards. We run this way. You really want to design your app around sharding, as this fits the Azure model better than the traditional approach of just getting a bigger box to run everything on. At some point the bigger-box approach runs into limits, such as SQL Azure connection limits.
A tool like JMeter is your friend here and should help you test your app.
http://jmeter.apache.org/
I'm looking for a distributed message queue that will support millions of queues, with each queue handling tens of messages per second.
The messages will be small (tens of bytes), and I don't expect the queues to get very long--on the order of tens of messages per queue at maximum, but when the system is humming along, the queues should stay fairly empty.
I'm not sure how many nodes to expect in the cluster--probably depends on the specific solution, but if I had to guess, I would say ten nodes. I would prefer that queues were relatively resilient to individual node failures within the cluster, but a few lost messages here and there won't make me lose sleep.
Does such a message queue exist? Seems like most of the field is optimized toward handling hundreds of queues with high throughput. But what is SQS built on? Surely not magic.
Update:
By request, it may indeed help to shed light on my problem domain. (I'd left details out before so as not to muddy the waters.) I'm experimenting with distributed cellular automata, with an initial target of a million cells in simulation. In some CA models, it's useful to add an event model, so that a cell can send events to its neighbors. Hence, a million queues, each with one consumer and 8 or so producers.
Costs are a concern for now, as I'm funding the experiments myself. (Thus Amazon's SQS is probably out of reach.)
From your description, it looks like OMG's Data Distribution Service could be a good fit. It is related to message queueing technologies, but I would rather call it a distributed data management infrastructure. It is completely distributed and supports advanced features that give you a lot of control over how the data is distributed, by means of a rich set of Quality of Service settings.
Not knowing much about your problem, I could guess what an approach might be. DDS is about distributing the state of strongly-typed data-items, as structures with typed attributes. You could create a data-type describing the state of an automaton. One of its attributes could be an ID uniquely identifying the automaton in the system. If possible, that would be assigned according to a scheme such that every automaton knows what the ID's of its neighbors are (if they are present). Each automaton would publish its state as needed, resulting in a distributed data-space containing the current state of all automatons. DDS supports so-called partitioning of that data-space. If you took advantage of that, then each of the nodes in your machine would be responsible for a well-defined subset of all automatons. Communication over the wire would only happen for those automatons neighboring a different partition. Since automatons know the ID's of their neighbors, they would be able to query the data-space for the states of the automatons it's interested in.
It is a bit hard to explain without a whiteboard, but the end result would be a single instance (instances are a sort of very lightweight message queue) for most automatons, and two or three instances for those automatons at the border of a partition. If you had ten nodes and one million automata, then each node would have to be able to hold the administration for approximately a hundred thousand automata. I have seen systems of that scale, and larger, built with DDS, with tens of updates per second for each instance. The nice thing is that this technology scales well with the number of nodes, so you could bring down the resource load per node by adding more nodes.
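To make the ID scheme and partitioning idea a little more concrete, here is a sketch in plain Python. It does not use any DDS API. It assumes a 1000 x 1000 grid, row-major IDs, an 8-cell neighborhood, and ten nodes that each own a band of 100 rows; all names are hypothetical.

```python
WIDTH = HEIGHT = 1000             # one million cells
NODES = 10
ROWS_PER_NODE = HEIGHT // NODES   # each node owns a horizontal band of rows


def cell_id(row: int, col: int) -> int:
    """Row-major ID, so any cell can compute its neighbors' IDs."""
    return row * WIDTH + col


def neighbor_ids(cid: int) -> list[int]:
    """IDs of up to 8 surrounding cells (fewer at the edge of the grid)."""
    row, col = divmod(cid, WIDTH)
    ids = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            r, c = row + dr, col + dc
            if 0 <= r < HEIGHT and 0 <= c < WIDTH:
                ids.append(cell_id(r, c))
    return ids


def owner(cid: int) -> int:
    """Index of the node (partition) responsible for a cell."""
    return (cid // WIDTH) // ROWS_PER_NODE


def crosses_partition(cid: int) -> bool:
    """True if at least one neighbor lives on a different node."""
    return any(owner(n) != owner(cid) for n in neighbor_ids(cid))


print(owner(cell_id(250, 17)), crosses_partition(cell_id(99, 5)))
```

Under these assumptions only the cells in the first and last row of each band have neighbors on another node, so most updates would stay local to one node.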
If this is a research project, then you might even be able to use a commercial product without charge. Just google on dds research license.
Could someone give me a list of real use cases of Paxos. That is real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the inmemory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same inmemory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correct. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most. So if one server had an error the client will still get the correct state from just choosing the majority.
You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the user's actions. For example, if a user decided to raise by 50$, your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web-client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the 50$ raise, and your system will be inconsistent (in the sense that the client thinks that the 50$ raise was made, while the surviving servers are oblivious to it).
Notice that majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the 50$ raise data. In this case, using majority could even be worse: if you get a response from the two servers with the data, you'll think the 50$ raise was performed. But if one of them fails, then you won't have majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine, in a fault tolerant manner. Where "state machine" can be thought of as an algorithm that has some initial state, and it advances the state deterministically according to messages received from the outside (i.e., the web-client).
More properly, Paxos should be considered as a distributed log, you can read more about it here: Understanding Paxos – Part 1
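A toy sketch of the state-machine-plus-log idea: if every replica applies the same agreed log in the same order, they all end up in the same state. The consensus step itself is not implemented here; the log is simply assumed to have been agreed already, and `PokerHand`, `replay` and the command format are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PokerHand:
    """Toy state machine: the pot and the commands applied so far."""
    pot: int = 0
    history: list = field(default_factory=list)

    def apply(self, command: tuple) -> None:
        action, player, amount = command
        if action in ("raise", "call"):
            self.pot += amount
        self.history.append(command)


def replay(log: list) -> PokerHand:
    """Build a replica's state by applying the log from the start."""
    state = PokerHand()
    for command in log:
        state.apply(command)
    return state


# Assumed to be the output of consensus: one agreed order of commands.
agreed_log = [("raise", "alice", 50), ("call", "bob", 50), ("check", "alice", 0)]

replica_a = replay(agreed_log)
replica_b = replay(agreed_log)
stale_replica = replay(agreed_log[1:])  # never learned about the raise

print(replica_a == replica_b, replica_a.pot, stale_replica.pot)
```

The stale replica shows the failure described above: a replica that missed the raise computes a different pot, which is what agreeing on the log is meant to prevent.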
Update 2018:
Mysql High Availability uses paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to write operations. Cassandra has no master node, so two conflicting operations can be issued concurrently at multiple nodes. When using the if-not-exists syntax, the Paxos algorithm is used to order operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. As long as a majority of Cassandra nodes are up, it will work. So if you define the replication factor of your keyspace to be 3 then 1 node can fail, if 5 then 2 can fail, etc.
For normal writes, Cassandra allows multiple conflicting writes to be accepted by different nodes which may be temporarily unable to communicate. In that case it doesn't use Paxos, so it can lose data when two writes occur at the same time for the same key. There are special insert-only data structures built into Cassandra that won't lose data.
Poker and Paxos:
As other answers note, poker is turn-based and has rules. If you allow one master and many replicas, then the master arbitrates the next action. Let's say a user first clicks the "check" button, then changes their mind and clicks "fold". Those are conflicting commands; only the first should be accepted. The browser should not let them press the second button; it should disable it once they press the first. Since money is involved, the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game. Which replica can become master, and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strongly consistent service. We can use Cassandra to create a lease for the master node. The replicas can time out on the master and attempt to take the lease. As Cassandra is using Paxos, it is fault tolerant; you can still read or update the lease even if Cassandra nodes crash.
In the above example the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast, as messages flow in one direction. When the master crashes, there may be race conditions among replicas trying to become the master. Using Paxos at that point gives you strong consistency on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.
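The lease idea can be sketched with an in-memory stand-in for the strongly consistent store. This is not Cassandra's API: `LeaseStore` and `try_acquire` are hypothetical, and the class only works as a model because a single Python object trivially serializes its updates, which is the property Paxos provides across machines. Times are passed in explicitly to keep the example deterministic.

```python
class LeaseStore:
    """In-memory stand-in for a strongly consistent store holding one lease."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        """Grant the lease if it is free, expired, or already held by `node`."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False


store = LeaseStore()
first = store.try_acquire("master", now=0, ttl=10)        # master takes the lease
blocked = store.try_acquire("replica-1", now=5, ttl=10)   # lease still live
# The master crashes and stops renewing; the lease expires at t=10.
takeover = store.try_acquire("replica-1", now=12, ttl=10)
loser = store.try_acquire("replica-2", now=13, ttl=10)    # lost the race

print(first, blocked, takeover, loser, store.holder)
```

A live master would keep calling `try_acquire` before the lease expires to renew it; replicas that time out on heartbeats race for the lease, and only one of them wins.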
Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper
Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)
In the case you describe, you're right, Paxos isn't really necessary: a single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a game like poker, where there's a strict turn order and a single active player, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.