We have a small 25-node cluster running Slurm. The nodes are not all the same; they fall into several categories, from bigger, more powerful nodes down to small, weak ones.
For the most part, all of these nodes sit in one partition, and we use the various job request settings to specify which node(s) a job gets.
We also set the Weight option on all of the nodes, so that small jobs go to the small/weak nodes first and do not take up space on the bigger nodes.
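To give an idea, the weights look roughly like this in slurm.conf (node names, counts and sizes here are simplified placeholders, not our exact config):
# lower Weight is preferred, so small jobs should land on the small nodes first
NodeName=small[01-16] CPUs=8  RealMemory=32000  Weight=10
NodeName=big[01-09]   CPUs=64 RealMemory=512000 Weight=100
PartitionName=main Nodes=small[01-16],big[01-09] Default=YES State=UP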
And here is the problem: if the nodes are on (we use Slurm's power saving feature to switch unused nodes off), it works as expected, and a small job goes to a small node.
However, if the nodes are off (i.e. no node that could take the job is currently powered on), the assignment seems to ignore the Weight setting and the job seems to go wherever: a small job may end up assigned to, and powering on, a big node. It shows up most when some nodes are on but busy and other nodes are off.
Can someone shed some light on this?
Related
I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka. partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; Pacemaker's default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
Questions:
From the above I am a bit confused: why can we not just stop all cluster resources, as in the 6-node case? What is special about a two-node cluster?
You are correct that a two-node cluster can only have quorum when both nodes are in communication. Thus, if the cluster were to split, the resources would stop under the default behavior.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
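With the pcs shell, for example, that is a one-liner (crmsh has an equivalent property command):
# tell Pacemaker to keep running resources even when quorum is lost
pcs property set no-quorum-policy=ignore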
...But wait, what happens if the cluster communication is broken but both nodes are still operational? Won't they each consider their peer dead and both become active? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time, essentially preventing split-brain from ever occurring.
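A rough sketch of what that looks like with pcs (the fence agent and its options below are placeholders and depend entirely on your hardware; check pcs stonith list and pcs stonith describe before copying anything):
pcs property set stonith-enabled=true
pcs stonith list                      # see which fence agents are available
pcs stonith describe fence_ipmilan    # show the options a given agent takes
pcs stonith create fence-node1 fence_ipmilan ip="192.0.2.10" username="admin" password="secret" pcmk_host_list="node1"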
An excellent article further explaining STONITH and its importance was written by LMB back in 2010 here: http://advogato.org/person/lmb/diary/105.html
In my production environment, I have a two-node cluster (ES 2.2.0) and each node sits on a different physical box. Inside elasticsearch.yml, I have the following:
discovery.zen.minimum_master_nodes: 2
My question is: if one box goes down, can the other node continue to function normally and provide uninterrupted search services (index and search, write and read)?
If you have two nodes, each is master-eligible, and you have discovery.zen.minimum_master_nodes: 1, then if the network goes down and the two nodes don't see each other for a while, you'll get into a split brain situation because each node will elect itself as a master.
However, with a setting of 2, you have two possible situations:
if the non-master goes down, the other node will continue to function properly (since it is already master)
if the master goes down, the other won't be able to elect itself as the master (since it will wait for a second master-eligible node to be visible).
For this reason, with only two nodes, you need to choose between the possibility of a split brain (with minimum_master_nodes: 1) or a potentially RED cluster (with minimum_master_nodes: 2). The best way to overcome this is to include a third master-only node and then minimum_master_nodes: 2 would make sense.
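For example, on ES 2.x the third, master-only node could be configured with something along these lines in its elasticsearch.yml (host names here are placeholders):
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]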
Just try it out:
Start your cluster, bring down the master node, what happens?
Start your cluster, bring down the non-master node, what happens?
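To see which node currently holds the master role while you experiment, the cat APIs are handy (assuming a node listening on localhost:9200):
curl 'localhost:9200/_cat/master?v'
curl 'localhost:9200/_cat/nodes?v'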
The purpose of minimum master nodes is to maintain the stability of the cluster.
Say you have only 2 nodes in the cluster and minimum_master_nodes set to 2.
With that setting, the cluster expects 2 master-eligible nodes to be up in order to serve requests.
If one node goes down in such a 2-node cluster, the cluster effectively goes down as well.
First, this setting helps prevent split brains, the existence of two masters in a single cluster.
If you have two nodes, a setting of 1 will allow your cluster to keep functioning, but it doesn't protect against split brain. It is best to have a minimum of three nodes in such situations.
I have a 3 node cluster with minimum_master_nodes set to 2. If I shut down all nodes except the master, leaving one node online, the cluster is no longer operational.
Is this by design? It seems like the node that was the master should remain operational; instead I get errors like this:
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}
All the other settings are stock and I am using the aws cloud plugin.
Yes, this is intentional.
Split brain
Imagine a situation where the other 2 nodes were still running but couldn't communicate with the third node - you'd end up with two clusters, otherwise known as a "split brain".
As the two clusters could be updating and deleting data independently of each other, recovery would be very difficult - you wouldn't have a single source of truth for the data.
By setting minimum_master_nodes to (n/2)+1 (where n is the number of nodes, rounding the division down) you can prevent a split brain.
Single Node
If you know that the first two nodes have definitely died and are not coming back, you can set minimum_master_nodes to 1 on the remaining node (and also set it to 1 on the other nodes before you restart them).
There is also a "no master block" option that lets you control what happens when you don't have a valid cluster - e.g. you could make the remaining node read-only until the cluster is re-established.
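Concretely, the simplest way is to edit elasticsearch.yml on the surviving node and restart it; something along these lines (the no-master-block setting name is from my memory of the 2.x docs, so double-check it for your version):
# temporary setting on the surviving node, until the other nodes are rebuilt
discovery.zen.minimum_master_nodes: 1
# the "no master block" option mentioned above: "write" blocks writes but keeps reads working
discovery.zen.no_master_block: write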
Good Day
We have a 6-node Cassandra cluster with a replication factor of 3 on our keyspaces. Our applications make use of QUORUM, so we can survive the loss of a single node without it affecting the application.
Let's assume I lose 2 nodes at the same time. If my application were using a consistency level of ONE, it would have been fine and would have run without any issues, but we would like to keep the level at QUORUM.
My question is: if 2 nodes crash at the same time and I do a nodetool removenode for each of the crashed nodes, will the cluster then rebalance the data over the remaining 4 nodes (getting it back to 3 replicas), and once that is done, should my application be able to work again using QUORUM?
In the title you write RF=2, in the text RF=3. You did not specify the Cassandra version, or whether you are using single-token nodes or vnodes. A QUORUM consistency level with RF=3 means that 2 replicas must acknowledge a write/read before returning. It is possible that you face minimal or no issues even if 2 nodes die; it depends on how many common ranges (partitions) the dead nodes share.
Have a look at this distribution example, which is exactly like the one you describe: RF=3, 6 nodes.
using single tokens:
if you lose pairs like (1,4), (2,5) or (3,6), your cluster should allow all writes and reads with no issues. A good client will recognize that nodes are down and will stop using them as coordinators. Other situations, for example the loss of nodes (1,6), might lead to a situation in which any read/write of the E and F token ranges will fail (assuming an equal distribution, about 33% of read/write operations will fail).
using vnodes:
here the situation is slightly different and also depends on which pair you lose - if you repeat the worst scenario above and lose a pair of nodes like (1,6), only the B token range will be affected in read/write operations, since it is the only range shared between them.
That said, just to clarify the possible scenarios, here is your answer. nodetool removenode should be used as explained in this document. Use removenode IF AND ONLY IF you want to reduce the cluster size (here is what to do if you want to replace a dead node instead). Once you have done that, your application will start working again using QUORUM, since other nodes will become responsible for the partitions previously assigned to the dead nodes.
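In practice that means grabbing the Host ID of each dead node from nodetool status and feeding it to removenode, roughly like this (the host ID is a placeholder):
nodetool status                      # note the Host ID of each node reported as DN (down)
nodetool removenode <host-id-of-dead-node>
nodetool removenode status           # check progress while data is re-streamed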
If you are using the official DataStax Java Driver, you might want to let the driver temporarily fight your monsters by specifying a DowngradingConsistencyRetryPolicy.
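A minimal sketch of wiring that policy in with the 2.x/3.x driver (the contact point and keyspace name are placeholders):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

public class QuorumFallbackExample {
    public static void main(String[] args) {
        // Retry failed requests at a lower consistency level when not enough
        // replicas are alive to satisfy QUORUM.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")                              // placeholder contact point
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .build();
        Session session = cluster.connect("my_keyspace");                 // placeholder keyspace
        // ... run queries as usual; the policy only kicks in when a request
        // fails because not enough replicas responded
        session.close();
        cluster.close();
    }
}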
HTH,
Carlo
I am trying to find out whether a 3-node HA cluster is common practice. Most of the references on Google point to 2-node clusters, but I am not able to convince myself that an application requiring five nines can be implemented as a 2-node HA cluster on commodity hardware.
The reason is simple: if the machine hosting one node goes offline, there will be only one node left, without any backup.
To reduce dependency on the node that went offline, I think a 3-node cluster is a minimum requirement.
In order to give a factual answer, much more data would be required.
But from an anecdotal perspective, two nodes of commodity hardware are not nearly enough to give you five-nines with any level of reliability (or at least sleep-at-night comfort).
Most cluster diagrams are likely drawn with only two nodes for ease of explanation, "If A fails, B keeps working".
Given your five-nines requirement, however, and "commodity hardware", I would consider more than three nodes a requirement; perhaps as many as five or more.
Remember to allow for network, power and perhaps even geographical diversity if you are really after that kind of reliability.